B33080 Humanities Computing: Electronic Text
University of Antwerp, Campus Drie Eiken
Second Term 2005
Edward Vanhoutte
edward.vanhoutte@kantl.be

Week 4: XML theory and practice: valid XML - parsing/validating - TeixLite.
Monday 28 February
III. Week 4
→ write a DTD for the following document and save it as c:\lab\email.dtd
<?xml version="1.0" ?>
<email>
<header>
<subj>my dog</subj>
<date>Mon, 7 Feb 2004 13:57:24 +0100</date>
<from>"thedoglady@wanteddogs.org" &lt;thedoglady@wanteddogs.org&gt;</from>
<to>"edward.vanhoutte@kantl.be" &lt;edward.vanhoutte@kantl.be&gt;</to>
</header>
<body>
<open>Sir,</open>
<p>Last week I lost my dog and don't know where she is now.</p>
<p>Normally I keep it in a bag under my arm, but now she's gone.</p>
<p>Could you please check all your bags and look whether you can find my dog?</p>
<close>Very many thanks in advance</close>
<sign>
<name>Lady D. Og</name>
<address>
<addrLine>Department of lost dogs</addrLine>
<addrLine>Ministry of dogs and bags, Dogtown</addrLine>
<addrLine>thedoglady@wanteddogs.org</addrLine>
</address>
</sign>
</body>
</email>
<!ELEMENT email (header, body)>
<!ELEMENT header (subj, date, from, to)>
<!ELEMENT subj (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (open | p | ps | close | sign)*>
<!ELEMENT open (#PCDATA)>
<!ELEMENT p (#PCDATA)>
<!ELEMENT close (#PCDATA)>
<!ELEMENT ps (#PCDATA)>
<!ELEMENT sign (name | address)*>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (addrLine)+>
<!ELEMENT addrLine (#PCDATA)>
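The content models for body and sign above use a repeated choice, (a | b)*, which accepts the listed children in any order and any number, including none at all. If the order seen in the sample e-mail should be enforced, a stricter variant could look like the sketch below. The occurrence choices (optional greeting, at least one paragraph, and so on) are assumptions made for illustration, not part of the exercise solution.

```xml
<!-- Stricter alternative: a sequence (,) instead of a repeated choice (|) -->
<!-- ? = optional, + = one or more, * = zero or more -->
<!ELEMENT body (open?, p+, ps*, close?, sign?)>
<!ELEMENT sign (name, address?)>
```

The sample e-mail is still valid against these two declarations, but a body that starts with <close> or has no <p> at all is rejected.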
XML can be validated when we have:
- an XML document
- a DTD
- a validating parser
Make sure:
- the Doctype Declaration inside the XML document refers to the appropriate DTD and its path
- the parser can find an XML declaration (for NSGMLS, the SGML declaration for XML in xml.dcl)
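For the e-mail exercise, the Doctype Declaration could look like the sketch below. It assumes that email.xml and email.dtd sit in the same folder (C:\lab\ in this exercise), so that a relative system identifier is enough.

```xml
<?xml version="1.0"?>
<!-- The name after DOCTYPE must match the root element of the document -->
<!DOCTYPE email SYSTEM "email.dtd">
<email>
  <!-- header and body as in the exercise document -->
</email>
```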
Validating with NSGMLS (SP)
check whether the following files are in the bin folder of the program, and download from
- Specify where nsgmls can find the catalog file under Options in the toolbar of runsp2.
- Specify where nsgmls can find xml.dcl under Options in the toolbar of runsp2.
- http://www.kantl.be/ctb/vanhoutte/teach/hc2005.htm (week 4)
  - Copy email.xml to C:\lab\
  - Copy email.dtd to C:\lab\
  - Copy iso-lat1.ent, iso-lat2.ent, iso-num.ent & iso-pub.ent to C:\lab\
  - Validate with NSGMLS (SP)
- CD-ROM
  - Copy exx/email.xml to C:\lab\
  - Copy exx/email.dtd to C:\lab\
  - Copy exx/iso-lat1.ent, exx/iso-lat2.ent, exx/iso-num.ent & exx/iso-pub.ent to C:\lab\
  - Validate with NSGMLS (SP)
- Validate with Open XML Editor
- Hijack your neighbour's computer
- Sneak some mistakes in his/her email.xml document
- Let him/her validate and look for the mistakes
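The most instructive mistakes are those that leave the document well-formed but break the rules of the DTD, because only a validating parser reports them. A hypothetical example, assuming the email.dtd from the exercise above:

```xml
<header>
  <date>Mon, 7 Feb 2004 13:57:24 +0100</date>
  <subj>my dog</subj>  <!-- wrong order: the DTD requires subj before date -->
  <from>...</from>
  <!-- to is missing: header requires subj, date, from and to -->
</header>
```

Every tag here is properly opened and closed, so a non-validating parser accepts the fragment without complaint.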
V. TEI
Text Encoding Initiative (TEI)
provides DTD subsets and elements for the encoding of:
- texts in any natural language
- of any date
- in any literary genre or text type
- without restrictions on form or content
The TEI Guidelines treat both continuous materials ('running text') and discontinuous materials such as dictionaries and linguistic corpora.
Text Encoding Initiative: a pizza model
The TEI has produced a number of DTD subsets which can be combined according to the needs of a particular project in the humanities.
"All pizzas have some ingredients in common (cheese and tomato sauce); in Chicago, at least, they may have entirely different forms of pastry base, with which (universally) the consumer is expected to make his or her own selection of toppings."
- Core tag sets (cheese and tomato): define mandatory elements for all document types.
- Base tag sets (the pastry): define the structural components of a document.
- Additional tag sets (toppings): can occur in all document type classes, but define specialised tag sets which can be combined according to taste.
→ Core tag sets are always required and contain the teiHeader DTD and the elements available in all TEI documents.
→ Base tag sets define the basic building blocks of different text types. The following selections are available:
- Prose: this tagset is suitable for most documents most of the time;
- Verse: this tagset adds specialist tagging for metrical analysis, rhyme-scheme etc. to the basic verse markup already included in the core;
- Drama: this tagset adds specialist tagging for cast lists, records of first performance, etc. to the basic drama markup already included in the core;
- Speech: this tagset replaces the basic structure by one suitable for linguistic analysis of speech acts, etc.;
- Dictionaries: this tagset replaces the basic structure with one containing detailed lexicographic features;
- Terminology: this tagset replaces the basic structure with one specific to terminological databases;
- General base: this tagset allows you to combine tags from different base tagsets, with the proviso that any single text division can contain tags from only one of the base tagsets you choose from the following list: prose, verse, drama, spoken texts, dictionaries, terminology;
- Mixed base: this tagset allows you to combine tags from different base tagsets, with no restriction at all as to where tags from different base tagsets can appear. The different tagsets to combine are: prose, verse, drama, spoken texts, dictionaries, terminology.
→ Additional tag sets are optional and may be selected as needed:
- linking: adds elements for hypertext linking, segmentation, and alignment;
- figures: adds elements for encoding tables, pictures, and formulae;
- analysis: adds elements for interpretation and simple linguistic analyses;
- fs: adds elements for feature structure analysis;
- certainty: adds elements for recording uncertainty and responsibility;
- transcr: adds elements for the transcription of primary sources (e.g. manuscripts);
- textcrit: adds elements for text-critical apparatus;
- names.dates: adds elements for the detailed tagging of names and dates;
- nets: adds elements for recording the abstract structure of mathematical graphs, networks, and trees;
- corpora: adds specialized elements to the TEI-header for use with language corpora.
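In the full TEI P4 DTD, the pizza is assembled with parameter entities in the DTD subset of the document. The sketch below selects the prose base with two toppings. The system identifier tei2.dtd is an assumption about where the DTD is installed; TEI Lite users do not need this step, because teixlite.dtd is already a fixed selection.

```xml
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [
  <!ENTITY % TEI.XML     "INCLUDE">  <!-- XML rather than SGML version -->
  <!ENTITY % TEI.prose   "INCLUDE">  <!-- base tag set: the pastry -->
  <!ENTITY % TEI.linking "INCLUDE">  <!-- additional tag set: a topping -->
  <!ENTITY % TEI.figures "INCLUDE">  <!-- additional tag set: a topping -->
]>
```

The core tag sets need no entity of their own, since they are always included.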
TEI Lite
- one of many possible views of the TEI DTD
- small and simple
- 20% of the tags, 80% of the projects
- 121 elements
- devised as a didactic stepping stone to the full TEI, but took on a life of its own
- realistic for existing texts and for document production
TeixLite: the XML-compatible version of TEI Lite
→ "TEI U5: Encoding for Interchange: an introduction to the TEI."
A TeixLite document is an XML document which refers to a DTD, so:
<?xml version="1.0"?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI//DTD TEI Lite XML ver. 1//EN"
"teixlite.dtd"
[
]>
→ root element = <TEI.2>, so:
<?xml version="1.0"?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI//DTD TEI Lite XML ver. 1//EN"
"teixlite.dtd"
[
]>
<TEI.2>
...
</TEI.2>
Copy teixlite.dtd to C:\lab\
A TEI-conformant document
comprises a header followed by a text
<TEI.2>
<teiHeader>...</teiHeader>
<text>...</text>
</TEI.2>
Copy teixlite.clb to C:\program files\NoteTab Light\Libraries\
The header is essential for:
- bibliographic control and identification
- resource documentation and
- processing (see later)
The TEI Header is introduced by the element <teiHeader> and has 4 major parts, only the first of which is mandatory:
- file description <fileDesc>: contains a full bibliographic description of an electronic file, including information about the sources from which the electronic text was derived. Essential for bibliographic referencing and cataloguing.
- encoding description <encodingDesc>: documents the relationship between an electronic text and the source or sources from which it was derived. It allows for documenting detailed information about transcription/transliteration principles such as normalization, the treatment of quotations and hyphenation and the levels of interpretation i.e. analytic tagging and encoding applied to the document.
- profile description <profileDesc>: provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their settings.
- revision description <revisionDesc>: summarizes the revision history for a file, which is important for version control and for resolving questions about the history of a file, especially when a team of scholars is working on the same document.
The full form of a TEI Header is thus:
<teiHeader>
<fileDesc> ... </fileDesc>
<encodingDesc> ... </encodingDesc>
<profileDesc> ... </profileDesc>
<revisionDesc> ... </revisionDesc>
</teiHeader>
While a minimal header takes the form:
<teiHeader>
<fileDesc> ... </fileDesc>
</teiHeader>
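Even a minimal <fileDesc> has required parts of its own: a title statement, a publication statement and a source description. The sketch below fills them with placeholder prose; the wording inside the elements is invented for illustration.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Sample text: an electronic transcription</title>
    </titleStmt>
    <publicationStmt>
      <p>Classroom exercise, not published.</p>
    </publicationStmt>
    <sourceDesc>
      <p>No source: created in electronic form.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```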
<text>: A text may be unitary or composite
A unitary text contains
- <front>: front matter
- <back>: back matter
- <body>: a body
<text>
<front>...</front>
<body>...</body>
<back>...</back>
</text>
In a composite text, the body is a
- <group>: group of texts (or nested groups)
<group>
<text>...</text>
<text>...</text>
<text>...</text>
</group>
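Put together, a composite text might be laid out as in the following sketch, for instance an anthology whose member texts each have their own body. The anthology scenario and the comments are illustrative assumptions.

```xml
<text>
  <front> <!-- front matter for the whole collection --> </front>
  <group>
    <text>
      <front> <!-- optional front matter of the first text --> </front>
      <body> <!-- first text --> </body>
    </text>
    <text>
      <body> <!-- second text --> </body>
    </text>
  </group>
  <back> <!-- back matter for the whole collection --> </back>
</text>
```

The <group> takes the place that <body> occupies in a unitary text, and each member <text> is itself structured like a unitary text.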
Front matter (<front>) holds preliminary material such as title pages, prefatory epistles, etc.:
<front>
<titlePage>
<docTitle>
<titlePart type="main">
...
</titlePart>
</docTitle>
<docAuthor>...</docAuthor>
<docDate>...</docDate>
<docEdition>...</docEdition>
<docImprint>...</docImprint>
<epigraph>...</epigraph>
</titlePage>
</front>
<titlePage>
<docTitle>
<titlePart>
<title level="m" type="main">DE TELEURGANG VAN DEN WATERHOEK</title>
</titlePart>
</docTitle>
<titlePart>DOOR</titlePart>
<docAuthor>STIJN STREUVELS</docAuthor>
<docImprint>UITGAVE "EXCELSIOR" — BRUGGE</docImprint>
<docImprint>AMSTERDAM, L. J. VEEN, UITGEVER.</docImprint>
</titlePage>
Structure of a TEI Document
A text usually has divisions
- generic, hierarchic subdivisions
- vanilla or numbered
- type attribute
- associated <head> and <trailer>
- <div>, <div0>, <div1>, <div2>, <div3>, <div4>, <div5>, <div6>, <div7>
<text>
<front> <!-- titlepage etc here --> </front>
<body>
<div1 type="book" n="1" id="b0100">
<head>Book 1</head>
<div2 type="chapter" n="1" id="b0101">
<head>Chapter 1</head>
<!-- rest of the chapter -->
</div2>
<div2 type="chapter" n="2" id="b0102">
<head>Chapter 2</head>
<!-- rest of the chapter -->
</div2>
</div1>
</body>
</text>
Global attributes, applicable to all elements:
- id for unique identification
- n for (non-unique) name or number
- rend for rendition (appearance)
- lang for language and hence writing-system
→ Extensible, like other classes
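A short illustration of id, n and rend on ordinary elements. The identifier scheme (ch03, ch03p01) and the rend value are made-up conventions, since the TEI leaves both to the encoder; lang is left out here because its permitted values depend on the DTD in use.

```xml
<!-- id must be unique within the document; n need not be -->
<div1 type="chapter" id="ch03" n="3">
  <head>Chapter 3</head>
  <p id="ch03p01" n="1" rend="italic">First paragraph, printed in italics in the source.</p>
</div1>
```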
Text components in TEI Lite
What are divisions made of?
- Prose is mostly paragraphs (<p>)
- Verse is mostly lines (<l>), sometimes in hierarchic groups (<lg>)
- Drama is mostly speeches (<sp>) containing <p> or <l> and interspersed with stage directions (<stage>)
These may be mixed, and may also appear directly within undivided texts.
<p>Initially launched in 1987, the TEI is an international and interdisciplinary
standard that helps libraries, museums, publishers, and individual scholars represent
all kinds of literary and linguistic texts for online research and teaching, using an
encoding scheme that is maximally expressive and minimally obsolescent.</p>
<p>For current membership of the TEI Consortium, please check
the members list.</p>
<lg type="poem">
<head>Poor Edward</head>
<lg type="stanza">
<l>Did you hear the news about Edward?</l>
<l>On the back of his head he had another face</l>
<l>Was it a woman's face or a young girl?</l>
<l>They said to remove it would kill him</l>
<l>So poor Edward was doomed</l>
</lg>
<lg type="stanza">
<l>The face could laugh and cry</l>
<l>It was his devil twin</l>
<l>And at night she spoke to him</l>
<l>Things heard only in hell</l>
<l>But they were impossible to separate</l>
<l>Chained together for life</l>
</lg>
<lg type="stanza">
<l>Finally the bell tolled his doom</l>
<l>He took a suite of rooms</l>
<l>And hung himself and her from the balcony irons</l>
<l>Some still believe he was freed from her</l>
<l>But I knew her too well</l>
<l>I say she drove him to suicide</l>
<l>And took poor Edward to hell</l>
</lg>
</lg>
<stage>Enter Beatrice</stage>
<sp who="Beatrice"><l>Against my will I am sent to bid you come in to dinner.</l></sp>
<sp who="Benedick"><l>Fair Beatrice, I thank you for your pains.</l></sp>
<sp who="Beatrice"><l>I took no more pains for these thanks,
than you took pains to thank me; if it had
been painful I would not have come.</l></sp>
<sp who="Benedick"><l>You take pleasure, then, in the message?</l></sp>
<sp who="Beatrice"><l>Yea, just so much as you may take
upon a knife's point, and choke a daw withal
- You have no stomach, signior; fare you well.</l></sp>
<stage>Exit.</stage>
Enter Beatrice
Beatrice: Against my will I am sent to bid you come in to dinner.
Benedick: Fair Beatrice, I thank you for your pains.
Beatrice: I took no more pains for these thanks,
than you took pains to thank me; if it had
been painful I would not have come.
Benedick: You take pleasure, then, in the message?
Beatrice: Yea, just so much as you may take
upon a knife's point, and choke a daw withal
- You have no stomach, signior; fare you well.
Exit.
→ Much Ado About Nothing, 2.3
- <pb />: page break
- <lb />: line break
Whan that Aprill with his shoures soote<lb />
The droghte of March hath perced to the roote,<lb />
And bathed every veyne in swich licour<lb />
Of which vertu engendred is the flour;<lb />
<pb ed="riverside" n="23" />
Whan Zephirus eek with his sweete breeth<lb />