Encoding texts for humanities research:
History, method, and tools

University College London

4 December 2008

Edward Vanhoutte

edward.vanhoutte@kantl.be

Royal Academy of Dutch Language and Literature



I. Introduction

How I could have appeared

(c) Edward Vanhoutte

Who is Edward Vanhoutte?

  • Director of Research in the Royal Academy of Dutch Language and Literature in Belgium
  • Director of the Centre for Scholarly Editing and Document Studies (CTB) http://www.kantl.be/ctb/
  • Manages ca. 10 electronic projects in the humanities.
  • Associate Editor of LLC. The Journal of Digital Scholarship in the Humanities http://llc.oxfordjournals.org/
  • Former member of the TEI technical council
  • Editor of the DALF guidelines for the description and encoding of modern manuscript material (with Ron Van den Branden)
  • Has taught courses on XML and Electronic textual editing in Europe and Africa
  • Blog The Mind Tool: Musings and research notes on Humanities Computing http://edwardvanhoutte.blogspot.com
  • URI: http://www.edwardvanhoutte.org
  • email: edward.vanhoutte@kantl.be

II. Objectives

Objectives

After this lecture you should be able to
  • Explain the difference between the two flavours of XML
  • Know what a DTD is
  • Know what a Schema is
  • Explain what TEI stands for
  • Assess the value of TEI for your project
  • Discuss the importance of text encoding for the humanities

III. XML, DTD, validation

Overview

  • Part 1: Theory
    • XML: recapitulation
    • DTD and Schema
    • TEI
    • TEI Consortium
    • TEI Lite
  • Part 2: Demonstration
    • Electronic scholarly editions
    • Digital Archive of Letters in Flanders
    • James Joyce Timeline: Finnegans Wake Notebooks
    • TEI by Example

Who owns a bike?

  • Can you describe it?
  • What make?
  • What do you use it for?

French bike

Nigerian bike

Portuguese bike

Estonian bike

Vietnamese bike

American bike

Old bike

Bike of the future

Application: Where can a bike be used?

  • Geographically
  • Topographically

Application: When can a bike be used?

  • Time
  • Age
  • Purpose

Use: Who can use a bike?

  • Gender
  • Race
  • Religion
  • Ability

Conclusion

The concept 'bicycle' is generally understood as naming a general means of transportation that can be used
  • by everyone
  • everywhere
  • for every purpose

Even this is recognized as a bike


Who owns a text?

  • Can you describe it?
  • What make?
  • What do you use it for?

French text

Portuguese text

Estonian text

Vietnamese text

American text

Old text

Modern texts

Application: Where can a text be used?

  • Geographically
  • Topographically

Application: When can a text be used?

  • Time
  • Age
  • Purpose

Use: Who can use a text?

  • Gender
  • Race
  • Religion
  • Ability

Conclusion

The concept 'text' is generally understood as naming a general means of transportation of language that can be used
  • by everyone
  • everywhere
  • for every purpose

Even this is recognized as a text


Concept: What constitutes a bike?


What constitutes a text?


BUT

  • Proprietary
  • Incompatibility
  • Paying for licences
  • Procedural markup

If we want to be able to ride on the right and the left side of the road

  • Non-proprietary
  • Compatibility
  • Open Source & free
  • Descriptive markup
→ XML

Why would you want to learn about XML?

→ XML is not the end of the world, and won't solve all your problems, but:
  • It's a good approximation
  • It works (fairly well)
  • It's widely supported
  • It's a W3C recommendation

→ XML will not live forever, but:
  • Migration will be supported
  • Migration will create more jobs

→ It's fun.

What does an XML document consist of?

Five essential components

  • Processing Instructions
    • <?xml version="1.0" ?>
    • <?xml-stylesheet href="../dtd/xsltslides.xsl" type="text/xsl" ?>

  • Elements
    • <title> or </title>
    • <empty />
    • → XML is case sensitive in the naming of the elements.
      <title> is not <TITLE> is not <Title> is not <tItle> etc.

  • Attributes (optional)
    • <title type="journal" >
    • <name who="Edward" reg="VanhoutteE" >

  • Entity References
    • to represent characters which cannot reliably be typed in (ISO 8859-1: IsoLat1, ISO 10646 - Unicode): &eacute; = &#233; = é
    • as a shortcut for boilerplate text: &tomatorelish; = reference to an external recipe (text) for tomato relish
    • containers for external (non-XML) data such as graphics: <figure entity="ascii-full" />
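A small sketch of how a parser treats these references, using Python's standard-library ElementTree parser (the choice of parser and the sample strings are assumptions for illustration, not part of the lecture). A numeric character reference works without any declaration, whereas a named entity such as &eacute; has to be declared first, normally in the DTD.

```python
import xml.etree.ElementTree as ET

# A numeric character reference needs no declaration: &#233; is é.
para = ET.fromstring("<p>caf&#233;</p>")
print(para.text)

# A named entity must be declared; without a declaration the parser
# reports an error instead of guessing what &eacute; means.
try:
    ET.fromstring("<p>caf&eacute;</p>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)

# Declaring the entity in an internal DTD subset makes the reference legal.
declared = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>caf&eacute;</p>'
)
print(declared.text)
```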

  • CDATA:
    • allows you to include application code (JavaScript, Perl, BASIC, etc.) in an XML document without having to escape markup characters such as < and &. The XML processor passes the content of a CDATA section through verbatim, without interpreting it as markup
    • <![CDATA[This text escapes the XML processor]]>
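To see the effect, here is a minimal sketch with Python's standard-library ElementTree parser; the <script> element name and the code fragment inside it are made up for the example. The same characters that are harmless inside a CDATA section make the document ill-formed outside one.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, "<" and "&" are plain characters, not markup.
source = "<script><![CDATA[if (a < b && b > 0) { run(); }]]></script>"
script = ET.fromstring(source)
print(script.text)

# The same content without the CDATA wrapper is rejected by the parser.
try:
    ET.fromstring("<script>if (a < b && b > 0) { run(); }</script>")
    raw_ok = True
except ET.ParseError:
    raw_ok = False
print(raw_ok)
```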

XML comes in 2 flavours

  1. Well-formed XML
  2. Valid XML

Well-formed XML

  1. XML is case sensitive
  2. At least one element
  3. There is always a root element
  4. All logical and physical structures nest properly
  5. Correspondence of element names in start- and end-tags
  6. Attribute names only appear once in a start-tag
  7. Attribute values are quoted
  8. Attribute values do not refer to external entities
  9. Entities are declared
  10. No entity reference contains the name of an unparsed entity
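Several of these rules can be tried out with any XML parser, since a conforming parser must refuse a document that is not well-formed. The sketch below assumes Python's standard-library ElementTree; the helper name is_well_formed and the sample strings are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = {
    "matching tags": "<title>Ulysses</title>",
    "case mismatch": "<title>Ulysses</TITLE>",         # rules 1 and 5
    "bad nesting": "<p><hi>text</p></hi>",             # rule 4
    "duplicate attribute": '<name who="a" who="b"/>',  # rule 6
    "unquoted attribute": "<title type=journal/>",     # rule 7
    "two roots": "<a/><b/>",                           # rule 3
}
results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first sample passes; the others each break one of the rules listed above.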

Valid XML

  • Well-formed XML
  • Conforms to a Document Type Definition (DTD)
  • Or to a valid Schema

Document Type Definition

→ The formal specification for the structure of an XML document
<?xml version="1.0" ?>
<greeting>
    <salutation>hello</salutation>
    <target>world</target>
</greeting>
               

DTD:
<!ELEMENT greeting (salutation, target)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT target (#PCDATA)> 

A DTD specifies the vocabulary and the syntax of a markup language
It defines:
  • names for all your elements
  • names and default values for their attributes
  • rules about how elements can nest
  • names for re-usable pieces of data (entities)
  • and a few other things
A DTD does not specify anything about what elements "mean"

Defining an element in a DTD

<!ELEMENT name contentModel>
               
  • name is the name of the element (GI)
  • contentModel defines valid content for the element:
    • #PCDATA
    • EMPTY
    • other elements
    • mixed content combining PCDATA and other elements
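What a validating parser does with a content model can be imitated in a few lines. The sketch below is not a DTD parser: it assumes the content models of the greeting DTD have been translated by hand into regular expressions over the sequence of child element names, and the names CONTENT_MODELS and conforms are invented for the example. It also does not distinguish #PCDATA from EMPTY.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models for the greeting DTD.
# Each child is written as "name " so a sequence can be matched as a string.
CONTENT_MODELS = {
    "greeting": r"salutation target ",  # (salutation, target)
    "salutation": r"",                  # (#PCDATA): no child elements
    "target": r"",                      # (#PCDATA): no child elements
}

def conforms(element) -> bool:
    """Check an element and its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # element is not declared at all
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        return False  # children do not follow the content model
    return all(conforms(child) for child in element)

valid = ET.fromstring(
    "<greeting><salutation>hello</salutation><target>world</target></greeting>"
)
swapped = ET.fromstring(
    "<greeting><target>world</target><salutation>hello</salutation></greeting>"
)
print(conforms(valid), conforms(swapped))
```

Both documents are well-formed, but only the first one respects the order that the content model prescribes.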


email DTD

<!ELEMENT email (header, body)>
<!ELEMENT header (subj, date, from, to)>
<!ELEMENT subj (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (open | p | ps | close | sign)*>
<!ELEMENT open (#PCDATA)>
<!ELEMENT p (#PCDATA)>
<!ELEMENT close (#PCDATA)>
<!ELEMENT ps (#PCDATA)>
<!ELEMENT sign (name | address)*>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (addrline)+>
<!ELEMENT addrline (#PCDATA)> 

Theoretical problem with the DTD

  • DTDs are written in a formal syntax of their own, not in XML.
→ XML Schema
  • A self-declarative way of documenting the formal specification for the structure of an XML document
  • Expressed in XML itself
Schemas can deal with namespaces; DTDs cannot.
An XML namespace is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names.
e.g.: <table> in doc1 is not necessarily the same as <table> in doc2
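A namespace-aware parser makes the distinction visible. In this sketch (Python's standard-library ElementTree; both namespace URIs are invented for the example) the two <table> elements share a local name but end up with different qualified names.

```python
import xml.etree.ElementTree as ET

# Two vocabularies that both define a <table> element.
# The namespace URIs are hypothetical.
doc = ET.fromstring(
    '<doc xmlns:h="http://example.org/html" '
    'xmlns:f="http://example.org/furniture">'
    "<h:table/><f:table/></doc>"
)

# ElementTree reports each element name as {namespace-URI}local-name.
tags = [child.tag for child in doc]
for tag in tags:
    print(tag)
```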

Problems with Schema

→ There are many schema languages around:
  • W3C XML Schema: http://www.w3.org/XML/Schema
  • Relax NG: http://www.relaxng.org
  • XML-Data (XDR)
  • Document Content Description (DCD)
  • Schema for Object-oriented XML (SOX)
  • Document Definition Markup Language (DDML)
  • Schematron
  • Datatypes for DTDs (DT4DTD)
  • Document Structure Description (DSD)
  • Regular Language Description for XML (RELAX)
  • TREX (Tree Regular Expressions for XML)
  • Examplotron
  • Hook
  • Document Schema Definition Language (DSDL)
  • STEP/EXPRESS and XML

Book DTD


<!ELEMENT   book    (chapter+)>
<!ELEMENT   chapter (section+)>
<!ELEMENT   section (p+)>
<!ELEMENT   p       (#PCDATA)>

               

Book Schema (Relax NG Simple notation)


<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
>

 <start>
      <ref name="book" />
  </start>

  <define name="book">
    <element>
      <name>book</name>
      <oneOrMore>
        <ref name="chapter" />
      </oneOrMore>
    </element>
  </define>

 <define name="chapter">
    <element>
      <name>chapter</name>
      <oneOrMore>
        <ref name="section" />
      </oneOrMore>
    </element>
 </define>

 <define name="section">
    <element>
      <name>section</name>
      <oneOrMore>
        <ref name="p" />
      </oneOrMore>
    </element>
 </define>

 <define name="p">
    <element>
      <name>p</name>
      <text />
    </element>
 </define>

</grammar>

               

DTD and Schema


<!ELEMENT   book   (chapter+)>
                        

<define name="book">
   <element>
      <name>book</name>
      <oneOrMore>
         <ref name="chapter" />
      </oneOrMore>
   </element>
</define>
                        

<!ELEMENT   chapter   (section+)>
                        

<define name="chapter">
   <element>
      <name>chapter</name>
      <oneOrMore>
         <ref name="section" />
      </oneOrMore>
   </element>
</define>
                        

<!ELEMENT   section   (p+) >
                        

<define name="section">
   <element>
      <name>section</name>
      <oneOrMore>
         <ref name="p" />
      </oneOrMore>
   </element>
</define>
                        

<!ELEMENT  p   (#PCDATA)>
                        

<define name="p">
   <element>
      <name>p</name>
      <text />
   </element>
</define>

                        

Why a DTD/Schema?

  • A DTD/Schema is very useful at data preparation time: validating editors only allow you to input correct markup
  • Useful for consistent encoding in projects
  • Guarantees longevity and interchangeability of semantics and structure in encoded texts
  • We need it for validation
  • Useful for software development and operability

Valid XML

A valid XML document will contain a Document Type Declaration
The Document Type Declaration contains the reference to a Document Type Definition (DTD) or a Schema
<!DOCTYPE TEI.2 PUBLIC "-//TEI//DTD TEI Lite 1.0//EN"
 "../dtd/teixlite.dtd" [ ]>
               

Validating XML

XML can be validated when we have:
  • an XML document
  • a DTD or Schema
  • a validating parser
Make sure:
  • the Doctype Declaration inside the XML document refers to the appropriate DTD or Schema and its path
  • the parser can find an XML declaration

Writing DTDs or Schema

→ Bad news
  • Difficult
  • Time Consuming
  • Expensive

→ Good news
  • The work has been done for you

IV. TEI

Text Encoding Initiative (TEI)

provides DTD subsets, Schema modules, and elements for the encoding of:
  • texts in any natural language
  • of any date
  • in any literary genre or text type
  • without restrictions on form or content
They treat both continuous materials ('running text') and discontinuous materials such as dictionaries and linguistic corpora.

The use of computers in the humanities

  • 1949-ca. 1970:
    • ad hoc programming in projects
    • higher programming languages
  • ca. 1970-1985:
    • method oriented program packages
    • SPSS (Statistical Package for the Social Sciences)
    • OCP (Oxford Concordance Program)
  • 1985-1997
    • PC revolution
    • standard software
    • dBase, Access
    • SGML ISO 8879:1986
  • 1997-now
    • web / XML orientation
    • computer as presentational medium
    • web tool

The Text Encoding Initiative (TEI)

11-12 November 1987: Vassar College, Poughkeepsie (NY)
→ 32 humanities scholars
Principles:
  • Platform-independent
  • Software-independent
  • Endurability
  • Re-usability
  • Accessibility
  • Language-independent
  • For all of the Humanities disciplines
→ SGML ISO 8879:1986
→ Sperberg-McQueen, C.M. and Burnard, L. (eds.) (2002). TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML Version. Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium.
→ TEI Consortium (eds.) (2007). TEI P5. Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/Guidelines/P5/

TEI Short History

  • 1987: Vassar College, Poughkeepsie
  • 1990: P1 : SGML
  • 1992: P2 : SGML
  • 1994: P3 : SGML : 600 + elements
  • 1995: TeiLite : SGML : 131 elements
  • 1999: P3rev : SGML
  • 2000: TEI Consortium
  • 2001: P4 : XML comp
  • 2001: TeixLite : XML
  • 2005: P5 : XML
  • 2007: P5 version 1

SGML - XML


Markup Principles

From the Poughkeepsie Principles the TEI concluded that the TEI Guidelines should:
  • Provide a standard format for data interchange;
  • Provide guidance for encoding of texts in this format;
  • Support the encoding of all kinds of features of all kinds of texts studied by researchers;
  • Be application independent.

What does the TEI offer you?

The TEI has produced a number of DTD subsets/Schema fragments which can be combined according to the needs for a particular project in the humanities.
  • Vocabulary (elements)
  • Tuning tools (attributes)
  • Syntax (content models and nesting rules)
  • Modification and extension guidelines

For which purposes?

  • Edit texts (e.g. word processors, syntax-directed editors)
  • Edit, display, and link texts in hypertext systems
  • Format and print texts using desktop publishing systems, or batch-oriented formatting programs
  • Load texts into free-text retrieval databases or conventional databases
  • Unload texts from databases as search results or for export to other software
  • Search texts for words or phrases
  • Perform content analysis on texts
  • Collate texts for critical editions
  • Scan texts for automatic indexing or similar purposes
  • Parse texts linguistically
  • Analyze texts stylistically
  • Scan verse texts metrically
  • Link text and images
→ The aim has been to make the TEI Guidelines useful for encoding the same texts for different purposes.

How does the TEI offer this to you?

  • Website http://www.tei-c.org
  • On-line reference documentation: Guidelines
  • Print reference documentation: Guidelines
  • On-line schema/DTD generator: ROMA
  • Free software
  • Free stylesheets
  • Community of practitioners: TEI-L mailing list
  • Experience of projects
  • Opportunity to participate through SIGs
  • Conferences

Vocabulary (Elements)


Tuning tools (attributes)

e.g. Global attributes applicable to all elements
  • xml:id for unique identification
  • n for (non-unique) name or number
  • rend (rendition) indicates how the element in question was rendered or presented in the source text.
  • rendition points to a description of the rendering or presentation used for this element in the source text.
  • xml:lang for language and hence writing-system
  • xml:base provides a base URI reference with which applications can resolve relative URI references into absolute URI references.
→ Extensible, like other classes
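When such a document is processed, the xml: attributes live in the predefined XML namespace, while n and rend are ordinary attributes. A minimal sketch with Python's standard-library ElementTree; the sample paragraph and its attribute values are invented.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace; no declaration is needed.
XML_NS = "http://www.w3.org/XML/1998/namespace"

p = ET.fromstring(
    '<p xml:id="p1" xml:lang="nl" n="1" rend="italic">Dag wereld</p>'
)

xml_id = p.get(f"{{{XML_NS}}}id")   # unique identifier
lang = p.get(f"{{{XML_NS}}}lang")   # language of the content
number = p.get("n")                 # plain, un-namespaced attribute
rend = p.get("rend")
print(xml_id, lang, number, rend)
```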

Core Module

essential common elements
→ abbr add addrLine address analytic author bibl biblScope biblStruct binaryObject cb choice cit corr date del desc distinct divGen editor email emph expan foreign gap gloss graphic head headItem headLabel hi imprint index item l label lb lg list listBibl measure measureGrp meeting mentioned milestone monogr name note num orig p pb postBox postCode ptr pubPlace publisher q quote ref reg relatedItem resp respStmt rs said series sic soCalled sp speaker stage street teiCorpus term time title unclear

TEI Modules

  • analysis: adds elements for simple analytic mechanisms
  • certainty: adds elements for recording uncertainty and responsibility
  • corpus: adds specialized elements to the TEI-header for use with language corpora
  • dictionaries: replaces the basic structure with one containing detailed lexicographic features
  • drama: adds specialist tagging for cast lists, records of first performance, etc. to the basic drama markup already included in the core
  • figures: adds elements for encoding tables, pictures, and formulae
  • gaiji: adds elements for the representation of non-standard characters and glyphs
  • header: adds elements for recording common metadata
  • iso-fs: adds elements for feature structure analysis
  • linking: adds elements for hypertext linking, segmentation, and alignment
  • msdescription: adds elements for the description of manuscripts
  • namesdates: adds elements for the detailed tagging of names and dates
  • nets: adds elements for recording the abstract structure of mathematical graphs, networks, and trees
  • spoken: replaces the basic structure by one suitable for linguistic analysis of speech acts, etc
  • tagdocs: adds elements for the documentation of the XML elements and element classes which make up any markup scheme
  • tei: TEI infrastructure
  • textcrit: adds elements for text-critical apparatus
  • textstructure: default Text Structure
  • transcr: adds elements for the transcription of primary sources (e.g. manuscripts)
  • verse: adds specialist tagging for metrical analysis, rhyme-scheme etc. to the basic verse markup already included in the core

Structure of a TEI Document


<TEI>
 <teiHeader>
    <fileDesc>
       <titleStmt>
          <title><!--Title--></title>
       </titleStmt>
       <publicationStmt>
          <p><!--Publication Information--></p>
       </publicationStmt>
       <sourceDesc>
          <p><!--Information about the source--></p>
       </sourceDesc>
    </fileDesc>
 </teiHeader>
 <text xml:id="text1">
    <body>
       <p>This is the first paragraph</p>
    </body>
  </text>
 </TEI>
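Once a text has this shape, software can address its parts by path. The sketch below uses Python's standard-library ElementTree on a filled-in copy of the skeleton; the title and the header paragraphs are invented, and the sample follows the slide in omitting a namespace declaration, so a real TEI P5 file would need namespace-qualified paths.

```python
import xml.etree.ElementTree as ET

# A filled-in copy of the skeleton; header contents are hypothetical.
TEI_SAMPLE = """<TEI>
 <teiHeader>
  <fileDesc>
   <titleStmt><title>A sample title</title></titleStmt>
   <publicationStmt><p>Unpublished</p></publicationStmt>
   <sourceDesc><p>Born digital</p></sourceDesc>
  </fileDesc>
 </teiHeader>
 <text>
  <body><p>This is the first paragraph</p></body>
 </text>
</TEI>"""

root = ET.fromstring(TEI_SAMPLE)

# A TEI document is a header followed by a text.
top_level = [child.tag for child in root]

# Metadata comes from the header, content from the text.
title = root.findtext("teiHeader/fileDesc/titleStmt/title")
paragraphs = [p.text for p in root.findall("text/body/p")]
print(top_level, title, paragraphs)
```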
                     

A TEI-conformant document

comprises a header followed by a text
<TEI>
 <teiHeader>...</teiHeader>
 <text>...</text>
</TEI>

<teiHeader>

The header is essential for:
  • bibliographic control and identification
  • resource documentation and
  • processing (see later)

<teiHeader>: minimal

The TEI Header is introduced by the element <teiHeader>, which contains descriptive meta-information. It should minimally document the following aspects of the electronic file itself (<fileDesc>):
  • the title statement (<titleStmt>), providing information about the title, author and others responsible for the electronic text
  • the publication statement (<publicationStmt>), providing publication details about the electronic text
  • a description of the source (<sourceDesc>), documenting bibliographic details about the electronic text's material source (if any)

<teiHeader>: maximal

  1. file description <fileDesc>: contains a full bibliographic description of an electronic file amongst which information about the sources from which the electronic text was derived. Essential for bibliographic referencing and cataloguing.
  2. encoding description <encodingDesc>: documents the relationship between an electronic text and the source or sources from which it was derived. It allows for documenting detailed information about transcription/transliteration principles such as normalization, the treatment of quotations and hyphenation and the levels of interpretation i.e. analytic tagging and encoding applied to the document.
  3. profile description <profileDesc>: provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their settings.
  4. revision description <revisionDesc>: summarizes the revision history for a file, which is important for version control and for resolving questions about the history of a file, especially when a team of scholars is working on the same document.

<teiHeader>

The full form of a TEI Header is thus:
<teiHeader>
 <fileDesc> ... </fileDesc>
 <encodingDesc> ... </encodingDesc>
 <profileDesc> ... </profileDesc>
 <revisionDesc> ... </revisionDesc>
</teiHeader>
               
While a minimal header takes the form:
<teiHeader>
 <fileDesc> ... </fileDesc>
</teiHeader>

Can't cope? → TEI Lite

  • one of many possible views of the TEI DTD
  • small and simple
  • designed to meet 90% of the needs of 90% of the TEI user community
  • 128 elements
  • was devised as a didactic stepping stone to the full TEI, but took on a life of its own
  • realistic for existing texts and for document production
→ TEI Lite: Encoding for Interchange: an introduction to the TEI Revised for TEI P5 release.

A text usually has divisions <div>

  • generic, hierarchic subdivisions
  • vanilla or numbered
  • type attribute
  • associated <head> and <trailer>
  • <div>
  • (The full TEI supports numbered divisions: <div0>, <div1>, <div2>, <div3>, <div4>, <div5>, <div6>, <div7>)

OHCO thesis

<text>
 <front>
  <!-- title page etc. here -->
 </front>
 <body>
  <div type="book" n="1" xml:id="b0100">
   <head>Book 1</head>
   <div type="chapter" n="1" xml:id="b0101">
    <head>Chapter 1</head>
    <!-- rest of the chapter -->
   </div>
   <div type="chapter" n="2" xml:id="b0102">
    <head>Chapter 2</head>
    <!-- rest of the chapter -->
   </div>
  </div>
 </body>
</text>
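Because the divisions nest as a strict hierarchy, a table of contents can be derived from the encoding with a short recursive walk. A sketch with Python's standard-library ElementTree; the function name outline and the trimmed-down sample are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the nested divisions (no front matter).
TEXT = """<text><body>
<div type="book" n="1"><head>Book 1</head>
 <div type="chapter" n="1"><head>Chapter 1</head></div>
 <div type="chapter" n="2"><head>Chapter 2</head></div>
</div>
</body></text>"""

def outline(element, depth=0):
    """Return one indented line per <div>, following the nesting."""
    lines = []
    for div in element.findall("div"):  # direct child divisions only
        label = f"{div.get('type')} {div.get('n')}: {div.findtext('head')}"
        lines.append("  " * depth + label)
        lines.extend(outline(div, depth + 1))
    return lines

toc = outline(ET.fromstring(TEXT).find("body"))
print("\n".join(toc))
```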

Text components in TEI Lite

What are divisions made of?
  • Prose is mostly paragraphs ( <p> )
  • Verse is mostly lines ( <l> ), sometimes in hierarchic groups ( <lg> )
  • Drama is mostly speeches ( <sp> ) containing <p> or <l> and interspersed with stage directions ( <stage> )
These may be mixed, and may appear also directly within undivided texts

Prose: an example

<p>Cras interdum sollicitudin dui. Vivamus mattis pretium turpis.
Pellentesque dolor lectus, lobortis non, euismod eleifend, feugiat
sit amet, diam. Suspendisse potenti. Proin id massa non ligula
sodales fermentum. In sodales justo eget leo.</p>
<p>Praesent nec felis. Vestibulum ante ipsum primis in faucibus
orci luctus et ultrices posuere cubilia Curae; Praesent ipsum nisi,
sodales id, eleifend nec, lobortis eget, nulla.</p>
               

Verse: an example

<lg type="poem">
<lg type="stanza">
<l>Poppadom</l>
<l>Oatmeal</l>
<l>Bubble gum</l>
<l>Cut of veal</l>
</lg>
<lg type="stanza">
<l>Mince for pie</l>
<l>Frozen peas</l>
<l>Video for Guy</l>
<l>Selection of teas</l>
</lg>
<lg type="stanza">
<l>Paper towels/garbage bags</l>
<l>Pasta sauce and Parmesan</l>
<l>Pumpkin seed and olive oil</l>
</lg>
<lg type="stanza">
<l>Cheesy crisps and favourite mags</l>
<l>Kidney beans (1 large can)</l>
<l>Cling film and kitchen foil</l>
</lg>
</lg>
               

Drama: an example

<stage>A customer enters a pet shop.</stage>
<sp who="Customer"><l>'Ello, I wish to register a complaint.</l></sp>
<stage>The owner does not respond.</stage>
<sp who="Customer"><l>'Ello, Miss?</l></sp>
<sp who="Owner"><l>What do you mean "miss"?</l></sp>
<sp who="Customer"><l>I'm sorry, I have a cold. I wish to make a
   complaint!</l></sp>
<sp who="Owner"><l>We're closin' for lunch.</l></sp>
<sp who="Customer"><l>Never mind that, my lad. I wish to complain about
   this parrot what I purchased not half an hour ago from
   this very boutique.</l></sp>
 
A customer enters a pet shop.
Customer: 'Ello, I wish to register a complaint.
The owner does not respond.
Customer: 'Ello, Miss?
Owner: What do you mean "miss"?
Customer: I'm sorry, I have a cold. I wish to make a complaint!
Owner: We're closin' for lunch.
Customer: Never mind that, my lad. I wish to complain about this parrot what I purchased not half an hour ago from this very boutique.
→ Monty Python, The Dead Parrot Sketch
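The step from the encoded speeches to the reading text can be sketched in a few lines of Python (standard-library ElementTree). The wrapping <div> and the function name render are assumptions added to make the fragment parseable; only the first lines of the sketch are used.

```python
import xml.etree.ElementTree as ET

# The first lines of the scene, wrapped in a <div> so there is one root.
SCENE = """<div>
<stage>A customer enters a pet shop.</stage>
<sp who="Customer"><l>'Ello, I wish to register a complaint.</l></sp>
<stage>The owner does not respond.</stage>
<sp who="Owner"><l>What do you mean "miss"?</l></sp>
</div>"""

def render(scene):
    """Turn <stage> and <sp> elements into lines of reading text."""
    lines = []
    for node in scene:
        if node.tag == "stage":
            lines.append(node.text)
        elif node.tag == "sp":
            # Join the lines of a speech and normalise internal whitespace.
            speech = " ".join(" ".join(l.text.split()) for l in node.findall("l"))
            lines.append(f"{node.get('who')}: {speech}")
    return lines

rendered = render(ET.fromstring(SCENE))
print("\n".join(rendered))
```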

Page and line numbers

  • <pb />: pagebreak
  • <lb />: linebreak

Whan that Aprill with his shoures soote<lb />
The droghte of March hath perced to the roote,<lb />
And bathed every veyne in swich licour<lb />
Of which vertu engendred is the flour;<lb />
<pb ed="riverside" n="23" />
Whan Zephirus eek with his sweete breeth<lb />
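Empty elements such as <lb /> and <pb /> are milestones: they mark a point in the text rather than enclosing a stretch of it. A sketch of how a program might recover the lines and page breaks, using Python's standard-library ElementTree; the wrapping <p> element and the shortened sample are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

# A shortened sample, wrapped in <p> so the fragment has a root element.
VERSE = (
    "<p>Whan that Aprill with his shoures soote<lb/>"
    "The droghte of March hath perced to the roote,<lb/>"
    '<pb ed="riverside" n="23"/>'
    "Whan Zephirus eek with his sweete breeth<lb/></p>"
)
p = ET.fromstring(VERSE)

# Text before the first milestone is p.text; text after each milestone
# is stored as that milestone's "tail".
lines = []
current = p.text or ""
for node in p:
    if node.tag == "lb":
        lines.append(current.strip())  # a line ends at every <lb/>
        current = ""
    current += node.tail or ""

page_breaks = [pb.get("n") for pb in p.iter("pb")]
print(lines, page_breaks)
```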


               

Texts are not just words...

  • but probably only people know that
  • an encoding may claim to capture
    • just visual salience
    • just its assumed causes
    • both
  • encoding makes explicit one (or more) sets of interpretations

Highlighted phrases

<hi>: used to indicate typographic highlighting:
<hi rend="gothic">And this Indenture further witnesseth</hi>
   that the said <hi rend="italic">Walter Shandy</hi>, merchant,...

Or

A descriptive tag can document the reason for highlighting:
<seg type="formula">And this Indenture further witnesseth</seg>
 that the said <name rend="italic">Walter Shandy</name>, merchant,...

Interpretive highlighting

<emph>: used to indicate linguistic (not typographic) emphasis
You did <emph>what?</emph>
Varieties of linguistic distancing: <gloss>, <term>, <soCalled>, and <mentioned>
A <term>DTD</term> specifies <gloss>the vocabulary
and the syntax of a markup language</gloss>
They put us out of work and call it <soCalled>downsizing</soCalled>
<mentioned>Downsizing</mentioned> is a very nasty neologism

<q>: Direct - Indirect speech & thought

  • Use the type attribute to indicate whether it is spoken or thought
  • Use the who attribute to show speakers
  • Speeches can be nested in other speeches

<p>Praesent a orci. Donec cursus augue in leo. Nam tristique. Morbi
consequat diam in neque. Nullam ac ipsum laoreet mi porta fringilla
Barbara: <said who="Barbara" direct="true" aloud="true" rend="inline">
“Vestibulum ante ipsum primis in faucibus
orci luctus et ultrices posuere cubilia Curae”</said>;
Etiam placerat hendrerit lacus. Nulla sodales.</p> 
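
The nesting mentioned in the last bullet can be sketched as follows. This is only an illustration with invented text; it reuses the who, direct and aloud attributes from the example above, and the speaker names are assumptions.

```xml
<p><said who="Narrator" direct="true" aloud="true">Then she looked at me and said,
<said who="Barbara" direct="true" aloud="true">I have never heard such nonsense</said>,
and walked out.</said></p>
```

The inner <said> sits entirely inside the outer one, so the quoted remark stays attributed to Barbara while the surrounding report stays attributed to the narrator.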

<q>: Direct - Indirect speech & thought

Whether or not the quotation marks (or any other marks) are explicitly transcribed and encoded is up to the encoder. So far, the examples have treated quotation marks as document content. Alternatively, the rendering of the quotation marks can be documented in a rend attribute, using some appropriate set of conventions. A possible alternative for one of the examples above could be:

 <p>Praesent a orci. Donec cursus augue in leo. Nam tristique. Morbi
 consequat diam in neque. Nullam ac ipsum laoreet mi porta fringilla
 Barbara: <said who="Barbara" direct="true" aloud="true" rend="PRE ldquo POST rdquo">
 Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae</said>;
 Etiam placerat hendrerit lacus. Nulla sodales.</p>
               

"Foreign" language phrases

  • The xml:lang attribute may be attached to any element
  • Use <foreign> if nothing else is available
  • Define each language in <langUsage> in the <teiHeader>
<profileDesc>
 <langUsage>
 <language xml:id="deu">German</language>
 <language xml:id="fra">French</language>
 </langUsage>
</profileDesc>
 
<p>Have you read <title xml:lang="deu">Die Dreigroschenoper</title>?
<mentioned xml:lang="fra">Savoir-faire</mentioned> is French for know-how.
John has real <foreign xml:lang="fra">savoir-faire</foreign>.</p>
               

Phrase level elements

  • are often by convention typographically distinct
  • "data-like" (names, numbers, dates, times, addresses)
    • <name>
    • <num>
    • <date>
    • <time>
    • <address> & <addrLine>
  • editorial intervention (corrections, regularizations, additions, omissions ...)
    • <corr>
    • <sic>
    • <reg>
    • <orig>
    • <add>
    • <del>
  • cross references and links
    • <ptr />
    • <ref>
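
A short sketch of how several of these elements might combine in running text. The sentence is invented, the target values "#notes" and "#n1" are hypothetical identifiers, and pairing <sic>/<corr> and <orig>/<reg> inside <choice> assumes TEI P5 conventions, which these slides do not show.

```xml
<p>I <choice><sic>recieved</sic><corr>received</corr></choice> your letter
of <date>4 December</date> and passed it on to
<name>Walter Shandy</name>, who found the
<choice><orig>olde</orig><reg>old</reg></choice> spelling charming.
See <ref target="#notes">the notes</ref> for details <ptr target="#n1" />.</p>
```

Note the difference between the two linking elements: <ref> wraps the text that serves as the link, whereas <ptr /> is empty and only marks the point where a link occurs.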

Dates, times, numbers

  • attributes can be used to give <date> expressions a normalized value
  • similarly for times (<time>) and numbers (<num>)
Today is <date>Friday 4th</date>
Today is <date value="2008-12-04">Friday 4th</date>
One day in <date certainty="approx" value="2008-12">early December</date>
It's now <time value="12:00">noon</time>.
<num value="4">four</num>
<num value="3.1416">pi</num>

Omissions, Deletions, Additions

  • <gap>: omission by transcriber
  • <del>: cancellation in source or by editor
  • <add> or <supplied>: insertion in source or by editor
  • <unclear>: material uncertain because illegible
  • <damage>: physical damage to text carrier
<gap reason="illegible" desc="bloodstain" />
He was <del resp="EV" hand="author" type="strike">not</del> very nice.
He was <add resp="EV" hand="author" place="supralinear">not</add> very nice.
He was <unclear resp="EV" reason="ink stain">not</unclear> very nice. 
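
The list above also names <supplied> and <damage>, which the examples do not cover. A minimal sketch in the same style; the attribute values are assumptions chosen for illustration.

```xml
He was <supplied reason="illegible" resp="EV">not</supplied> very nice.
He was <damage agent="water">not very</damage> nice.
```

In the first line the editor (EV) supplies a word that cannot be read in the source. In the second the words are still legible, but the encoding records that the text carrier is water-damaged at that point.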

Abbreviations and expansions

  • <abbr>: contains an abbreviation. Its expansion is given in an expan attribute.
 We learn how to write <abbr expan="eXtensible Markup Language">XML</abbr>
               

Addresses

  • <address>: contains a postal or other address.
  • <addrLine>: contains a line inside an <address>.
<address>
<addrLine>Koninklijke Academie voor Nederlandse Taal- en Letterkunde</addrLine>
<addrLine>Koningstraat 18</addrLine>
<addrLine>b-9000 Gent</addrLine>
<addrLine>tel: +32 (0)9 265.93.50</addrLine>
<addrLine>fax: +32 (0)9 265.93.49</addrLine>
<addrLine>email: ctb@kantl.be</addrLine>
</address>

Lists

<list>: contains any sequence of items organized as a list.
<item>: contains one component of a list.
<label>: contains the label associated with an item in a list; in glossaries, marks the term being defined.
<list type="ordered">
<item>Week 1</item>
<item>Week 2</item>
<item>Week 3</item>
<item>Week 4</item>
<item>Week 5</item>
</list>
<list type="gloss">
<head>Vocabulary</head>
<label xml:lang="enm">Whan that</label> <item>When</item>
<label xml:lang="enm">Aprill</label> <item>April</item>
<label xml:lang="enm">with</label> <item>with</item>
<label xml:lang="enm">his</label> <item>its</item>
<label xml:lang="enm">shoures</label> <item>showers</item>
<label xml:lang="enm">soote</label> <item>sweet</item>
</list>

Notes

<note>: contains a note or annotation.
<p>Indeed, hypertext is just the visualization of linking which
   DeRose &amp; van Dam define as "the ability to express relationships
   between places in a universe of information"
   <note place="foot" n="1">"A place should be any piece of
   information, or at least any that exists in a stable or recoverable
   form." (DeRose &amp; van Dam 9).</note></p>
               

<table>

  • a <table> element contains <row>s of <cell>s
  • spanning is indicated by the rows and cols attributes
  • the role attribute indicates whether a <row> or <cell> holds data or a label
  • embedded tables are permitted
<table rows="2" cols="2">
<row role="label">
<cell>Male Students</cell>
<cell>Female Students</cell>
</row>
<row role="data">
<cell>3</cell>
<cell>6</cell>
</row>
</table>
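
Spanning is not shown in the table above. As a sketch, using the same invented figures, a heading cell can be stretched over both columns by giving the <cell> itself a cols attribute:

```xml
<table rows="3" cols="2">
<row role="label">
<cell cols="2">Students</cell>
</row>
<row role="label">
<cell>Male</cell>
<cell>Female</cell>
</row>
<row role="data">
<cell>3</cell>
<cell>6</cell>
</row>
</table>
```

On <table>, rows and cols give the overall dimensions; on a <cell>, they state how many rows or columns that one cell occupies.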
               

Bibliography

Use simple <bibl> with optional subcomponents:
  • <respStmt> (for any kind of responsibility) or <author>, <editor>, etc.
  • <title> with optional level attribute
  • <imprint> groups publication details
  • <biblScope> adds page references
  • Use <listBibl> for list of references

<bibl>
  <author>Walsh, Marcus</author>
  <date>(1993)</date>
  <title level="a">The Fluid Text and the Orientations of Editing.</title>
  <editor>Chernaik, Warren, Caroline Davis, and Marilyn Deegan</editor>
  <title level="m">The Politics of the Electronic Text</title>
  <imprint>
     <pubPlace>Oxford</pubPlace>
     <publisher>Office for Humanities Communication</publisher>
  </imprint>
  <biblScope>31-39</biblScope>
</bibl>
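
The <listBibl> mentioned in the last bullet simply wraps a series of <bibl> elements. A minimal sketch reusing a shortened form of the entry above; the heading text is an assumption.

```xml
<listBibl>
  <head>Works cited</head>
  <bibl>
    <author>Walsh, Marcus</author>
    <date>(1993)</date>
    <title level="a">The Fluid Text and the Orientations of Editing.</title>
    <title level="m">The Politics of the Electronic Text</title>
    <biblScope>31-39</biblScope>
  </bibl>
  <!-- further <bibl> entries follow here -->
</listBibl>
```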

               

Referring strings

The <rs> (referring string) element is used for any kind of name or reference:
<q>My dear <rs type="person" key="BENM1">Mr. Bennet</rs>,</q>
said <rs type="person" key="BENM2">his lady</rs> to him one day,
<q>have you heard that
<rs type="place" key="NETP1">Netherfield Park</rs> is let
at last?</q>
               

<figure>

  • The presence of a graphic is indicated by the <figure> element
  • The title of the graphic is tagged as a <head>
  • A description of the graphic may be supplied (as a <figDesc>) for use by software unable to render the graphic
  • The graphic itself is specified as an external entity
<!ENTITY logoctb SYSTEM "logoctb.gif" NDATA GIF>
<figure entity="logoctb">
<head>The logo of the CTB</head>
<figDesc>The letters c, t, and b in red with black border on
 a white field</figDesc>
</figure>
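
The entity declaration above follows older, SGML-style practice. As a sketch, and assuming TEI P5 conventions (which this slide does not show), the same figure can point at the image file directly with a <graphic> element and its url attribute:

```xml
<figure>
<graphic url="logoctb.gif" />
<head>The logo of the CTB</head>
<figDesc>The letters c, t, and b in red with black border on
 a white field</figDesc>
</figure>
```

This form needs no DTD declaration for the image, so the document can be processed without an entity-aware parser.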

The multiple hierarchy problem

  • XML allows only one hierarchy at a time
  • Is a document
    • chapter-paragraph-phrase
    • gathering-leaf-page
    • or both?
  • discontinuous segments
  • links and milestones
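
One common way to record a second hierarchy is with empty milestone elements such as the <pb /> shown earlier. A minimal sketch with invented text: chapter and paragraphs form the XML hierarchy, while the page boundary is marked only as a point in the text, so it can fall inside a paragraph without breaking well-formedness.

```xml
<div type="chapter">
<p>This paragraph starts near the foot of one page
<pb n="23" />
and runs on to the top of the next.</p>
<p>A second paragraph follows on the same page.</p>
</div>
```

The trade-off is that a page is no longer an element with content of its own: software has to reconstruct it as everything between one <pb /> and the next.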

But what are we doing with this?


Putting it all together


Webliography: TEI
