What is TEI all about?
Text Encoding and markup for the Humanities
University College London
6 November 2009
Edward Vanhoutte
edward.vanhoutte@kantl.be

Royal Academy of Dutch Language and Literature
VI. XML
Why would you want to learn about XML?
→ XML is not the end of the world, and won't solve all your problems,
but:
- It's a good approximation
- It works (fairly well)
- It's widely supported
- It's a W3C recommendation
→ XML will not live forever, but:
- Migration will be supported
- Migration will create more jobs
→ It's fun.
XML: Five Essential Components
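- Processing Instructions: <?xml version="1.0" ?> (strictly speaking this is the XML declaration, which shares the processing-instruction syntax)
- Elements, marked by a start-tag and an end-tag: <title> and </title>
- Attributes (optional): <title type="journal">
- Entity References: &eacute; = &#233; = é
- CDATA: <![CDATA[This text escapes the XML processor]]>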
→ XML is case sensitive in the naming of elements
<title> is not <TITLE> is not <Title> is not <tItle> etc.
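A minimal sketch that puts the five components together in one document. The element names, the stylesheet file name `article.xsl`, and the text are invented for illustration only; the entity is declared in the internal subset so that the reference can be resolved.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- a processing instruction (the file name is hypothetical) -->
<?xml-stylesheet type="text/xsl" href="article.xsl"?>
<!DOCTYPE article [
  <!-- declares the entity that is referenced in the title below -->
  <!ENTITY eacute "&#233;">
]>
<article>
  <!-- an element with an attribute and an entity reference -->
  <title type="journal">Caf&eacute; Studies</title>
  <!-- a CDATA section: the markup characters inside are treated as plain text -->
  <note><![CDATA[Markup such as <b> & </b> is not interpreted here.]]></note>
</article>
```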
- Well-formed XML
- Valid XML
Well-formed XML
- XML is case sensitive
- At least one element
- There is always a root element
- All logical and physical structures nest properly
- Correspondence of element names in start- and end-tags
- Attribute names only appear once in a start-tag
- Attribute values are quoted
- Attribute values do not refer to external entities
- Entities are declared
- No entity reference contains the name of an unparsed entity
Valid XML
- Well-formed XML
- Conforms to a Document Type Definition (DTD)
- Or to a valid Schema
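As a hedged illustration of the well-formedness rules, the listing below shows a small well-formed document, followed (inside a comment, so that the listing itself still parses) by a variant that breaks three of the rules. The element names are invented for this example.

```xml
<?xml version="1.0"?>
<!-- Well-formed: one root element, proper nesting, matching tag names, quoted attribute value -->
<letter type="private">
  <salutation>Dear <name>Edward</name></salutation>
</letter>
<!-- NOT well-formed:
     <letter type=private>                                 unquoted attribute value
       <salutation>Dear <name>Edward</salutation></name>   elements do not nest properly
     </Letter>                                             end-tag name differs in case
-->
```

Whether the first document is also valid cannot be decided from the document alone: that depends on the DTD or Schema it is checked against.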
→ The formal specification for the structure of an XML document
<?xml version="1.0" ?>
<greeting>
<salutation>hello</salutation>
<target>world</target>
</greeting>
DTD:
<!ELEMENT greeting (salutation, target)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT target (#PCDATA)>
A DTD specifies the vocabulary and the syntax of a markup language
It defines:
- names for all your elements
- names and default values for their attributes
- rules about how elements can nest
- names for re-usable pieces of data (entities)
- and a few other things
A DTD does not specify anything about what elements "mean"
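A small sketch of what such definitions look like, extending the greeting DTD shown earlier. The attribute `register` and the entity `kantl` are assumptions made up for this example; they are not part of any real DTD.

```xml
<!-- names for elements and rules about how they nest -->
<!ELEMENT greeting (salutation, target)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT target (#PCDATA)>
<!-- an attribute name with its permitted values and a default value -->
<!ATTLIST greeting register (formal | informal) "informal">
<!-- a name for a re-usable piece of data (a general entity) -->
<!ENTITY kantl "Royal Academy of Dutch Language and Literature">
```

Nothing in these declarations says what a greeting is or how it should be displayed; they only constrain names and structure.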
Defining an element in a DTD
<!ELEMENT name contentModel>
- name is the name of the element (its generic identifier, or GI)
- contentModel defines valid content for the element:
- #PCDATA
- EMPTY
- other elements
- mixed content combining PCDATA and other elements
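The four kinds of content model can be sketched as follows. The element names are chosen for illustration and do not come from a particular DTD.

```xml
<!ELEMENT title (#PCDATA)>             <!-- #PCDATA: text only -->
<!ELEMENT pb EMPTY>                    <!-- EMPTY: no content, written as <pb/> -->
<!ELEMENT chapter (title, p+)>         <!-- other elements: one title, then one or more p -->
<!ELEMENT p (#PCDATA | emph | pb)*>    <!-- mixed content: text interleaved with emph and pb -->
<!ELEMENT emph (#PCDATA)>
```

In a content model a comma means "followed by", `|` means "or", and the occurrence indicators `?`, `*`, and `+` mean optional, zero or more, and one or more.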
Theoretical problem with the DTD
- DTDs are written in a formal language that differs from that of the XML documents they describe.
→ XML Schema
- A self-declarative way of documenting the formal specification for the
structure of an XML document
- Expressed in XML itself
Schemas can deal with namespaces; DTDs cannot
An XML namespace is a collection of names, identified by a URI reference, which
are used in XML documents as element types and attribute names.
e.g.: <table> in doc1 is not necessarily the same as <table> in
doc2
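A sketch of how namespaces keep two kinds of `<table>` apart within a single document. The XHTML namespace URI is the real one; the furniture URI, the prefixes, and the element names under it are hypothetical.

```xml
<?xml version="1.0"?>
<inventory xmlns:h="http://www.w3.org/1999/xhtml"
           xmlns:f="http://example.org/ns/furniture">
  <!-- a table in the XHTML sense: rows and cells -->
  <h:table>
    <h:tr><h:td>oak</h:td><h:td>4 legs</h:td></h:tr>
  </h:table>
  <!-- a table in the furniture sense -->
  <f:table>
    <f:material>oak</f:material>
    <f:legs>4</f:legs>
  </f:table>
</inventory>
```

The prefixes `h` and `f` are only local shorthand; what identifies each vocabulary is the URI the prefix is bound to.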
→ There are many Schema languages around:
- W3C XML Schema: http://www.w3.org/XML/Schema
- Relax NG: http://www.relaxng.org
- XML-Data (XDR)
- Document Content Description (DCD)
- Schema for Object-oriented XML (SOX)
- Document Definition Markup Language (DDML)
- Schematron
- Datatypes for DTDs (DT4DTD)
- Document Structure Description (DSD)
- Regular Language Description for XML (RELAX)
- TREX (Tree Regular Expressions for XML)
- Examplotron
- Hook
- Document Schema Definition Language (DSDL)
- STEP/EXPRESS and XML
Book DTD
<!ELEMENT book (chapter+)>
<!ELEMENT chapter (section+)>
<!ELEMENT section (p+)>
<!ELEMENT p (#PCDATA)>
Book Schema (Relax NG Simple notation)
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0">
  <start>
    <ref name="book" />
  </start>
  <define name="book">
    <element>
      <name>book</name>
      <oneOrMore>
        <ref name="chapter" />
      </oneOrMore>
    </element>
  </define>
  <define name="chapter">
    <element>
      <name>chapter</name>
      <oneOrMore>
        <ref name="section" />
      </oneOrMore>
    </element>
  </define>
  <define name="section">
    <element>
      <name>section</name>
      <oneOrMore>
        <ref name="p" />
      </oneOrMore>
    </element>
  </define>
  <define name="p">
    <element>
      <name>p</name>
      <text />
    </element>
  </define>
</grammar>
<!ELEMENT book (chapter+)>
<define name="book">
  <element>
    <name>book</name>
    <oneOrMore>
      <ref name="chapter" />
    </oneOrMore>
  </element>
</define>
<!ELEMENT chapter (section+)>
<define name="chapter">
  <element>
    <name>chapter</name>
    <oneOrMore>
      <ref name="section" />
    </oneOrMore>
  </element>
</define>
<!ELEMENT section (p+)>
<define name="section">
  <element>
    <name>section</name>
    <oneOrMore>
      <ref name="p" />
    </oneOrMore>
  </element>
</define>
<!ELEMENT p (#PCDATA)>
<define name="p">
  <element>
    <name>p</name>
    <text />
  </element>
</define>
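For orientation, here is a sketch of a document instance that should satisfy both the Book DTD and the Relax NG Book Schema, since both say the same thing: a book holds one or more chapters, a chapter one or more sections, a section one or more paragraphs. The text content is invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<book>
  <chapter>
    <section>
      <p>First paragraph of the first section.</p>
      <p>Second paragraph of the first section.</p>
    </section>
    <section>
      <p>The only paragraph of the second section.</p>
    </section>
  </chapter>
</book>
```

A `<chapter>` placed directly inside a `<section>`, or a `<book>` without any `<chapter>`, would be well-formed but not valid against either specification.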
- A DTD/Schema is very useful at data preparation time: validating editors
only allow you to input correct markup
- Useful for consistent encoding in projects
- Guarantees longevity and interchangeability of semantics and structure in
encoded texts
- We need it for validation
- Useful for software development and operability
A valid XML document contains a Document Type Declaration
The Document Type Declaration contains or points to a Document Type Definition (DTD); a Schema is associated with the document by other means and is itself an XML document
<!DOCTYPE TEI PUBLIC "-//TEI//DTD TEI Lite 1.0//EN"
"../dtd/teixlite.dtd" [ ]>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
XML can be validated when we have:
- an XML document
- a DTD or Schema
- a validating parser
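A minimal sketch that brings the first two ingredients together in one file: the greeting document from earlier, with its DTD placed in the internal subset of the Document Type Declaration. Saved under a hypothetical name such as `greeting.xml`, it could be checked with any validating parser, for instance with `xmllint --valid --noout greeting.xml`.

```xml
<?xml version="1.0"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (salutation, target)>
  <!ELEMENT salutation (#PCDATA)>
  <!ELEMENT target (#PCDATA)>
]>
<greeting>
  <salutation>hello</salutation>
  <target>world</target>
</greeting>
```

Swapping the order of `<salutation>` and `<target>` would leave the document well-formed but make it invalid, because the content model requires the salutation first.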
→ Bad news
- Difficult
- Time Consuming
- Expensive
→ Good news
- The work has been done for you
VII. TEI
Text Encoding Initiative (TEI)
A standard for the representation of textual material in digital form by means of text encoding.
- provides DTD subsets, Schema modules, and elements for the encoding of:
- texts in any natural language
- of any date
- in any literary genre or text type
- without restrictions on form or content
- treats both continuous materials ('running text') and discontinuous materials
such as dictionaries and linguistic corpora.
- supports the encoding of all kinds of features of all kinds of texts studied
by researchers
- is application independent
→ It's free!
What does the TEI offer you?
The TEI has produced a number of DTD subsets/Schema fragments which can be
combined according to the needs of a particular project in the humanities.
- Vocabulary (elements)
- Tuning tools (attributes)
- Syntax (content models and nesting rules)
- Modification and extension guidelines
- Edit texts (e.g. word processors, syntax-directed editors)
- Edit, display, and link texts in hypertext systems
- Format and print texts using desktop publishing systems, or batch-oriented
formatting programs
- Load texts into free-text retrieval databases or conventional databases
- Unload texts from databases as search results or for export to other
software
- Search texts for words or phrases
- Perform content analysis on texts
- Collate texts for critical editions
- Scan texts for automatic indexing or similar purposes
- Parse texts linguistically
- Analyze texts stylistically
- Scan verse texts metrically
- Link text and images
→ The aim has been to make the TEI Guidelines useful for encoding the
same texts for different purposes.
How does the TEI offer this to you?
- Website http://www.tei-c.org
- On-line reference documentation: Guidelines
- Print reference documentation: Guidelines
- On-line schema/DTD generator: ROMA
- On-line tutorials: TEI by Example http://www.teibyexample.org
- Free software
- Free stylesheets
- Community of practitioners: TEI-L mailing list
- Experience of projects
- Opportunity to participate through SIGs
- Conferences
- TEI Infrastructure
- TEI Header
- Elements Available in All TEI Documents
- Default Text Structure
- Representation of Non-standard Characters and Glyphs
- Verse
- Performance Texts
- Transcriptions of Speech
- Dictionaries
- Manuscript Description
- Representation of Primary Sources
- Critical Apparatus
- Names, Dates, People, and Places
- Tables, Formulae, Graphics
- Language Corpora
- Linking, Segmentation, Alignment
- Simple Analytical Mechanisms
- Feature Structure
- Graphs, Networks, Trees
- Certainty and Responsibility
- Documentation Elements
A TEI-conformant document
comprises a header followed by a text
The header is essential for:
- bibliographic control and identification
- resource documentation and
- processing (see later)
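The "header followed by a text" shape can be sketched as the following bare-bones TEI P5 document. The title and the statements in the header are placeholders, and a real header would normally carry far more documentation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- the header: bibliographic control and resource documentation -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A sample document</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished teaching example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital; no earlier source exists.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- the text itself -->
  <text>
    <body>
      <p>The encoded text goes here.</p>
    </body>
  </text>
</TEI>
```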
- Structural
- Renditional
- Logical & Semantic
- Analytical
- Prose: paragraphs <p>, divisions <div>, headings <head>, lists <list>, list items <item>, quotations <q>, page breaks <pb>, segments <seg>, figures <figure>, and tables <table>.
- Verse: line groups <lg> and lines <l>.
- Drama: divisions <div>, speeches <sp>, paragraphs <p>, line groups <lg>, lines <l>, and segments <seg>.
Structural Features: Prose
Structural Features: Verse
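A hedged sketch of how the prose and verse elements might be combined in a TEI body. The prose content is invented; the verse lines are the opening of William Blake's "The Tyger". The `type` and `n` values are illustrative choices, not required ones.

```xml
<body xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Prose: a division with a heading, a paragraph, and a list -->
  <div type="chapter" n="1">
    <head>A Sample Chapter</head>
    <p>A paragraph of running prose, followed by a list.</p>
    <list>
      <item>first item</item>
      <item>second item</item>
    </list>
  </div>
  <!-- Verse: a line group containing lines -->
  <div type="poem">
    <head>The Tyger</head>
    <lg type="stanza">
      <l>Tyger Tyger, burning bright,</l>
      <l>In the forests of the night;</l>
      <l>What immortal hand or eye,</l>
      <l>Could frame thy fearful symmetry?</l>
    </lg>
  </div>
</body>
```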
Structural Features: Drama
A customer enters a pet shop.
Customer: 'Ello, I wish to register a complaint.
The owner does not respond.
Customer: 'Ello, Miss?
Owner: What do you mean "miss"?
Customer: I'm sorry, I have a cold. I wish to make a complaint!
Owner: We're closin' for lunch.
Customer: Never mind that, my lad. I wish to complain about this parrot what I purchased not half an hour ago from this very boutique.
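One possible encoding of the opening of this dialogue with the drama elements, offered as a sketch rather than the only correct solution. The `who` values point to hypothetical identifiers that would be declared elsewhere, for instance in a cast list.

```xml
<div type="scene" xmlns="http://www.tei-c.org/ns/1.0">
  <!-- stage directions are kept apart from the speeches -->
  <stage type="entrance">A customer enters a pet shop.</stage>
  <sp who="#customer">
    <speaker>Customer</speaker>
    <p>'Ello, I wish to register a complaint.</p>
  </sp>
  <stage>The owner does not respond.</stage>
  <sp who="#customer">
    <speaker>Customer</speaker>
    <p>'Ello, Miss?</p>
  </sp>
  <sp who="#owner">
    <speaker>Owner</speaker>
    <p>What do you mean "miss"?</p>
  </sp>
</div>
```

The speaker labels and the colon-separated layout of the plain text become explicit structure: each `<sp>` holds one speech, and `<speaker>` records the label as printed.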
Structural Features: Title Page
Renditional Features: Highlighting
Commonly Highlighted
- Emphasis <emph>, foreign words <foreign> and other linguistically distinct uses <distinct> of highlighting
- The use of quotation marks in the representation of speech and thought <said>, quotation <quote>, cited quotation <cit>, words or phrases mentioned <mentioned>, and words or phrases for which the author or narrator indicates a disclaiming of responsibility <soCalled>.
- Technical terms <term>, glosses <gloss> or documentation of XML elements, attributes and classes with <altIdent>, <desc>, <equiv/>
Commonly not highlighted
<title>, <name>, <num>, <measure>, <date>, <address>, <abbr> and <expan>.
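A short sketch showing several of these elements in one invented sentence. The generic `<hi>` element with a `rend` attribute is added here as an assumption: it is the usual TEI fallback when text is highlighted but the reason for the highlighting is not interpreted.

```xml
<p xmlns="http://www.tei-c.org/ns/1.0">He read <title>Le Monde</title> every morning and
  called it his <soCalled>window on the world</soCalled>. It gave him a certain
  <foreign xml:lang="fr">joie de vivre</foreign>, and he was
  <emph>never</emph> without it. The word <mentioned>markup</mentioned>
  is a <term>technical term</term>; text that is italicised for unknown reasons can
  simply be tagged <hi rend="italic">like this</hi>.</p>
```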
- Omissions, deletions, additions: <gap>, <del>, <add>, <supplied>, <unclear>, <damage>
- Regularization: <reg>, <orig>
- Editorial interventions: <corr>, <sic>
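A sketch of these editorial elements applied to an invented transcription. The `<choice>` wrapper, which pairs an original reading with its correction or regularization, and the attribute values are illustrative assumptions.

```xml
<p xmlns="http://www.tei-c.org/ns/1.0">The parrot was
  <!-- an apparent error in the source together with the editor's correction -->
  <choice><sic>ded</sic><corr>dead</corr></choice>,
  <!-- an original spelling together with its regularized form -->
  <choice><orig>sayd</orig><reg>said</reg></choice> the customer,
  <!-- a deletion and an addition found in the source itself -->
  <del rend="overstrike">very</del> <add place="above">quite</add> loudly
  <!-- text that cannot be read, and text that can only be read with difficulty -->
  <gap reason="illegible"/> <unclear>indeed</unclear>.</p>
```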
Problem
- XML allows only one hierarchy at a time
- Is a document
- chapter-paragraph-phrase
- gathering-page-leaf
- or both?
- discontinuous segments
- links and milestones
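The milestone approach can be sketched as follows: the logical hierarchy (division, paragraph) is encoded with ordinary elements, while the physical hierarchy (pages) is reduced to empty `<pb/>` elements that mark where each page begins. The text and the page numbers are invented.

```xml
<div type="chapter" n="1" xmlns="http://www.tei-c.org/ns/1.0">
  <pb n="1"/>
  <p>This paragraph begins on page 1 and
    <pb n="2"/>
    continues on page 2, so the page boundary falls in the middle of it.</p>
  <p>This paragraph sits entirely on page 2.</p>
</div>
```

Had pages been encoded as container elements, the first paragraph would have to start inside one page element and end inside another, which would violate the nesting rules of well-formed XML.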
But what are we doing with this?
Losing the Pedals? → TEI Lite
- one of many possible views of the TEI DTD
- small and simple
- designed to meet 90% of the needs of 90% of the TEI user community
- 128 elements
- was devised as a didactic stepping stone to the full-flavoured TEI, but took on a
life of its own
- realistic for existing texts and for document production
→ TEI Lite: Encoding for Interchange: an introduction to the TEI. Revised for TEI P5 release.
Losing the Pedals? → TEI By Example