Manuscript and Electronic Text Academy - META07
Text Encoding, Markup and XML
Faculdade de Letras da Universidade de Lisboa
Tuesday 30 January
Edward Vanhoutte

Session 4 (Tuesday 30 January)

I. Tuesday 30 January: Text encoding and markup
- 9.00-10.30: 1.1. Introductions (Edward Vanhoutte)
  - This Workshop
  - XML: a first successful attempt
- 11.00-12.30: 1.2. Hypertextual Structures of Complex Writing Processes (Dirk Van Hulle)
- 13.30-15.00: 1.3. Versions and Compositional Variants (Dirk Van Hulle)
- 15.30-17.00: 1.4. Text-encoding, Markup, XML and TEI (Edward Vanhoutte)
  - Workflow
  - Why Electronic Texts?
  - Text encoding & markup
  - The Text Encoding Initiative
  - Why XML?
  - Document analysis
  - Structure and contents
  - Markup
  - OHCO
  - DTD
Workflow: mantra in 9 lines
- Project definition
- Document analysis
- Encoding design
- Coding
- Validating
- Functionality: storyboard
- Transform, generate, implement
- Document
- Publish
Why electronic texts? Macro level
2 kinds of electronic texts:
- digitally created (born digital)
- digitized
→ 2 different answers to the questions:
- technological progress
- create access to material
- support preservation policy
- collection acquisition/completion
- flexible use
- institutional and strategic advantage (prestige)
- research
- education
- Short term thinking
  - document creation/production
  - print
  - mail
  - document layout
  - ease of the word processor
  - WYSIWYG interface
  - → .rtf, .doc, .wpd, .pdf, .xls, .mdb
  - → Word, WordPerfect, Adobe Acrobat, Excel, Access
- Long term thinking
  - justify investments
  - retain access to material
  - maintain project
  - exchange of data
  - document structure
  - → ISO standards and W3C norms
Why electronic texts? Micro level
Your own research:
- Research question
- Specific history and story of your text
- Your institutional context
- Your methodological network
BUT Text & Computer: a problematic relationship
Texts cannot be put into computers. Neither can numbers. Computers can contain and operate on patterns of electronic charges, but they cannot contain numbers, which are abstract mathematical objects not electronic charges, nor texts, which are complex, abstract cultural and linguistic objects.
→ Michael Sperberg-McQueen, 'Text in the Electronic Age: Textual Study and Text Encoding with examples from Medieval Texts.' Literary and Linguistic Computing, 6/1 (1991): 34-46. (34)
Computers work with a representation of text
William Gibson é um guru do science fiction e Neuromancer é seu masterpiece do cyberpunk.
William Gibson é um guru do {\i science fiction} e {\i Neuromancer} é seu {\i masterpiece} do {\i cyberpunk}.
000101110000101000000011111010001111111011111
011001000000111111010100100010111101000100100
001111100100010011111110001010101010101010111
110001011100001010110001001000010000000010101
011110101001100010111000010101100010010000111
- Interchange between systems and platforms causes loss of information
→ Short data life cycle
- Loss of semantic information in translation to visual information
Example: Synoptic Edition
An internationally accepted standard which:
- is software and platform independent
- can describe the logical, structural, and semantic elements of a text
→ Markup
William Gibson é um guru do
<language what="English">science fiction</language>
e <title>Neuromancer</title> é seu
<language what="English">masterpiece</language>
do <term>cyberpunk</term>.
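Once the sentence carries markup, software can ask questions that the italics in the word-processor version could not answer. The following is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The wrapping `<s>` element is an assumption added here so that the fragment has a single root element, and the wording follows the plain-text version of the sentence shown earlier.

```python
import xml.etree.ElementTree as ET

# The marked-up sentence from the slide. The <s> wrapper is hypothetical:
# it only supplies the single root element that an XML parser requires.
SENTENCE = (
    '<s>William Gibson é um guru do '
    '<language what="English">science fiction</language> '
    'e <title>Neuromancer</title> é seu '
    '<language what="English">masterpiece</language> '
    'do <term>cyberpunk</term>.</s>'
)

root = ET.fromstring(SENTENCE)

# Semantic questions the italic-only version cannot answer:
english_phrases = [
    element.text
    for element in root.iter("language")
    if element.get("what") == "English"
]
titles = [element.text for element in root.iter("title")]

# The plain reading text is still recoverable by dropping the tags.
plain_text = "".join(root.itertext())

print(english_phrases)
print(titles)
print(plain_text)
```

The same italic rendering can later be derived from all three element types, but the distinction between a foreign phrase, a title, and a term is no longer lost.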
XML: eXtensible Markup Language
XML is a metalanguage by which one can create separate markup languages for separate purposes.
- W3C recommendation
- The standard on the internet
- Widely used standard for markup and exchange of information
- XML is structured data, represented as strings of text
- XML is platform-, software-, and system-independent
- No one 'owns' XML
- XML empowers the content provider and facilitates data integration, exchange, maintenance and extraction.
- XML is a simplified subset of SGML
- So it looks a bit like HTML, except that
  - XML is extensible
  - XML must be well-formed
  - XML can be validated
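What "well-formed" means in practice can be tried out with any XML parser. The sketch below is one possible illustration in Python, using the standard library's `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are assumptions made for this example. It checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: one correct document and four typical mistakes.
samples = {
    "single root, properly nested": "<root><p>Hello world</p></root>",
    "overlapping elements": "<a><b>text</a></b>",
    "two root elements": "<a>one</a><a>two</a>",
    "missing end tag": "<p>unclosed",
    "unquoted attribute value": "<p class=x>text</p>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

HTML browsers traditionally tolerate all four mistakes; an XML parser is required to reject them.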
(S)GML - XML: A bit of history
- Need for a platform and software independent format for data storage and interchange
- IBM: Generalized Markup Language (GML)
- ISO 8879:1986: Standard Generalized Markup Language (SGML): Charles Goldfarb
- SGML is a meta-language: a language for the description of markup languages
- Markup languages are commonly named after their
  - document type definition (DocBook)
  - aim (LinuxDoc)
  - application (TEI)
  - the standard they implement (J2003: automobile parts)
- The DTD is the core of an SGML system
- SGML is the solution for the industry: expensive and complicated but at the same time very powerful
Documents (texts) exist in different types:
- Poetry
- Prose
- Drama
- Letters
- Bibliographies
- Dictionaries
- Lists
- ...
→ Different organisation/order, different structures, different text ontology, different semantics
→ DTD provides the vocabulary and the syntax for a specific markup language
- Comprehensiveness.
- Simplicity.
- Documents should be processable by software of moderate complexity.
- The standard should not depend on any particular character set or text-entry device.
- The standard should not be geared to any particular analytic program or printing system.
- The standard should describe text in editable form.
- The standard should allow the interchange of encoded texts across communication networks.
→ Standard reference work: Charles F. Goldfarb. (1990). The SGML Handbook. Oxford: Clarendon Press.
The Text Encoding Initiative (TEI)
11-12 November 1987: Vassar College, Poughkeepsie (NY)
→ 32 humanities scholars
Principles:
- Platform-independent
- Software-independent
- Endurability
- Re-usability
- Accessibility
- Language-independent
- For all of the Humanities disciplines
→ SGML ISO 8879:1986
→ Sperberg-McQueen, C.M. and Burnard, L. (eds.) (2002). TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML Version (Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium.)
→ Sperberg-McQueen, C.M. and Burnard, L. (eds.) (2005). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Revised and re-edited. (Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium.)
http://www.tei-c.org/release/doc/tei-p5-doc/html/
- 1987: Vassar College, Poughkeepsie
- 1990: P1 : SGML
- 1992: P2 : SGML
- 1994: P3 : SGML : 600+ elements
- 1995: TeiLite : SGML : 131 elements
- 1999: P3rev : SGML
- 2000: TEI Consortium
- 2001: P4 : XML-compatible
- 2001: TeixLite : XML
- 2005-2007: P5 : XML
So: Why would you want to learn about XML? The short answer:
→ XML is not the end of the world, and won't solve all your problems, but:
- It's a good approximation
- It works (fairly well)
- It's widely supported
- It's a W3C recommendation
Why would you want to learn about XML? The shortest answer:
→ It's fun.
Why XML? The longer answer:
- Internationally supported (de facto) standard:
  - W3C recommendation
  - wide support
- Software and platform independent
- Describes the logical, structural, as well as semantic elements of a text
- Relatively easy to master
- Self declarative: for the machine and for human beings
- Low creation cost thanks to open source support
- Design of XML is formal and concise (Extended Backus-Naur Form, EBNF)
- It is relatively easy to write programs which process XML documents.
- Internationalisation: Unicode
- Flexible output
- Can be used with existing web protocols (HTTP) and mechanisms (URL)
- Related standards in the XML family empower XML: XSLT, XSL-FO, XPath, XQuery, XLink, etc.
Why would you want to learn about XML?
→ XML will not live forever
- XML will develop into ... ?
- Other standards
BUT
- Migration will be supported
- Migration will create more jobs
Why would you want to learn about TEI?
- Best guarantee for intellectual sustainability of your research efforts.
- Supported by a community
- No 'take it or leave it' approach: extensible, modifiable
- It's free
Markup is a way to provide instructions about how a text should look.
This implies:
- stable agreements on ontology and semantics
- a community who knows these agreements

e.g. Abraham Lincoln's "Gettysburg Address"

Document analysis: Exercise
- Texts are more than simply sequences of glyphs
→ They have structure and content and they also have multiple readings
- Text encoding or markup provides a means of making such structure, content, and readings explicit
→ Only what is explicitly articulated can be digitally processed
Texts have structure and content 1
POOREDWARDDIDYOUHEARTHENEWSABOUTEDWARD?ONTHEBACKOFHISHE
ADHEHADANOTHERFACEWASITAWOMAN'SFACEORAYOUNGGIRLTHEYSAID
TOREMOVEITWOULDKILLHIMSOPOOREDWARDWASDOOMEDTHEFACECOULD
LAUGHANDCRYITWASHISDEVILTWINANDATNIGHTSHESPOKETOHIMTHIN
GSHEARDONLYINHELLBUTTHEYWEREIMPOSSIBLETOSEPARATECHAINED
TOGETHERFORLIFEFINALLYTHEBELLTOLLEDHISDOOMHETOOKASUITEO
FROOMSANDHUNGHIMSELFANDHERFROMTHEBALCONYIRONSSOMESTILLB
ELIEVEHEWASFREEDFROMHERBUTIKNEWHERTOOWELLISAYSHEDROVEHI
MTOSUICIDEANDTOOKPOOREDWARDTOHELL
Texts have structure and content 2
Poor Edward. Did you hear the news about Edward? On the back of
his head he had another face. Was it a woman's face or a
young girl? They said to remove it would kill him, so poor
Edward was doomed. The face could laugh and cry. It was his devil
twin. And at night she spoke to him things heard only in hell. But
they were impossible to separate, chained together for life. Finally
the bell tolled his doom. He took a suite of rooms and hung himself
and her from the balcony irons. Some still believe he was freed
from her, but I knew her too well. I say she drove him to suicide,
and took poor Edward to hell.
Texts have structure and content 3
Poor Edward
Did you hear the news about Edward?
On the back of his head he had another face
Was it a woman's face or a young girl?
They said to remove it would kill him
So poor Edward was doomed
The face could laugh and cry
It was his devil twin
And at night she spoke to him
Things heard only in hell
But they were impossible to separate
Chained together for life
Finally the bell tolled his doom
He took a suite of rooms
And hung himself and her from the balcony irons
Some still believe he was freed from her
But I knew her too well
I say she drove him to suicide
And took poor Edward to hell
Texts have structure and content 4
Poor Edward
<line>Did you hear the news about Edward?</line>
<line>On the back of his head he had another face</line>
<line>Was it a woman's face or a young girl?</line>
<line>They said to remove it would kill him</line>
<line>So poor Edward was doomed</line>
<line>The face could laugh and cry</line>
<line>It was his devil twin</line>
<line>And at night she spoke to him</line>
<line>Things heard only in hell</line>
<line>But they were impossible to separate</line>
<line>Chained together for life</line>
<line>Finally the bell tolled his doom</line>
<line>He took a suite of rooms</line>
<line>And hung himself and her from the balcony irons</line>
<line>Some still believe he was freed from her</line>
<line>But I knew her too well</line>
<line>I say she drove him to suicide</line>
<line>And took poor Edward to hell</line>
Texts have structure and content 5
Poor Edward
<stanza>
<line>Did you hear the news about Edward?</line>
<line>On the back of his head he had another face</line>
<line>Was it a woman's face or a young girl?</line>
<line>They said to remove it would kill him</line>
<line>So poor Edward was doomed</line>
</stanza>
<stanza>
<line>The face could laugh and cry</line>
<line>It was his devil twin</line>
<line>And at night she spoke to him</line>
<line>Things heard only in hell</line>
<line>But they were impossible to separate</line>
<line>Chained together for life</line>
</stanza>
<stanza>
<line>Finally the bell tolled his doom</line>
<line>He took a suite of rooms</line>
<line>And hung himself and her from the balcony irons</line>
<line>Some still believe he was freed from her</line>
<line>But I knew her too well</line>
<line>I say she drove him to suicide</line>
<line>And took poor Edward to hell</line>
</stanza>
Texts have structure and content 6
<poem>
<title>Poor Edward</title>
<stanza>
<line>Did you hear the news about Edward?</line>
<line>On the back of his head he had another face</line>
<line>Was it a woman's face or a young girl?</line>
<line>They said to remove it would kill him</line>
<line>So poor Edward was doomed</line>
</stanza>
<stanza>
<line>The face could laugh and cry</line>
<line>It was his devil twin</line>
<line>And at night she spoke to him</line>
<line>Things heard only in hell</line>
<line>But they were impossible to separate</line>
<line>Chained together for life</line>
</stanza>
<stanza>
<line>Finally the bell tolled his doom</line>
<line>He took a suite of rooms</line>
<line>And hung himself and her from the balcony irons</line>
<line>Some still believe he was freed from her</line>
<line>But I knew her too well</line>
<line>I say she drove him to suicide</line>
<line>And took poor Edward to hell</line>
</stanza>
</poem>
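Now that title, stanzas, and lines are explicit, they can be counted and retrieved by a program. The following Python sketch, using the standard library's `xml.etree.ElementTree`, is one way to show this. To keep the example short it holds only the first two stanzas of the poem, and the variable names are assumptions made here.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the encoded poem: first two stanzas only.
POEM = """<poem>
<title>Poor Edward</title>
<stanza>
<line>Did you hear the news about Edward?</line>
<line>On the back of his head he had another face</line>
<line>Was it a woman's face or a young girl?</line>
<line>They said to remove it would kill him</line>
<line>So poor Edward was doomed</line>
</stanza>
<stanza>
<line>The face could laugh and cry</line>
<line>It was his devil twin</line>
<line>And at night she spoke to him</line>
<line>Things heard only in hell</line>
<line>But they were impossible to separate</line>
<line>Chained together for life</line>
</stanza>
</poem>"""

poem = ET.fromstring(POEM)

# Each question below relies on structure that the markup made explicit.
title = poem.findtext("title")
stanzas = poem.findall("stanza")
lines_per_stanza = [len(stanza.findall("line")) for stanza in stanzas]
first_lines = [stanza.find("line").text for stanza in stanzas]

print(title)
print(lines_per_stanza)
print(first_lines)
```

None of these questions can be put to the unbroken capital-letter version of the text, which contains exactly the same characters.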
- Markup should be separated from content
- Markup should be processable → logical & predictable
Markup should be separated from content
Use of tags with open and close delimiters
<tag>content</tag>
Markup should be processable
→ OHCO thesis
Russian Doll or OHCO Thesis
A document is an Ordered Hierarchy of Content Objects
<book>
  <chapter n="1">
    <section n="1">
      <p>...</p>
      <p>...</p>
    </section>
    <section n="2">
      <p>...</p>
      <p>...</p>
    </section>
  </chapter>
  <chapter n="2">
    <!-- more sections and paragraphs -->
  </chapter>
</book>
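Because the content objects are ordered and strictly nested, a program can walk them as a tree. The sketch below is a small Python illustration using the standard library's `xml.etree.ElementTree`; the function name `outline` and the compact layout of the sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The book from the slide, written compactly; the second chapter is left empty.
BOOK = """<book>
  <chapter n="1">
    <section n="1"><p>...</p><p>...</p></section>
    <section n="2"><p>...</p><p>...</p></section>
  </chapter>
  <chapter n="2"/>
</book>"""


def outline(element, depth=0):
    """Return (depth, label) pairs for an element and all its descendants,
    in document order."""
    label = element.tag
    if "n" in element.attrib:
        label += f" n={element.get('n')}"
    rows = [(depth, label)]
    for child in element:  # children come back in document order
        rows.extend(outline(child, depth + 1))
    return rows


rows = outline(ET.fromstring(BOOK))
for depth, label in rows:
    print("  " * depth + label)
```

Every object has exactly one parent, which is why the Russian doll image fits: each doll sits inside one other doll and none overlap.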
Markup should be processable
→ Document Type Definition (DTD) [or schema, see later]
Document Type Definition (DTD)
Different document types → different organization/order & different content objects
- Poetry
- Prose
- Drama
- Letters
- Bibliographies
- Dictionaries
- Lists
- ...
→ The rules of the game
Document Type Definition (DTD)
A DTD specifies the vocabulary and the syntax of a markup language
It defines:
- names for all your elements
- names and default values for their attributes
- rules about how elements can nest
- names for re-usable pieces of data (entities)
- and a few other things
A DTD does not specify anything about what elements "mean"
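The idea of "rules about how elements can nest" can be imitated in a few lines of Python. This is only a hand-rolled stand-in, not a DTD and not a real validator: the table `ALLOWED_CHILDREN` and the function `nesting_errors` are assumptions made for this example, built on the poem vocabulary used earlier. A real DTD also constrains order, number of occurrences, and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary and nesting rules for the poem example.
# Keys are the known element names; values are the children each may contain.
ALLOWED_CHILDREN = {
    "poem": {"title", "stanza"},
    "title": set(),
    "stanza": {"line"},
    "line": set(),
}


def nesting_errors(document: str) -> list:
    """List every element that is unknown or sits inside the wrong parent."""
    errors = []
    root = ET.fromstring(document)
    if root.tag not in ALLOWED_CHILDREN:
        errors.append(f"unknown element <{root.tag}>")
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in ALLOWED_CHILDREN:
                errors.append(f"unknown element <{child.tag}>")
            elif child.tag not in allowed:
                errors.append(f"<{child.tag}> is not allowed inside <{parent.tag}>")
    return errors


good = "<poem><title>Poor Edward</title><stanza><line>a</line></stanza></poem>"
misplaced = "<poem><line>a</line></poem>"
unknown = "<poem><stanza><verse>a</verse></stanza></poem>"

for sample in (good, misplaced, unknown):
    print(nesting_errors(sample) or "no nesting errors")
```

Note that all three samples are well-formed XML. Only the rules of the game distinguish the acceptable one from the other two, and the rules say nothing about what a stanza or a line means.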
- Computers work with representations of text
- Proprietary software merges structure, content, meaning, and layout in one code
→ 2 problems
- Short data life cycle
- Visual information = semantic information
- Solution: standard for text encoding
→ Markup explicitly articulates structure, content, and readings
- Markup should be separated from content
→ Tags
- Markup should be processable
→ OHCO thesis
- A markup language is defined by its Document Type Definition
II. XML: theory & practice
How do we recognize an XML document?
→ The XML declaration
<?xml version="1.0" ?>
This tells you:
- <? ... ?>: the delimiters used for processing instructions
- xml: this is XML
- version="1.0": specifies which version of the spec is used here
→ The root element
<root>Content</root>
This is the largest Russian doll containing all the others
<?xml version="1.0" ?>
<root>Hello world</root>
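This smallest possible document can be handed to a parser as it stands. A minimal Python sketch with the standard library's `xml.etree.ElementTree` follows; the variable names are assumptions made here.

```python
import xml.etree.ElementTree as ET

# The two-line document from the slide: XML declaration plus root element.
DOCUMENT = '<?xml version="1.0" ?>\n<root>Hello world</root>'

root = ET.fromstring(DOCUMENT)

# The parser hands back the outermost doll: its name, its text content,
# and (here) no child elements.
print(root.tag)
print(root.text)
print(len(root))
```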
An XML document contains
- Processing Instructions
- Elements
- Attributes (optional)
- Entity References
- CDATA
Processing Instructions
<?xml version="1.0" ?>
<?xml-stylesheet href="../dtd/xsltslides.xsl" type="text/xsl" ?>
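Processing instructions are addressed to an application rather than being part of the text, and a parser reports them separately from the elements. The sketch below uses Python's standard `xml.dom.minidom` to list them; the `<root>` element is added here as an assumption so that the two lines form a complete document.

```python
from xml.dom import Node, minidom

# The two lines from the slide, followed by a hypothetical root element.
DOCUMENT = (
    '<?xml version="1.0" ?>'
    '<?xml-stylesheet href="../dtd/xsltslides.xsl" type="text/xsl" ?>'
    '<root>Hello world</root>'
)

dom = minidom.parseString(DOCUMENT)

# Collect (target, data) for each processing instruction at document level.
instructions = [
    (node.target, node.data)
    for node in dom.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

for target, data in instructions:
    print(f"target: {target}")
    print(f"data:   {data}")
```

Only the stylesheet instruction is listed. The parser consumes the XML declaration itself and does not report it as a processing-instruction node, although it is written with the same delimiters.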
Elements
e.g.
<title> or </title>
→ Exception: empty elements: <empty />
→ XML is case sensitive in the naming of the elements. <title> is not <TITLE> is not <Title> is not <tItle> etc.
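Case sensitivity and empty elements are easy to check with a parser. The following Python sketch uses the standard library's `xml.etree.ElementTree`; the sample document and its text content are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two differently cased elements and an empty one.
doc = ET.fromstring(
    "<doc>"
    "<title>lower-case element</title>"
    "<TITLE>upper-case element</TITLE>"
    "<empty />"
    "</doc>"
)

# <title> and <TITLE> are two unrelated element names.
lower = doc.findtext("title")
upper = doc.findtext("TITLE")
mixed = doc.find("Title")  # no such element in this document

# An empty element has neither text nor children.
empty = doc.find("empty")

# A start tag and end tag that differ only in case do not match.
try:
    ET.fromstring("<title>Neuromancer</TITLE>")
    mismatched_case_rejected = False
except ET.ParseError:
    mismatched_case_rejected = True

print(lower, "|", upper, "|", mixed)
print(empty.text, len(empty))
print(mismatched_case_rejected)
```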
Attributes
Elements can have one or more optional or required attributes which provide more information on the element.
Syntax:
<element.name attribute.name="attribute.value">
e.g.
<title type="journal">
<name who="Edward" reg="VanhoutteE" >
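Attributes reach a program as name/value pairs attached to the element. A minimal Python sketch with the standard library's `xml.etree.ElementTree` follows. The slide shows only the start tag, so the element content and the end tag are assumptions added here to make the fragment parseable.

```python
import xml.etree.ElementTree as ET

# Start tag from the slide; content and end tag are added for illustration.
name = ET.fromstring('<name who="Edward" reg="VanhoutteE">Edward Vanhoutte</name>')

who = name.get("who")
reg = name.get("reg")
all_attributes = dict(name.attrib)

# An attribute that is absent yields None, or a fallback value if one is given.
missing = name.get("type")
with_default = name.get("type", "unspecified")

print(who, reg)
print(all_attributes)
print(missing, with_default)
```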
Entity References
An entity is a named sequence of characters, predefined for convenience. Typical uses include:
- to represent characters which cannot reliably be typed in
- as a shortcut for boilerplate text
- containers for external (non-XML) data such as graphics
Syntax:
&entity.name;
e.g.
&eacute; = &#233; = é
&uuml; = &#252; = ü
&mayonnaise; = reference to an external recipe (text) for mayonnaise
<figure entity="ascii-full" />
Characters such as & (&amp;) and < (&lt;), which are used by XML itself, and non-ASCII characters can be escaped by the use of entity references: ISO 8859 names, or ISO/IEC 10646 (Unicode) numbers in decimal or hexadecimal.
Note on ISO/IEC 10646
- XML requires the use of ISO/IEC 10646 or Unicode
- Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code intended to be fully compatible with ISO 10646, and an encoding for it.
- Unicode "imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications". (section Unicode & ISO 10646 of the Unicode FAQ <http://www.unicode.org/unicode/faq/>)
- a 32-bit character repertoire including most human writing systems
- encoded as UTF-8 (Web), UTF-16 (Java and Windows), or UTF-32 (Unix)
- Other encodings may be specified at the document level
ISO 8859 references can still be used in the XML document (ISOlat1, ISOlat2, ISOpub, ISOnum) when the entities are mapped to Unicode internally or externally (in the DTD)
Syntax
<!ENTITY eacute "&#233;"><!-- small e, acute accent -->
<!ENTITY TEI "Text Encoding Initiative">
<!ENTITY allc SYSTEM "../graphics/allc.gif" NDATA gif>
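How a parser resolves such references can be tried out directly. The Python sketch below uses the standard library's `xml.etree.ElementTree` and declares two of the entities in an internal DTD subset. The `<p>` element and its sentence are assumptions made for this example, and the external graphic entity is left out because it needs a file on disk.

```python
import xml.etree.ElementTree as ET

# Two internal entities declared in the document's own DTD subset, then used
# next to numeric character references and the predefined &amp; and &lt;.
DOCUMENT = """<!DOCTYPE p [
  <!ENTITY eacute "&#233;">
  <!ENTITY TEI "Text Encoding Initiative">
]>
<p>&TEI;: caf&eacute; = caf&#233; = caf&#xE9; &amp; 1 &lt; 2</p>"""

text = ET.fromstring(DOCUMENT).text
print(text)

# Without a declaration, the named entity is unknown to the parser.
try:
    ET.fromstring("<p>caf&eacute;</p>")
    undeclared_entity_rejected = False
except ET.ParseError:
    undeclared_entity_rejected = True

print(undeclared_entity_rejected)
```

The named, decimal, and hexadecimal forms all arrive in the application as the same character, which is the point of mapping the entity names to Unicode.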
CDATA
Character data (CDATA): allows you to include application code (JavaScript, Perl, BASIC, etc.) in an XML document without having to escape characters. The XML processor passes the content of a CDATA section on exactly as it appears.
Syntax
<![CDATA[This text escapes the XML processor]]>
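The effect of a CDATA section can be compared with the escaped alternative. The following Python sketch uses the standard library's `xml.etree.ElementTree`; the `<script>` element and the JavaScript-like line inside it are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The same code fragment stored twice: once in a CDATA section,
# once with every special character escaped by hand.
with_cdata = ET.fromstring(
    "<script><![CDATA[if (a < b && b < c) { alert('ok'); }]]></script>"
)
with_escapes = ET.fromstring(
    "<script>if (a &lt; b &amp;&amp; b &lt; c) { alert('ok'); }</script>"
)

code = with_cdata.text
print(code)
print(code == with_escapes.text)
```

Both forms deliver identical text to the application; the CDATA section only spares the author the escaping.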