Manuscript and Electronic Text Academy - META07

Text Encoding, Markup and XML

Faculdade de Letras da Universidade de Lisboa

Tuesday 30 January, Session 4

Edward Vanhoutte


I. Tuesday 30 January: Text encoding and markup

Tuesday 30 January

(c) Edward Vanhoutte
  • 9.00-10.30: 1.1. Introductions (Edward Vanhoutte)
    • This Workshop
    • XML: a first successful attempt
  • 11.00-12.30: 1.2. Hypertextual Structures of Complex Writing Processes (Dirk Van Hulle)
  • 13.30-15.00: 1.3. Versions and Compositional Variants (Dirk Van Hulle)
  • 15.30-17.00: 1.4. Text-encoding, Markup, XML and TEI (Edward Vanhoutte)
    • Workflow
    • Why Electronic Texts?
    • Text encoding & markup
    • The Text Encoding Initiative
    • Why XML?
    • Document analysis
    • Structure and contents
    • Markup
    • OHCO
    • DTD

Workflow: mantra in 9 lines

  1. Project definition
  2. Document analysis
  3. Encoding design
  4. Coding
  5. Validating
  6. Functionality: storyboard
  7. Transform, generate, implement
  8. Document
  9. Publish

Why electronic texts? Macro level

2 kinds of electronic texts:
  1. digitally created (born digital)
  2. digitized
→ 2 different answers to the questions:
  1. technological progress
    • create access to material
    • support preservation policy
    • collection acquisition/completion
    • flexible use
    • institutional and strategic advantage (prestige)
    • research
    • education

2 different strategies

  1. Short term thinking
    • document creation/production
    • print
    • mail
    • document lay-out
    • ease of the word processor
    • WYSIWYG interface
    • → .rtf, .doc, .wpd, .pdf, .xls, .mdb
    • → Word, WordPerfect, Adobe Acrobat, Excel, Access
  2. Long term thinking
    • justify investments
    • retain access to material
    • maintain project
    • exchange of data
    • document structure
    • → ISO standards and W3C norms

Why electronic texts? Micro level

Your own research:
  • Research question
  • Specific history and story of your text
  • Your institutional context
  • Your methodological network

BUT Text & Computer: a problematic relationship

Texts cannot be put into computers. Neither can numbers. Computers can contain and operate on patterns of electronic charges, but they cannot contain numbers, which are abstract mathematical objects not electronic charges, nor texts, which are complex, abstract cultural and linguistic objects.
→ Michael Sperberg-McQueen, 'Text in the Electronic Age: Textual Study and Text Encoding with examples from Medieval Texts.' Literary and Linguistic Computing, 6/1 (1991): 34-46. (34)

Computers work with a representation of text


Output

William Gibson é um guru do science fiction e Neuromancer é seu masterpiece do cyberpunk.

Proprietary code (RTF)

William Gibson é um guru do science fiction e Neuromancer é seu masterpiece do cyberpunk.
William Gibson é um guru do {\i science fiction} e {\i Neuromancer} é seu {\i masterpiece} do {\i cyberpunk}.

Binary Code

William Gibson é um guru do science fiction e Neuromancer é seu masterpiece do cyberpunk.
William Gibson é um guru do {\i science fiction} e {\i Neuromancer} é seu {\i masterpiece} do {\i cyberpunk}.
000101110000101000000011111010001111111011111 011001000000111111010100100010111101000100100 001111100100010011111110001010101010101010111 110001011100001010110001001000010000000010101 011110101001100010111000010101100010010000111

2 Main Problems

  • Interchange between systems and platforms causes loss of information
    → Short data life cycle
  • Loss of semantic information in translation to visual information

Example: Synoptic Edition


WP 5.1 → MS Word 2000


RTF → WP 9


Solution

An internationally accepted standard which:
  • is software and platform independent
  • can describe the logical, structural, and semantic elements of a text
→ Markup
William Gibson é um guru do
<language what="English">science fiction</language>
e <title>Neuromancer</title> é seu
<language what="English">masterpiece</language> do
<term>cyberpunk</term>.

XML: eXtensible Markup Language

XML is a metalanguage with which one can create separate markup languages for separate purposes.
  • W3C recommendation
  • The standard on the internet
  • Widely used standard for markup and exchange of information

XML: What it is

  • XML is structured data, represented as strings of text
  • XML is platform-, software-, and system-independent
  • No one 'owns' XML
  • XML empowers the content provider and facilitates data integration, exchange, maintenance and extraction.
  • XML is a simplified subset (an application profile) of SGML
  • So it looks a bit like HTML, except that
    • XML is extensible
    • XML must be well-formed
    • XML can be validated
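
The "must be well-formed" requirement can be illustrated with a minimal sketch, assuming Python's standard-library parser as the XML processor. The helper name is_well_formed and the two sample strings are illustrative assumptions, not part of the workshop material.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML tolerates an unclosed <br>; XML does not.
html_like = "<p>first line<br>second line</p>"
# The same content with an explicit empty element is well-formed.
xml_like = "<p>first line<br/>second line</p>"

print(is_well_formed(html_like))
print(is_well_formed(xml_like))
```

Well-formedness is only the first of the two checks listed above: validation against a DTD or schema is a separate step that this sketch does not perform.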

(S)GML - XML: A bit of history

  • Need for a platform and software independent format for data storage and interchange
  • IBM: Generalized Markup Language (GML)
  • ISO 8879:1986: Standard Generalized Markup Language (SGML): Charles Goldfarb
  • SGML is a meta-language: a language for the description of markup languages
  • Markup languages are commonly named after their
    • document type definition (DocBook)
    • aim (LinuxDoc)
    • application (TEI)
    • the standard they implement (J2003: automobile parts)
  • The DTD is the core of an SGML system
  • SGML is the solution for the industry: expensive and complicated but at the same time very powerful

What is a DTD (for now)?

Documents (texts) exist in different types:
  • Poetry
  • Prose
  • Drama
  • Letters
  • Bibliographies
  • Dictionaries
  • Lists
  • ...
→ Different organisation/order, different structures, different text ontology, different semantics
→ DTD provides the vocabulary and the syntax for a specific markup language

SGML - ISO 8879:1986

  • Comprehensiveness.
  • Simplicity.
  • Documents should be processable by software of moderate complexity.
  • The standard should not depend on any particular character set or text-entry device.
  • The standard should not be geared to any particular analytic program or printing system.
  • The standard should describe text in editable form.
  • The standard should allow the interchange of encoded texts across communication networks.
Standard reference work: Charles F. Goldfarb. (1990). The SGML Handbook. Oxford: Clarendon Press.

The Text Encoding Initiative (TEI)

11-12 November 1987: Vassar College, Poughkeepsie (NY)
→ 32 humanities scholars
Principles:
  • Platform-independent
  • Software-independent
  • Endurability
  • Re-usability
  • Accessibility
  • Language-independent
  • For all of the Humanities disciplines
→ SGML ISO 8879:1986
→ Sperberg-McQueen, C.M. and Burnard, L. (eds.) (2002). TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML Version (Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium.)
→ Sperberg-McQueen, C. M. and Burnard, L. (eds.) (2005). TEI P5. Guidelines for Electronic Text Encoding and Interchange. Revised and re-edited. (Oxford, Providence, Charlottesville, Bergen: Text Encoding Initiative Consortium.) http://www.tei-c.org/release/doc/tei-p5-doc/html/

TEI Short History

  • 1987: Vassar College, Poughkeepsie
  • 1990: P1 : SGML
  • 1992: P2 : SGML
  • 1994: P3 : SGML : 600 + elements
  • 1995: TeiLite : SGML : 131 elements
  • 1999: P3rev : SGML
  • 2000: TEI Consortium
  • 2001: P4 : XML comp
  • 2001: TeixLite : XML
  • 2005-2007: P5 : XML

SGML - XML


So: Why would you want to learn about XML? The short answer:

→ XML is not the end of the world, and won't solve all your problems, but:
  • It's a good approximation
  • It works (fairly well)
  • It's widely supported
  • It's a W3C recommendation

Why would you want to learn about XML? The shortest answer:

→ It's fun.

Why XML? The longer answer:

  • Internationally supported (de facto) standard:
    • W3C recommendation
    • wide support
  • Software and platform independent
  • Describes the logical, structural, as well as semantic elements of a text
  • Relatively easy to master
  • Self declarative: for the machine and for human beings
  • Low creation cost thanks to open source support
  • Design of XML is formal and concise (Extended Backus-Naur Form, EBNF)
  • It is relatively easy to write programs which process XML documents.
  • Internationalisation: Unicode
  • Flexible output
  • Can be used with existing web protocols (HTTP) and mechanisms (URL)
  • Related standards in the XML family empower XML: XSLT, XSL-FO, XPath, XQuery, XLink, etc.

Why would you want to learn about XML?

→ XML will not live forever
  • XML will develop into ... ?
  • Other standards
BUT
  • Migration will be supported
  • Migration will create more jobs

Why would you want to learn about TEI?

  • Best guarantee for intellectual sustainability of your research efforts.
  • Supported by a community
  • No 'take it or leave it' approach: extensible, modifiable
  • It's free

Workflow: mantra in 9 lines

  1. Project definition
  2. Document analysis
  3. Encoding design
  4. Coding
  5. Validating
  6. Functionality: storyboard
  7. Transform, generate, implement
  8. Document
  9. Publish

What is markup really?

Markup is a way to provide instructions about how a text should look.
This implies:
  • stable agreements on ontology and semantics
  • a community who knows these agreements

Example of markup

e.g. Abraham Lincoln's "Gettysburg Address"

Text encoding and markup

  • Texts are more than simply sequences of glyphs
    → They have structure and content and they also have multiple readings

Document analysis: Exercise


Workflow: mantra in 9 lines

  1. Project definition
  2. Document analysis
  3. Encoding design
  4. Encoding
  5. Validating
  6. Functionality: storyboard
  7. Transform, generate, implement
  8. Document
  9. Publish

Text encoding and markup

  • Texts are more than simply sequences of glyphs
    → They have structure and content and they also have multiple readings
  • Text encoding or markup provides a means of making such structure, content, and readings explicit
→ Only what is explicitly articulated can be digitally processed

Texts have structure and content 1

POOREDWARDDIDYOUHEARTHENEWSABOUTEDWARD?ONTHEBACKOFHISHE
ADHEHADANOTHERFACEWASITAWOMAN'SFACEORAYOUNGGIRLTHEYSAID
TOREMOVEITWOULDKILLHIMSOPOOREDWARDWASDOOMEDTHEFACECOULD
LAUGHANDCRYITWASHISDEVILTWINANDATNIGHTSHESPOKETOHIMTHIN
GSHEARDONLYINHELLBUTTHEYWEREIMPOSSIBLETOSEPARATECHAINED
TOGETHERFORLIFEFINALLYTHEBELLTOLLEDHISDOOMHETOOKASUITEO
FROOMSANDHUNGHIMSELFANDHERFROMTHEBALCONYIRONSSOMESTILLB
ELIEVEHEWASFREEDFROMHERBUTIKNEWHERTOOWELLISAYSHEDROVEHI
MTOSUICIDEANDTOOKPOOREDWARDTOHELL

Texts have structure and content 2

Poor Edward. Did you hear the news about Edward? On the back of
his head he had another face. Was it a woman's face or a
young girl? They said to remove it would kill him, So poor
Edward was doomed. The face could laugh and cry. It was his devil
twin. And at night she spoke to him things heard only in hell. But
they were impossible to separate, Chained together for life. Finally
the bell tolled his doom. He took a suite of rooms and hung himself
and her from the balcony irons. Some still believe he was freed
from her, but I knew her too well. I say she drove him to suicide,
and took poor Edward to hell.

Texts have structure and content 3

Poor Edward

Did you hear the news about Edward?
On the back of his head he had another face
Was it a woman's face or a young girl?
They said to remove it would kill him
So poor Edward was doomed

The face could laugh and cry
It was his devil twin
And at night she spoke to him
Things heard only in hell
But they were impossible to separate
Chained together for life

Finally the bell tolled his doom
He took a suite of rooms
And hung himself and her from the balcony irons
Some still believe he was freed from her
But I knew her too well
I say she drove him to suicide
And took poor Edward to hell

Texts have structure and content 4

Poor Edward
<line>Did you hear the news about Edward?</line>
<line>On the back of his head he had another face</line>
<line>Was it a woman's face or a young girl?</line>
<line>They said to remove it would kill him</line>
<line>So poor Edward was doomed</line>
<line>The face could laugh and cry</line>
<line>It was his devil twin</line>
<line>And at night she spoke to him</line>
<line>Things heard only in hell</line>
<line>But they were impossible to separate</line>
<line>Chained together for life</line>
<line>Finally the bell tolled his doom</line>
<line>He took a suite of rooms</line>
<line>And hung himself and her from the balcony irons</line>
<line>Some still believe he was freed from her</line>
<line>But I knew her too well</line>
<line>I say she drove him to suicide</line>
<line>And took poor Edward to hell</line>

Texts have structure and content 5

Poor Edward
<stanza>
<line>Did you hear the news about Edward?</line>
<line>On the back of his head he had another face</line>
<line>Was it a woman's face or a young girl?</line>
<line>They said to remove it would kill him</line>
<line>So poor Edward was doomed</line>
</stanza>
<stanza>
<line>The face could laugh and cry</line>
<line>It was his devil twin</line>
<line>And at night she spoke to him</line>
<line>Things heard only in hell</line>
<line>But they were impossible to separate</line>
<line>Chained together for life</line>
</stanza>
<stanza>
<line>Finally the bell tolled his doom</line>
<line>He took a suite of rooms</line>
<line>And hung himself and her from the balcony irons</line>
<line>Some still believe he was freed from her</line>
<line>But I knew her too well</line>
<line>I say she drove him to suicide</line>
<line>And took poor Edward to hell</line>
</stanza>

Texts have structure and content 6

<poem>
<title>Poor Edward</title>
<stanza>
<line>Did you hear the news about Edward?</line>
<line>On the back of his head he had another face</line>
<line>Was it a woman's face or a young girl?</line>
<line>They said to remove it would kill him</line>
<line>So poor Edward was doomed</line>
</stanza>
<stanza>
<line>The face could laugh and cry</line>
<line>It was his devil twin</line>
<line>And at night she spoke to him</line>
<line>Things heard only in hell</line>
<line>But they were impossible to separate</line>
<line>Chained together for life</line>
</stanza>
<stanza>
<line>Finally the bell tolled his doom</line>
<line>He took a suite of rooms</line>
<line>And hung himself and her from the balcony irons</line>
<line>Some still believe he was freed from her</line>
<line>But I knew her too well</line>
<line>I say she drove him to suicide</line>
<line>And took poor Edward to hell</line>
</stanza>
</poem>
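
Once the structure is explicit, software can act on it. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree as the processor and using only the first two stanzas of the encoded poem; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the encoded poem above (first two stanzas only).
POEM = """<poem>
<title>Poor Edward</title>
<stanza>
<line>Did you hear the news about Edward?</line>
<line>On the back of his head he had another face</line>
<line>Was it a woman's face or a young girl?</line>
<line>They said to remove it would kill him</line>
<line>So poor Edward was doomed</line>
</stanza>
<stanza>
<line>The face could laugh and cry</line>
<line>It was his devil twin</line>
<line>And at night she spoke to him</line>
<line>Things heard only in hell</line>
<line>But they were impossible to separate</line>
<line>Chained together for life</line>
</stanza>
</poem>"""

poem = ET.fromstring(POEM)

# Each question below can only be answered because the markup names the units.
title = poem.findtext("title")
stanzas = poem.findall("stanza")
line_counts = [len(stanza.findall("line")) for stanza in stanzas]
first_lines = [stanza.find("line").text for stanza in stanzas]

print(title, line_counts)
print(first_lines)
```

None of these queries is possible on the unbroken capital-letter version of the text: the stanza and line boundaries exist for the machine only once they are encoded.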

Schematic


2 Operable Conditions

  • Markup should be separated from content
  • Markup should be processable → logical & predictable

Markup should be separated from content

Use of tags with open and close delimiters
<tag>content</tag>

Markup should be processable

  • Logical
  • Predictable
→ OHCO thesis

Russian Doll or OHCO Thesis

A document is an Ordered Hierarchy of Content Objects

Markup Model for a Book


<book>
  <chapter n="1">
    <section n="1">
      <p>...</p>
      <p>...</p>
    </section>
    <section n="2">
      <p>...</p>
      <p>...</p>
    </section>
  </chapter>
  <chapter n="2">
    <!-- more sections and paragraphs -->
  </chapter>
</book>
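
The Russian-doll nesting of the book model can be made visible by walking the tree. This is a minimal sketch, assuming Python's standard-library parser; the function name outline and the compressed copy of the model (chapter 2 left empty) are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Compact copy of the book model above; chapter 2 is left empty here.
BOOK = """<book>
  <chapter n="1">
    <section n="1"><p>...</p><p>...</p></section>
    <section n="2"><p>...</p><p>...</p></section>
  </chapter>
  <chapter n="2"/>
</book>"""

def outline(element, depth=0):
    """Return one indented label per element, in document order."""
    label = element.tag
    if "n" in element.attrib:
        label += f' n="{element.attrib["n"]}"'
    lines = ["  " * depth + label]
    for child in element:
        # Every child sits entirely inside its parent: one doll inside another.
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(BOOK))
print("\n".join(tree_lines))
```

Because each element has exactly one parent and the children are ordered, a simple recursive function is enough to process the whole document, which is what makes the hierarchy logical and predictable.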

Markup should be processable

  • Logical
  • Predictable
→ Document Type Definition (DTD) [or schema, see later]

Document Type Definition (DTD)

Different document types → different organization/order & different content objects
  • Poetry
  • Prose
  • Drama
  • Letters
  • Bibliographies
  • Dictionaries
  • Lists
  • ...
→ The rules of the game

Document Type Definition (DTD)

A DTD specifies the vocabulary and the syntax of a markup language
It defines:
  • names for all your elements
  • names and default values for their attributes
  • rules about how elements can nest
  • names for re-usable pieces of data (entities)
  • and a few other things
A DTD does not specify anything about what elements "mean"
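
The idea of "rules about how elements can nest" can be sketched with a toy checker. This is not a DTD validator: Python's standard library does not validate against DTDs, so the dictionary ALLOWED_CHILDREN below is a hypothetical stand-in for the vocabulary and nesting rules a DTD for the poem markup might declare. It ignores order, repetition, attributes, and entities, all of which a real DTD also constrains.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: each known element and the children it may contain.
ALLOWED_CHILDREN = {
    "poem": {"title", "stanza"},
    "stanza": {"line"},
    "title": set(),
    "line": set(),
}

def violations(element):
    """Return messages for unknown elements and disallowed nestings."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        problems.extend(violations(child))
    return problems

good = ET.fromstring(
    "<poem><title>Poor Edward</title>"
    "<stanza><line>Did you hear the news about Edward?</line></stanza></poem>"
)
bad = ET.fromstring("<poem><line>So poor Edward was doomed</line></poem>")

print(violations(good))
print(violations(bad))
```

Both documents are well-formed; only the first follows the rules of the game. Note that the checker, like a DTD, says nothing about what a stanza or a line means.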

Summary

  • Computers work with representations of text
  • Proprietary software merges structure, content, meaning, and layout in one code
    → 2 problems
    • Short data life cycle
    • Semantic information is reduced to visual information
  • Solution: standard for text encoding
    → Markup explicitly articulates structure, content, and readings
  • Markup should be separated from content
    → Tags
  • Markup should be processable
    → OHCO thesis
  • A markup language is defined by its Document Type Definition

II. XML: theory & practice

How do we recognize an XML document?

→ The XML declaration
<?xml version="1.0" ?>
This tells you:
  • <? ... ?>: the delimiters of a processing instruction (strictly speaking, the XML declaration only borrows this syntax)
  • xml: this is XML
  • version="1.0": specifies which version of the spec is used here
→ The root element
<root>Content</root>
This is the largest Russian doll containing all the others

Minimal XML document

<?xml version="1.0" ?>
<root>Hello world</root>

XML jargon

An XML document contains
  • Processing Instructions
  • Elements
  • Attributes (optional)
  • Entity References
  • CDATA
<?xml version="1.0" ?>
<?xml-stylesheet href="../dtd/xsltslides.xsl" type="text/xsl" ?>

XML jargon: Elements

Elements are delimited by a start tag and an end tag, e.g.
<title> or </title>
→ Exception: empty elements: <empty />
→ XML is case sensitive in the naming of the elements. <title> is not <TITLE> is not <Title> is not <tItle> etc.
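
A minimal sketch of these points, assuming Python's standard-library parser; the helper name parses and the sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def parses(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The minimal document: XML declaration plus a root element.
root = ET.fromstring('<?xml version="1.0" ?>\n<root>Hello world</root>')
print(root.tag, root.text)

# Element names are case sensitive: </TITLE> does not close <title>.
matching = parses("<title>Neuromancer</title>")
mismatched = parses("<title>Neuromancer</TITLE>")

# An empty element needs no separate end tag and has no content.
empty_child = ET.fromstring("<root><empty /></root>").find("empty")
print(matching, mismatched, empty_child.text)
```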

XML jargon: Attributes

Elements can have one or more optional or required attributes which provide more information on the element.
Syntax:
<element.name  attribute.name="attribute.value">
e.g.
<title  type="journal">
<name  who="Edward"  reg="VanhoutteE" >
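
How a processor exposes attributes can be sketched as follows, assuming Python's standard-library parser. The element content "Edward Vanhoutte" and the end tag are added here only to make the slide's start tag a complete element; the fallback value "unspecified" is likewise an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# The start tag from the example above, completed with content and an end tag.
name = ET.fromstring('<name who="Edward" reg="VanhoutteE">Edward Vanhoutte</name>')

# Attributes arrive as name/value pairs attached to the element.
attrs = dict(name.attrib)
who = name.get("who")

# An attribute that is absent simply yields the supplied fallback.
missing = name.get("type", "unspecified")

print(attrs, who, missing)
```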

XML jargon: Entity References

An entity is a named sequence of characters, predefined for convenience. Typical uses include:
  • to represent characters which cannot reliably be typed in
  • as a shortcut for boiler plate text
  • containers for external (non-XML) data such as graphics
Syntax:
&entity.name;
e.g.
&eacute; = &#233; = é
&uuml; = &#252; = ü
&mayonnaise; = reference to an external recipe (text) for mayonnaise
<figure  entity="ascii-full" />

Character entities

Characters that XML itself uses, such as & (&amp;) and < (&lt;), and non-ASCII characters can be escaped with entity references or with numeric character references (decimal or hexadecimal). The relevant character standards are ISO 8859, ISO/IEC 10646, and Unicode.
Note on ISO/IEC 10646
  • XML requires the use of ISO/IEC 10646 or Unicode
    • Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code intended to be fully compatible with ISO 10646, and an encoding for it.
    • Unicode "imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications". (section Unicode & ISO 10646 of the Unicode FAQ <http://www.unicode.org/unicode/faq/>)
    • a 32-bit character repertoire including most human writing systems
    • encoded as UTF-8 (Web), UTF-16 (Java and Windows), or UTF-32 (Unix).
  • Other encodings may be specified at the document level
ISO 8859 references can still be used in the XML document (Isolat1, Isolat2, Isopub, Isonum) when the entities are internally or externally (in the DTD) mapped on Unicode
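
The relation between a character, its code point, its numeric character references, and its encoded bytes can be sketched as follows. This is a minimal illustration in Python; the variable names and the element name w are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

char = "é"

# One abstract character, one code point in Unicode / ISO/IEC 10646.
code_point = ord(char)

# The same code point written as decimal and hexadecimal character references.
dec_ref = f"&#{code_point};"
hex_ref = f"&#x{code_point:X};"

# The number of bytes depends on the encoding chosen (no byte order mark here).
sizes = {enc: len(char.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# An XML processor resolves both reference forms to the same character.
resolved = ET.fromstring(f"<w>{dec_ref}{hex_ref}</w>").text

print(code_point, dec_ref, hex_ref, sizes, resolved)
```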

ASCII


ISO 8859-1 (IsoLat1)


ISO 10646 - Unicode


Entity Declaration

Syntax
<!ENTITY eacute  "&#233;" ><!-- small e, acute accent -->
<!ENTITY TEI  "Text Encoding Initiative">
<!ENTITY allc SYSTEM "../graphics/allc.gif" NDATA gif>
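
How declared entities are resolved can be sketched with an internal DTD subset, assuming Python's standard-library parser, which expands internal entities such as the first two declarations above. The wrapper element p and the sample sentence are illustrative assumptions; the external graphic entity is left out because it refers to a file that is not available here.

```python
import xml.etree.ElementTree as ET

# Two of the declarations above, placed in an internal DTD subset.
DOC = """<!DOCTYPE p [
  <!ENTITY eacute "&#233;">
  <!ENTITY TEI "Text Encoding Initiative">
]>
<p>&TEI;: &eacute; &amp; &lt;</p>"""

# Declared entities and the predefined ones (&amp;, &lt;) are all expanded.
expanded = ET.fromstring(DOC).text
print(expanded)

def parses(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A reference to an entity that was never declared is an error.
undeclared_ok = parses("<p>&mayonnaise;</p>")
print(undeclared_ok)
```

Entity expansion is a processing step, so untrusted documents with entity declarations deserve caution; the sketch assumes a trusted input.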

XML jargon: CDATA

A CDATA (character data) section lets you include application code (JavaScript, Perl, BASIC, etc.) in an XML document without having to escape characters such as < and &. The XML processor treats the content of a CDATA section as literal text, exactly as it appears, and does not interpret it as markup.
Syntax
<![CDATA[This text escapes the XML processor]]>
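
The effect of a CDATA section can be sketched by comparing it with the escaped equivalent, assuming Python's standard-library parser. The element name script and the code fragment inside it are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, < and & are plain characters and need no escaping.
with_cdata = ET.fromstring(
    "<script><![CDATA[if (a < b && b < c) { swap(a, c); }]]></script>"
)
raw = with_cdata.text

# Without CDATA, the same content must be written with entity references.
escaped = ET.fromstring(
    "<script>if (a &lt; b &amp;&amp; b &lt; c) { swap(a, c); }</script>"
).text

print(raw)
print(raw == escaped)
```

Both forms give the application the same character data; CDATA only changes how the author has to type it.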
