B33080 Humanities Computing: Electronic Text

University of Antwerp, Campus Drie Eiken

Second Term 2006

Edward Vanhoutte

edward.vanhoutte@kantl.be

Week 5: XML theory and practice: valid XML - parsing/validating - TeixLite.

Monday 13 March


I. Monday 13 March Overview

Monday 13 March: Overview

  1. Revision of week 4
  2. XML: theory & practice
    • DTD
    • Valid XML
    • Validating XML

Goals of this lecture

After this lecture, you should be able to
  • create a well-formed XML document
  • create a valid XML document
  • parse an XML document for validation
  • read and interpret an XML document
  • create and read small DTDs
  • use and understand TeixLite
  • create TeixLite documents
  • parse TeixLite documents for validation

II. Monday 13 March Revision of week 4.

1. Revision of week 4

  1. Revision of week 2
  2. Standard Generalized Markup Language (SGML)
  3. Text Encoding Initiative (TEI)
  4. eXtensible Markup Language (XML)
    • What it is
    • SGML - XML
    • Input & output
    • Production Process
  5. XML: theory & practice
    • How do we recognize an XML document?
    • Minimal XML document
    • Jargon

How do we recognize an XML document?

→ The XML declaration
<?xml version="1.0" ?>
This tells you:
  • <? and ?>: the delimiters of a Processing Instruction
  • xml: this is XML
  • version="1.0": specifies which version of the spec is used here

→ The root element
<root>Content</root>
This is the largest Russian doll containing all the others

Minimal XML document

<?xml version="1.0" ?>
<root>Hello world</root>
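
A minimal sketch of how a program reads this document. It assumes Python and its standard-library xml.etree.ElementTree module, neither of which is part of the course software; the function name describe is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The minimal XML document from the slide: a declaration plus one root element.
MINIMAL_DOC = '<?xml version="1.0" ?>\n<root>Hello world</root>'


def describe(document):
    """Parse the document and return (root element name, text content)."""
    root = ET.fromstring(document)
    return root.tag, root.text or ""


if __name__ == "__main__":
    name, text = describe(MINIMAL_DOC)
    print(name, "->", text)
```

The parser reports the root element and its character data; the XML declaration itself is consumed by the parser and does not show up as content.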

XML jargon

An XML document contains
  • Processing Instructions
  • Elements
  • Attributes (optional)
  • Entity References
  • CDATA

XML jargon: Processing Instructions

<?xml version="1.0" ?>
<?xml-stylesheet href="../dtd/xsltslides.xsl" type="text/xsl" ?>

XML jargon: Elements

e.g.
<title> or </title>
→ Exception: empty elements: <empty />
→ XML is case sensitive in the naming of the elements. <title> is not <TITLE> is not <Title> is not <tItle> etc.

XML jargon: Attributes

Elements can have one or more optional or required attributes which provide more information on the element.
Syntax:
<element.name  attribute.name="attribute.value">
e.g.
<title  type="journal">
<name  who="Edward"  reg="VanhoutteE" >

XML jargon: Entity References

An entity is a named sequence of characters, predefined for convenience. Typical uses include:
  • to represent characters which cannot reliably be typed in
  • as a shortcut for boilerplate text
  • containers for external (non-XML) data such as graphics
Syntax:
&entity.name;
e.g.
&eacute; = &#233; = é
&uuml; = &#252; = ü
&mayonnaise; = reference to an external recipe (text) for mayonnaise
<figure  entity="ascii-full" />

Character entities

Characters that XML reserves for markup, such as & (&amp;) and < (&lt;), and non-ASCII characters can be escaped with entity references from standard entity sets or with numeric character references. The relevant character sets are ISO 8859, ISO/IEC 10646 (referenced by decimal or hexadecimal code point), and Unicode.
Note on ISO/IEC 10646
  • XML requires the use of ISO/IEC 10646 or Unicode
    • Unicode is a standard, maintained by the Unicode Consortium, that defines a character repertoire and character codes intended to be fully compatible with ISO 10646, together with encodings for it.
    • Unicode "imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications". (section Unicode & ISO 10646 of the Unicode FAQ <http://www.unicode.org/unicode/faq/>)
    • a 32 bit character repertoire including most human writing systems
    • encoded as UTF-8 (Web), UTF-16 (Java and Windows), or UTF-32 (Unix).
  • Other encodings may be specified at the document level
ISO 8859 references can still be used in the XML document (Isolat1, Isolat2, Isopub, Isonum) when the entities are internally or externally (in the DTD) mapped to Unicode.
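
The escaping rules above can be tried out with a parser. This is a sketch only, assuming Python's standard-library xml.etree.ElementTree (not part of the course software); SAMPLE and resolved_text are hypothetical names. It shows that a decimal reference (&#233;) and a hexadecimal reference (&#xE9;) name the same character, and that &amp; and &lt; come back as plain characters.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal character references both name ISO/IEC 10646 code points.
SAMPLE = "<p>caf&#233; = caf&#xE9;, AT&amp;T, 1 &lt; 2</p>"


def resolved_text(document):
    """Return the root element's text after the parser has resolved all references."""
    return ET.fromstring(document).text


if __name__ == "__main__":
    print(resolved_text(SAMPLE))
```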

ASCII

ISO 8859-1 (IsoLat1)

ISO 10646 - Unicode

Entity Declaration

Syntax
<!ENTITY eacute  "&#233;" ><!-- small e, acute accent -->
<!ENTITY TEI  "Text Encoding Initiative">
<!ENTITY allc SYSTEM "graphics/allc.gif" NDATA gif>
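
A small sketch of what a declaration buys you, assuming Python's standard-library xml.etree.ElementTree (an assumption, not course software). The entity TEI is declared in the internal subset of a DOCTYPE and gets expanded; the reference &mayonnaise; has no declaration, so the document is rejected. The names DECLARED, UNDECLARED and expand are made up for the example.

```python
import xml.etree.ElementTree as ET

# An internal general entity, declared inside the DOCTYPE's internal subset.
DECLARED = (
    "<!DOCTYPE p [\n"
    '  <!ENTITY TEI "Text Encoding Initiative">\n'
    "]>\n"
    "<p>The &TEI; publishes guidelines.</p>"
)

# The same kind of reference, but without any declaration.
UNDECLARED = "<p>Recipe: &mayonnaise;</p>"


def expand(document):
    """Return the root text with entities expanded, or None if parsing fails."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    print(expand(DECLARED))
    print(expand(UNDECLARED))
```

External and unparsed entities (SYSTEM ... NDATA) are not covered by this sketch.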

XML jargon: CDATA

Character data: allows you to include application code (JavaScript, Perl, BASIC, etc.) in an XML document without having to worry about escaping characters. The XML processor passes the content of a CDATA section on exactly as it appears, without interpreting any markup inside it.
Syntax
<![CDATA[This text escapes the XML processor]]>
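
To see the effect, the sketch below parses the same line of script code twice, once inside a CDATA section and once without. It assumes Python's standard-library xml.etree.ElementTree; SCRIPT and content_or_error are hypothetical names chosen for the illustration.

```python
import xml.etree.ElementTree as ET

# Code containing characters that XML reserves for markup.
SCRIPT = "if (a < b && b > 0) { run(); }"

WITH_CDATA = "<script><![CDATA[" + SCRIPT + "]]></script>"
WITHOUT_CDATA = "<script>" + SCRIPT + "</script>"


def content_or_error(document):
    """Return the element's text, or a message if the document is not well-formed."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError as err:
        return "not well-formed: " + str(err)


if __name__ == "__main__":
    print(content_or_error(WITH_CDATA))
    print(content_or_error(WITHOUT_CDATA))
```

Without the CDATA section the unescaped < and & make the document ill-formed; inside it they are ordinary characters.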

Well Formed XML

  1. XML is case sensitive
  2. At least one element
  3. There is always a root element
  4. All logical and physical structures nest properly
  5. Correspondence of element names in start- and end-tags
  6. Attribute names only appear once in a start-tag
  7. Attribute values are quoted
  8. Attribute values don't refer to external entities
  9. Entities are declared
  10. No entity reference contains the name of a non-parsed entity

Well Formed XML: Example

<?xml version="1.0" ?>
 <greeting>
    <salutation>hello</salutation>
    <target>world</target>
 </greeting>

Spot the mistakes

  1. <greeting>Hello world!</greeting>
  2. <greeting>Hello world!<Greeting>
  3. <greeting>Hello world!<greeting />

  1. <greeting><salutation>hello<target></salutation>world</target></greeting>
  2. <greeting><salutation>hello</salutation><target>world</target></greeting>
  3. <greeting><Salutation>hello</salutation><target>world</Target><greeting>
  4. <greeting><salutation>hello</salutation<target>world</target></greeting>

  1. <salutation volume=loud>hello!</salutation>
  2. <salutation volume="loud" volume="10">hello!</salutation>
  3. <salutation volume="loud">hello!</salutation>

  1. <greeting type="non-verbal">
  2. <greeting type="non-verbal" />
  3. <Greeting type="non-verbal" />
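
Once you have made your own guesses, a parser can check them. The sketch below assumes Python's standard-library xml.etree.ElementTree (not the course's NSGMLS setup) and tests a selection of the snippets above for well-formedness only; is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A selection of the snippets from the slide, each treated as a complete document.
SNIPPETS = [
    '<greeting>Hello world!</greeting>',
    '<greeting>Hello world!<Greeting>',
    '<greeting>Hello world!<greeting />',
    '<greeting><salutation>hello<target></salutation>world</target></greeting>',
    '<salutation volume=loud>hello!</salutation>',
    '<salutation volume="loud" volume="10">hello!</salutation>',
    '<greeting type="non-verbal">',
    '<Greeting type="non-verbal" />',
]


def is_well_formed(snippet):
    """Return True if the parser accepts the snippet, False otherwise."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for snippet in SNIPPETS:
        verdict = "well-formed" if is_well_formed(snippet) else "NOT well-formed"
        print(verdict + ": " + snippet)
```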

Well Formed XML: Exercise

Subject: my dog
Date: Mon, 7 Feb 2004 13:57:24 +0100
From: "thedoglady@wanteddogs.org"
To: "edward.vanhoutte@kantl.be"
Sir,

Last week I lost my dog and don't know where she is now.

Normally I keep it in a bag under my arm, but now she's gone.

Could you please check all your bags and look whether you can find my dog?

Very many thanks in advance


Lady D. Og

Department of lost dogs
Ministry of dogs and bags, Dogtown
thedoglady@wanteddogs.org

Proposal

<?xml version="1.0" encoding="UTF-8"?>
<email>
 <header>
   <subj>my dog</subj>
   <date>Mon, 7 Feb 2004 13:57:24 +0100</date>
   <from>"thedoglady@wanteddogs.org" &lt;thedoglady@wanteddogs.org&gt;</from>
   <to>"edward.vanhoutte@kantl.be" &lt;edward.vanhoutte@kantl.be&gt;</to>
 </header>
 <body>
   <open>Sir,</open>
   <p>Last week I lost my dog and don't know where she is now.</p>
   <p>Normally I keep it in a bag under my arm, but now she's gone.</p>
   <p>Could you please check all your bags and look whether you can find my dog?</p>
   <close>Very many thanks in advance</close>
   <sign>
     <name>Lady D. Og</name>
     <address>
       <addrLine>Department of lost dogs</addrLine>
       <addrLine>Ministry of dogs and bags, Dogtown</addrLine>
       <addrLine>thedoglady@wanteddogs.org</addrLine>
     </address>
   </sign>
 </body>
</email>

Valid XML

A valid XML document is well-formed, contains a Document Type Declaration, and conforms to the DTD that the declaration refers to
<!DOCTYPE TEI.2 PUBLIC "-//TEI//DTD TEI Lite 1.0//EN"
    "../dtd/xmllite.dtd"
    [
    ]>

Document Type Declaration

Contains the reference to a Document Type Definition (DTD)
A DTD specifies the vocabulary and the syntax of a markup language
It defines:
  • names for all your elements
  • names and default values for their attributes
  • rules about how elements can nest
  • names for re-usable pieces of data (entities)
  • and a few other things
A DTD does not specify anything about what elements "mean"

Defining an element in a DTD

<!ELEMENT  name  contentModel>
  • name is the name of the element (GI)
  • contentModel defines valid content for the element:
    • #PCDATA
    • EMPTY
    • other elements
    • mixed content combining PCDATA and other elements

Document Type Declaration

Contains the reference to a Document Type Definition (DTD)
→ The formal specification for the structure of an XML document
<?xml version="1.0" ?>
 <greeting>
    <salutation>hello</salutation>
    <target>world</target>
 </greeting>

DTD:
<!ELEMENT    greeting      (salutation, target)>
<!ELEMENT    salutation    (#PCDATA)>
<!ELEMENT    target        (#PCDATA)>

DTD toolkit

Examples

  1. <!ELEMENT a (b+)>
  2. <!ELEMENT b EMPTY>
  3. <!ELEMENT c (#PCDATA)>
  4. <!ELEMENT a (b,c)>
  5. <!ELEMENT a (b,c)+>
  6. <!ELEMENT a (b,c+)>
  7. <!ELEMENT a (b,c)*>
  8. <!ELEMENT a (b,c*)>
  9. <!ELEMENT a (b?,c)>
  10. <!ELEMENT a (b?,c)+>
  11. <!ELEMENT a (b?,c)*>
  12. <!ELEMENT a (b|c)>
  13. <!ELEMENT a (b|c)+>
  14. <!ELEMENT a (b|c+)>
  15. <!ELEMENT a (b|c)*>
  16. <!ELEMENT a (b|c*)>
  17. <!ELEMENT a (#PCDATA|b|c)*>
  18. <!ELEMENT a (b, (c|d))*>
  19. <!ELEMENT a (b?, (c|d)+)>
  20. <!ELEMENT a (b?, (c+|d+))>
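
Content models like these behave much like regular expressions over the sequence of child elements. The sketch below makes that reading explicit. It is an illustration only, written in Python with the standard-library re module, and it handles element content (names, the connectors , and |, the occurrence indicators ? * +, and parentheses) but not #PCDATA or EMPTY. The function names model_to_regex and allows are made up.

```python
import re

# A token is either an element name or one of the content-model symbols.
TOKEN = re.compile(r"[A-Za-z_][\w.-]*|[(),|?*+]")


def model_to_regex(model):
    """Translate an element-content model such as '(b?,c)+' into a compiled regex."""
    parts = []
    for token in TOKEN.findall(model):
        if token == "(":
            parts.append("(?:")        # grouping only, no capture needed
        elif token == ",":
            continue                   # a sequence is plain concatenation
        elif token in ")|?*+":
            parts.append(token)        # same meaning in DTDs and in regexes
        else:
            # An element name; the trailing space separates it from the next name.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def allows(model, children):
    """Return True if the list of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None


if __name__ == "__main__":
    print(allows("(b?,c)+", ["c", "b", "c"]))
    print(allows("(b|c+)", ["b", "c"]))
```

For example, allows("(b,c+)", ["b", "c", "c"]) asks whether an element a with the children b, c, c satisfies example 6. A real validating parser does considerably more than this, so treat the sketch as a reading aid for the notation.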

Real Example: Anthology DTD

<!ELEMENT   anthology    (poem,poem+)>
<!ELEMENT   poem         (title?, stanza+)>
<!ELEMENT   title        (#PCDATA)>
<!ELEMENT   stanza       (line)+>
<!ELEMENT   line         (#PCDATA)>

IV. Week 5

Exercise

→ Write a DTD for the following document and save it as c:\lab\email.dtd
<?xml version="1.0" encoding="UTF-8" ?>
<email>
 <header>
   <subj>my dog</subj>
   <date>Mon, 7 Feb 2004 13:57:24 +0100</date>
   <from>"thedoglady@wanteddogs.org" &lt;thedoglady@wanteddogs.org&gt;</from>
   <to>"edward.vanhoutte@kantl.be" &lt;edward.vanhoutte@kantl.be&gt;</to>
 </header>
 <body>
   <open>Sir,</open>
   <p>Last week I lost my dog and don't know where she is now.</p>
   <p>Normally I keep it in a bag under my arm, but now she's gone.</p>
   <p>Could you please check all your bags and look whether you can find my dog?</p>
   <close>Very many thanks in advance</close>
   <sign>
     <name>Lady D. Og</name>
     <address>
       <addrLine>Department of lost dogs</addrLine>
       <addrLine>Ministry of dogs and bags, Dogtown</addrLine>
       <addrLine>thedoglady@wanteddogs.org</addrLine>
     </address>
   </sign>
 </body>
</email>

email DTD

<!ELEMENT    email     (header, body)>
<!ELEMENT    header    (subj, date, from, to)>
<!ELEMENT    subj      (#PCDATA)>
<!ELEMENT    date      (#PCDATA)>
<!ELEMENT    from      (#PCDATA)>
<!ELEMENT    to        (#PCDATA)>
<!ELEMENT    body      (open | p | ps | close | sign)*>
<!ELEMENT    open      (#PCDATA)>
<!ELEMENT    p         (#PCDATA)>
<!ELEMENT    close     (#PCDATA)>
<!ELEMENT    ps        (#PCDATA)>
<!ELEMENT    sign      (name | address)*>
<!ELEMENT    name      (#PCDATA)>
<!ELEMENT    address   (addrLine)+>
<!ELEMENT    addrLine  (#PCDATA)>
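
Because XML names are case sensitive, every element name in the document must be spelled exactly as in its <!ELEMENT ...> declaration: a declaration for addrline does not cover an element called addrLine. The sketch below is a quick cross-check of names only, not a validation. It assumes Python's standard library; declared_names, undeclared_elements and the two SAMPLE strings are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def declared_names(dtd_text):
    """Collect the element names declared with <!ELEMENT ...> in a DTD."""
    return set(re.findall(r"<!ELEMENT\s+([^\s>]+)", dtd_text))


def undeclared_elements(document, dtd_text):
    """Return the element names used in the document but not declared in the DTD."""
    used = {element.tag for element in ET.fromstring(document).iter()}
    return sorted(used - declared_names(dtd_text))


# A shortened sample, not the full email DTD.
SAMPLE_DTD = """
<!ELEMENT address  (addrLine)+>
<!ELEMENT addrLine (#PCDATA)>
"""
SAMPLE_DOC = "<address><addrLine>Dogtown</addrLine></address>"

if __name__ == "__main__":
    print(undeclared_elements(SAMPLE_DOC, SAMPLE_DTD))
    # Same DTD with the name in lower case: the document's addrLine is now undeclared.
    print(undeclared_elements(SAMPLE_DOC, SAMPLE_DTD.replace("addrLine", "addrline")))
```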

Validating XML

XML can be validated when we have:
  • an XML document
  • a DTD
  • a validating parser
Make sure:
  • the Doctype Declaration inside the XML document refers to the appropriate DTD and its path
  • the parser can find the SGML declaration for XML (xml.dcl)

Validating with NSGMLS (SP)

check whether the following files are in the bin folder of the program, and download from

Validating with NSGMLS (SP)

Validation using the Windows interface:
  • Specify where nsgmls can find the catalog file under Options in the toolbar of runsp2.
  • Specify where nsgmls can find xml.dcl under Options in the toolbar of runsp2.
Validation using the command-line interface (DOS):
  • move to the correct folder, e.g. "cd c:\lab"
  • make sure a file named "catalog" is located there, containing a single entry: "SGMLDECL c:\sp\bin\xml.dcl"
  • run the command: "c:\sp\bin\nsgmls -s [file].xml"

A touch of style

Download the following files from http://www.kantl.be/ctb/vanhoutte/teach/hc2006.htm (week 5) to C:\lab:
  • email.xml
  • email.dtd
  • email.xsl
  • email.css
  • poem.xml
  • poem.xsl
  • teixlite.dtd
  • iso-lat1.ent
  • iso-lat2.ent
  • iso-num.ent
  • iso-pub.ent

Exercise 1

  • http://www.kantl.be/ctb/vanhoutte/teach/hc2006.htm (Week 5)
    • Copy email.xml to C:\lab\
    • Copy email.dtd to C:\lab\
    • Copy iso-lat1.ent, iso-lat2.ent, iso-num.ent & iso-pub.ent to C:\lab\
    • Validate with NSGMLS (SP)
  • CD-ROM
    • Copy exx/email.xml to C:\lab\
    • Copy exx/email.dtd to C:\lab\
    • Copy exx/iso-lat1.ent, exx/iso-lat2.ent, exx/iso-num.ent & exx/iso-pub.ent to C:\lab\
    • Validate with NSGMLS (SP)

Exercise 2

  • Validate with Open XML Editor

Exercise 3

  • Hijack your neighbour's computer
  • Sneak some mistakes into his/her email.xml document
  • Let him/her validate the document and look for the mistakes