B33080 Humanities Computing: Electronic Text
University of Antwerp, Campus Drie Eiken
Second Term 2006
Edward Vanhoutte
edward.vanhoutte@kantl.be

Week 5: XML theory and practice: valid XML - parsing/validating - TeixLite.
Monday 13 March
II. Monday 13 March Revision of week 4.
- Revision of week 2
- Standard Generalized Markup Language (SGML)
- Text Encoding Initiative (TEI)
- eXtensible Markup Language (XML)
- What it is
- SGML - XML
- Input & output
- Production Process
- XML: theory & practice
- How do we recognize an XML document?
- Minimal XML document
- Jargon
How do we recognize an XML document?
→ The XML declaration
<?xml version="1.0" ?>
This tells you:
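- <? ... ?>: the syntax of a Processing Instruction (strictly speaking, the XML declaration is not itself a processing instruction, but it uses the same delimiters)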
- xml: this is XML
- version="1.0": specifies which version of the spec is used here
→ The root element
<root>Content</root>
This is the largest Russian doll containing all the others
<?xml version="1.0" ?>
<root>Hello world</root>
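A minimal sketch of how a parser sees this document. It assumes Python and its standard-library module xml.etree.ElementTree, neither of which is part of the course software; the parser only checks that the text is acceptable XML and does not validate anything.

```python
import xml.etree.ElementTree as ET

# The minimal document from the slide: XML declaration plus one root element.
MINIMAL = '<?xml version="1.0" ?>\n<root>Hello world</root>'

# fromstring() raises ET.ParseError if the text is not well-formed XML.
root = ET.fromstring(MINIMAL)

print(root.tag)   # name of the root element
print(root.text)  # character content of the root element
```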
An XML document contains
- Processing Instructions
- Elements
- Attributes (optional)
- Entity References
- CDATA
An XML document contains: Processing Instructions
<?xml version="1.0" ?>
<?xml-stylesheet href="../dtd/xsltslides.xsl" type="text/xsl" ?>
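The stylesheet instruction can be inspected with a parser. This is a sketch only, assuming Python's standard-library xml.dom.minidom; the <root> element is a placeholder added so that the text is a complete document.

```python
from xml.dom import minidom

DOC = (
    '<?xml version="1.0" ?>\n'
    '<?xml-stylesheet href="../dtd/xsltslides.xsl" type="text/xsl" ?>\n'
    '<root>Hello world</root>'  # placeholder root element (assumption)
)

dom = minidom.parseString(DOC)

# Collect the processing instructions found at the top level of the document.
# The XML declaration is not reported as one of them.
pis = [n for n in dom.childNodes if n.nodeType == n.PROCESSING_INSTRUCTION_NODE]

for pi in pis:
    # target is the name after "<?", data is everything that follows it.
    print(pi.target, "->", pi.data)
```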
An XML document contains: Elements
e.g.
<title> or </title>
→ Exception: empty elements: <empty />
→ XML is case sensitive in the naming of the elements. <title> is not <TITLE> is not <Title> is not <tItle> etc.
An XML document contains: Attributes (optional)
Elements can have one or more optional or required attributes which provide more information on the element.
Syntax:
<element.name attribute.name="attribute.value">
e.g.
<title type="journal">
<name who="Edward" reg="VanhoutteE" >
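How a program reads these attributes can be sketched as follows, assuming Python's standard-library xml.etree.ElementTree. The end-tags and the text content are added for illustration so that each fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# End-tags and content are illustrative additions, not part of the slide.
title = ET.fromstring('<title type="journal">Example title</title>')
name = ET.fromstring('<name who="Edward" reg="VanhoutteE">Edward Vanhoutte</name>')

print(title.get("type"))   # value of the type attribute
print(name.attrib)         # all attributes of <name> as a dictionary
print(title.get("level"))  # an attribute that is not present yields None
```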
An XML document contains: Entity References
An entity is a named sequence of characters, predefined for convenience. Typical uses include:
- to represent characters which cannot reliably be typed in
- as a shortcut for boilerplate text
- containers for external (non-XML) data such as graphics
Syntax:
&entity.name;
e.g.
&eacute; = &#233; = &#xE9; = é
&uuml; = &#252; = &#xFC; = ü
&mayonnaise; = reference to an external recipe (text) for mayonnaise
<figure entity="ascii-full" />
Characters such as & (&amp;) and < (&lt;), which XML reserves for markup, and non-ASCII characters can be escaped with entity references or character references: ISO 8859 entity sets, ISO/IEC 10646 code points (decimal or hexadecimal), Unicode.
Note on ISO/IEC 10646
- XML requires the use of ISO/IEC 10646 or Unicode
- Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code intended to be fully compatible with ISO 10646, and an encoding for it.
- Unicode "imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications". (section Unicode & ISO 10646 of the Unicode FAQ <http://www.unicode.org/unicode/faq/>)
- a 32-bit character repertoire including most human writing systems
- encoded as UTF-8 (Web), UTF-16 (Java and Windows), or UTF-32 (Unix).
- Other encodings may be specified at the document level
ISO 8859 entity references can still be used in the XML document (Isolat1, Isolat2, Isopub, Isonum) when the entities are mapped to Unicode, either internally or externally (in the DTD).
Syntax
<!ENTITY eacute "&#233;" ><!-- small e, acute accent -->
<!ENTITY TEI "Text Encoding Initiative">
<!ENTITY allc SYSTEM "graphics/allc.gif" NDATA gif>
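The difference between character references, predefined entities and declared entities can be sketched with a non-validating parser. The example assumes Python's standard-library xml.etree.ElementTree; the <p> element and the internal DTD subset are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Character references and the predefined entities need no declaration.
plain = ET.fromstring("<p>&#233; &#xE9; &amp; &lt;</p>")
print(plain.text)

# A general entity declared in the internal DTD subset is expanded on parsing.
DECLARED = (
    '<!DOCTYPE p [\n'
    '  <!ENTITY TEI "Text Encoding Initiative">\n'
    ']>\n'
    '<p>&TEI;</p>'
)
print(ET.fromstring(DECLARED).text)

# Referring to an entity that was never declared is rejected by the parser.
try:
    ET.fromstring("<p>&eacute;</p>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```

This is why &eacute; only works when an entity set such as Isolat1 has been declared, whereas &#233; works everywhere.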
An XML document contains: CDATA
CDATA (character data) sections allow you to include application code (JavaScript, Perl, BASIC, etc.) in an XML document without having to worry about escaping characters. The XML processor passes the content of a CDATA section through exactly as it appears.
Syntax
<![CDATA[This text escapes the XML processor]]>
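A small sketch of the effect, assuming Python's standard-library xml.etree.ElementTree. The <script> element and the code inside it are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# "<" and "&" would be errors in ordinary content, but not inside CDATA.
WITH_CDATA = "<script><![CDATA[if (a < b && b < c) { run(); }]]></script>"
script = ET.fromstring(WITH_CDATA)
print(script.text)

# The same content without the CDATA wrapper is rejected by the parser.
try:
    ET.fromstring("<script>if (a < b && b < c) { run(); }</script>")
    bare_ok = True
except ET.ParseError:
    bare_ok = False
print(bare_ok)
```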
Well Formed XML
- XML is case sensitive
- At least one element
- There is always a root element
- All logical and physical structures nest properly
- Correspondence of element names in start- and end-tags
- Attribute names only appear once in a start-tag
- Attribute values are quoted
- Attribute values don't refer to external entities
- Entities are declared
- No entity reference contains the name of an unparsed entity
<?xml version="1.0" ?>
<greeting>
<salutation>hello</salutation>
<target>world</target>
</greeting>
- <greeting>Hello world!</greeting>
- <greeting>Hello world!<Greeting>
- <greeting>Hello world!<greeting />
- <greeting><salutation>hello<target></salutation>world</target></greeting>
- <greeting><salutation>hello</salutation><target>world</target></greeting>
- <greeting><Salutation>hello</salutation><target>world</Target><greeting>
- <greeting><salutation>hello</salutation<target>world</target></greeting>
- <salutation volume=loud>hello!</salutation>
- <salutation volume="loud" volume="10">hello!</salutation>
- <salutation volume="loud">hello!</salutation>
- <greeting type="non-verbal">
- <greeting type="non-verbal" />
- <Greeting type="non-verbal" />
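Fragments like these can be checked mechanically. The sketch below assumes Python's standard-library xml.etree.ElementTree, which reports any well-formedness error as ET.ParseError; the helper name is_well_formed is hypothetical, and only a selection of the fragments is included.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment: str) -> bool:
    """Return True if the parser accepts the fragment as a complete document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

CANDIDATES = [
    '<greeting>Hello world!</greeting>',
    '<greeting>Hello world!<Greeting>',
    '<greeting><salutation>hello<target></salutation>world</target></greeting>',
    '<salutation volume=loud>hello!</salutation>',
    '<salutation volume="loud" volume="10">hello!</salutation>',
    '<greeting type="non-verbal">',
    '<Greeting type="non-verbal" />',
]

for fragment in CANDIDATES:
    print(is_well_formed(fragment), fragment)
```

Try to predict each answer before running the check; the parser only tells you that something is wrong, not which rule was broken.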
Well Formed XML: Exercise
Subject: my dog
Date: Mon, 7 Feb 2004 13:57:24 +0100
From: "thedoglady@wanteddogs.org"
To: "edward.vanhoutte@kantl.be"
Sir,
Last week I lost my dog and don't know where she is now.
Normally I keep it in a bag under my arm, but now she's gone.
Could you please check all your bags and look whether you can find my dog?
Very many thanks in advance
Lady D. Og
Department of lost dogs
Ministry of dogs and bags, Dogtown
thedoglady@wanteddogs.org
<?xml version="1.0" encoding="UTF-8"?>
<email>
<header>
<subj>my dog</subj>
<date>Mon, 7 Feb 2004 13:57:24 +0100</date>
<from>"thedoglady@wanteddogs.org" &lt;thedoglady@wanteddogs.org&gt;</from>
<to>"edward.vanhoutte@kantl.be" &lt;edward.vanhoutte@kantl.be&gt;</to>
</header>
<body>
<open>Sir,</open>
<p>Last week I lost my dog and don't know where she is now.</p>
<p>Normally I keep it in a bag under my arm, but now she's gone.</p>
<p>Could you please check all your bags and look whether you can find my dog?</p>
<close>Very many thanks in advance</close>
<sign>
<name>Lady D. Og</name>
<address>
<addrLine>Department of lost dogs</addrLine>
<addrLine>Ministry of dogs and bags, Dogtown</addrLine>
<addrLine>thedoglady@wanteddogs.org</addrLine>
</address>
</sign>
</body>
</email>
A valid XML document contains a Document Type Declaration, which references the DTD the document must conform to
<!DOCTYPE TEI.2 PUBLIC "-//TEI//DTD TEI Lite 1.0//EN"
"../dtd/xmllite.dtd"
[
]>
Document Type Declaration
Contains the reference to a Document Type Definition (DTD)
A DTD specifies the vocabulary and the syntax of a markup language
It defines:
- names for all your elements
- names and default values for their attributes
- rules about how elements can nest
- names for re-usable pieces of data (entities)
- and a few other things
A DTD does not specify anything about what elements "mean"
Defining an element in a DTD
<!ELEMENT name contentModel>
- name is the name of the element (its generic identifier, GI)
- contentModel defines valid content for the element:
- #PCDATA
- EMPTY
- other elements
- mixed content combining PCDATA and other elements
Document Type Declaration
Contains the reference to a Document Type Definition (DTD)
→ The formal specification for the structure of an XML document
<?xml version="1.0" ?>
<greeting>
<salutation>hello</salutation>
<target>world</target>
</greeting>
DTD:
<!ELEMENT greeting (salutation, target)>
<!ELEMENT salutation (#PCDATA)>
<!ELEMENT target (#PCDATA)>
- <!ELEMENT a (b+)>
- <!ELEMENT b EMPTY>
- <!ELEMENT c (#PCDATA)>
- <!ELEMENT a (b,c)>
- <!ELEMENT a (b,c)+>
- <!ELEMENT a (b,c+)>
- <!ELEMENT a (b,c)*>
- <!ELEMENT a (b,c*)>
- <!ELEMENT a (b?,c)>
- <!ELEMENT a (b?,c)+>
- <!ELEMENT a (b?,c)*>
- <!ELEMENT a (b|c)>
- <!ELEMENT a (b|c)+>
- <!ELEMENT a (b|c+)>
- <!ELEMENT a (b|c)*>
- <!ELEMENT a (b|c*)>
- <!ELEMENT a (#PCDATA|b|c)*>
- <!ELEMENT a (b, (c|d))*>
- <!ELEMENT a (b?, (c|d)+)>
- <!ELEMENT a (b?, (c+|d+))>
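An element-only content model behaves like a regular expression over the names of the child elements: the comma means "followed by", | means "or", and ?, + and * mean "optional", "one or more" and "zero or more". The sketch below makes that reading concrete in Python. The helper names model_to_regex and children_match are hypothetical, and #PCDATA, EMPTY, ANY and parameter entities are deliberately not handled.

```python
import re

def model_to_regex(model: str) -> re.Pattern:
    """Turn an element-only content model such as "(b?,c)+" into a regex."""
    compact = model.replace(" ", "")
    # Wrap each element name in a group that also consumes one separator
    # space, so that ?, + and * apply to the whole name.
    grouped = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group()) + " )",
        compact,
    )
    # A comma means "followed by", which in a regex is plain concatenation.
    return re.compile(grouped.replace(",", ""))

def children_match(model: str, children: list[str]) -> bool:
    """Check a sequence of child element names against a content model."""
    sequence = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None

print(children_match("(b?,c)+", ["b", "c", "c"]))
print(children_match("(b|c+)", ["b", "c"]))
print(children_match("(b?, (c+|d+))", ["b", "d", "d"]))
```

Changing the list of children and predicting the result is a quick way to practise reading the models above.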
Real Example: Anthology DTD
<!ELEMENT anthology (poem,poem+)>
<!ELEMENT poem (title?, stanza+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line)+>
<!ELEMENT line (#PCDATA)>
IV. Week 5
→ write a DTD for the following document and save it as c:\lab\email.dtd
<?xml version="1.0" encoding="UTF-8" ?>
<email>
<header>
<subj>my dog</subj>
<date>Mon, 7 Feb 2004 13:57:24 +0100</date>
<from>"thedoglady@wanteddogs.org" &lt;thedoglady@wanteddogs.org&gt;</from>
<to>"edward.vanhoutte@kantl.be" &lt;edward.vanhoutte@kantl.be&gt;</to>
</header>
<body>
<open>Sir,</open>
<p>Last week I lost my dog and don't know where she is now.</p>
<p>Normally I keep it in a bag under my arm, but now she's gone.</p>
<p>Could you please check all your bags and look whether you can find my dog?</p>
<close>Very many thanks in advance</close>
<sign>
<name>Lady D. Og</name>
<address>
<addrLine>Department of lost dogs</addrLine>
<addrLine>Ministry of dogs and bags, Dogtown</addrLine>
<addrLine>thedoglady@wanteddogs.org</addrLine>
</address>
</sign>
</body>
</email>
<!ELEMENT email (header, body)>
<!ELEMENT header (subj, date, from, to)>
<!ELEMENT subj (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (open | p | ps | close | sign)*>
<!ELEMENT open (#PCDATA)>
<!ELEMENT p (#PCDATA)>
<!ELEMENT close (#PCDATA)>
<!ELEMENT ps (#PCDATA)>
<!ELEMENT sign (name | address)*>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (addrLine)+>
<!ELEMENT addrLine (#PCDATA)>
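Because element names are case sensitive, a declaration for addrline does not cover an element written as addrLine. A rough pre-check for this kind of slip can be sketched in Python. It is not a validator: it only compares the names used in a document with the names declared in a DTD, and the function name undeclared_elements and the shortened sample texts are assumptions made for the illustration.

```python
import re
import xml.etree.ElementTree as ET

def undeclared_elements(xml_text: str, dtd_text: str) -> set[str]:
    """Return element names used in the document but not declared in the DTD."""
    declared = set(re.findall(r"<!ELEMENT\s+([^\s>]+)", dtd_text))
    used = {element.tag for element in ET.fromstring(xml_text).iter()}
    return used - declared

# Shortened stand-ins for email.dtd and email.xml.
DTD = """
<!ELEMENT sign (name | address)*>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (addrLine)+>
<!ELEMENT addrLine (#PCDATA)>
"""
XML = """<sign>
  <name>Lady D. Og</name>
  <address><addrLine>Department of lost dogs</addrLine></address>
</sign>"""

print(undeclared_elements(XML, DTD))
# Lower-casing the declared name leaves addrLine without a declaration.
print(undeclared_elements(XML, DTD.replace("addrLine", "addrline")))
```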
XML can be validated when we have:
- an XML document
- a DTD
- a validating parser
Make sure:
- the Doctype Declaration inside the XML document refers to the appropriate DTD and its path
- the parser can find the SGML declaration for XML (xml.dcl)
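Before running the validator it can help to check what the Doctype Declaration actually points at. The sketch below assumes Python's standard-library xml.dom.minidom, which reads the declaration without fetching the DTD. The DOCTYPE line in the sample is an assumption about how email.xml would reference email.dtd, and dtd_reference is a hypothetical helper name.

```python
from pathlib import Path
from xml.dom import minidom

def dtd_reference(xml_text: str):
    """Return (root name, system identifier) from the DOCTYPE, or None."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None:
        return None
    return doctype.name, doctype.systemId

SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE email SYSTEM "email.dtd">\n'  # assumed declaration
    '<email><header/><body/></email>'
)

name, system_id = dtd_reference(SAMPLE)
print(name, system_id)

# The system identifier is resolved relative to the document's own folder.
print((Path("c:/lab") / system_id).exists())
```

The name in the declaration must match the root element, and the path must lead to the DTD from the folder in which the XML document is stored.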
Validating with NSGMLS (SP)
Check whether the following files are in the bin folder of the program, and download from
Validating with NSGMLS (SP)
Validation using Windows interface:
- Specify where nsgmls can find the catalog file under Options in the toolbar of runsp2.
- Specify where nsgmls can find xml.dcl under Options in the toolbar of runsp2.
Validation using command-line interface (DOS):
- move to the correct folder, e.g. "cd c:\lab"
- make sure file "catalog" is located there, containing single piece of information: "SGMLDECL c:\sp\bin\xml.dcl"
- command: "c:\sp\bin\nsgmls -s [file].xml"
- email.xml
- email.dtd
- email.xsl
- email.css
- poem.xml
- poem.xsl
- teixlite.dtd
- iso-lat1.ent
- iso-lat2.ent
- iso-num.ent
- iso-pub.ent
- http://www.kantl.be/ctb/vanhoutte/teach/hc2006.htm (Week 5)
- Copy email.xml to C:\lab\
- Copy email.dtd to C:\lab\
- Copy iso-lat1.ent, iso-lat2.ent, iso-num.ent & iso-pub.ent to C:\lab\
- Validate with NSGMLS (SP)
- CD-ROM
- Copy exx/email.xml to C:\lab\
- Copy exx/email.dtd to C:\lab\
- Copy exx/iso-lat1.ent, exx/iso-lat2.ent, exx/iso-num.ent & exx/iso-pub.ent to C:\lab\
- Validate with NSGMLS (SP)
- Validate with Open XML Editor
- Hijack your neighbour's computer
- Sneak some mistakes into his/her email.xml document
- Let him/her validate and look for the mistakes