CTB Centrum voor Teksteditie en Bronnenstudie
Koningstraat 18
B-9000 Gent
Belgium
email: ctb@kantl.be
tel: +32 (0)9 265 93 50
fax: +32 (0)9 265 93 49

The CTB organises the Seminar in Electronic Editing, at which international speakers give lectures and demonstrations. The seminars are open to all and take place in the building of the Koninklijke Academie voor Nederlandse Taal- en Letterkunde at Koningstraat 18 in Ghent. At the seminar of 6 September 2002 the speakers are Joke Debusschere, Vincent Neyt and Edward Vanhoutte.

Vincent Neyt
On the Advantages and Disadvantages of XSLT for Manuscript Transcription.

The point of departure is the electronic edition of one of Joyce's notebooks for Finnegans Wake. The 'Guiltless notebook' (47471b, so called because the first word is 'Guiltless') is about 90 pages long and contains very early draft versions of chapters 2, 3, 4, 5, 7 and 8 of Book I. Over a number of years the James Joyce Center at the University of Antwerp (UIA) has made a full transcription of this notebook using TEI-conformant XML/SGML. In a first phase this highly complex collection of textual units was approached from a teleological standpoint: the transcriptions were made on the basis of the narrative structure of the novel as it was first published years later, and encoded in order to leave open the possibility of other approaches (chronological or document-oriented) in a second phase.[1]

A first point of focus in this paper will be the transcription itself. Seamus Deane called Finnegans Wake "in an important sense, unreadable" (Deane 1992, vii) and the same applies to Joyce's notebooks. Joyce intended to write only on the recto pages, but almost immediately started using the verso pages for additions too lengthy to fit between the lines or in the margins, and soon thereafter scribbled whole additional paragraphs on the versos, all linked to each other by a confusing and unsystematic complex of lines and sigla. Because of the complexity of this manuscript, we have had to stretch the tags defined in the full XML version of the TEI Document Type Definition up to and beyond their limits, and to decide quite arbitrarily whether to store information in elements, in attributes or as data content.[2] For encoding additions and deletions on a manuscript page in particular, TEI offers too few distinctive attributes for the editor to record the type of addition or deletion, information he should be able to provide. These limitations imposed by the TEI DTDs endanger the third "overarching goal" that C. M. Sperberg-McQueen proposes for 'serious electronic editions' ("accessibility, longevity, and intellectual integrity"). Sperberg-McQueen correctly argues that accessibility and longevity are secured by the TEI encoding scheme, but has to admit: "The intellectual integrity of materials encoded with the TEI encoding scheme is harder to guarantee." (Sperberg-McQueen 1994).

The following sample code demonstrates how hard it is to encode a clear distinction between additions Joyce made immediately (and wrote inline) and additions he made later while rereading (and mostly wrote above or below the line):

He had left the country <del type="O" rend="overstrike" resp="DVH">by </del>
<add place="inline" hand="JJ" resp="DVH"><hi rend="italic">via</hi></add>
a subterranean tunnel<add place="facingleaf, 47471b-27v" rend="$I" hand="JJ" resp="DVH">
<del type="S" rend="overstrike" resp="DVH"> lined </del>
<xref doc="shored"><add place="supralinear, 47471b-27v" hand="JJ" resp="DVH">shored </add>
with bedboards</xref>.
<add place="marginleft, 47471b-27v" hand="JJ" resp="DVH">An infamous private ailment
(vario<add place="supralinear, 47471b-27v" hand="JJ" resp="DVH">lo</add>venereal)
had claimed him.</add></add>
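
As a rough illustration of the problem (a sketch in Python's standard xml.etree.ElementTree module, not part of the edition's own toolchain), the sample can be parsed and its additions listed. The wrapping <p> element and the helper name split_place are assumptions made only so that the fragment parses and the values can be inspected. The listing shows that the place attribute carries both a position and a page reference in one string, and that the timing of an addition can only be inferred from that position.

```python
import xml.etree.ElementTree as ET

# The sample from the text, wrapped in a hypothetical <p> root so that it parses.
SAMPLE = """<p>He had left the country <del type="O" rend="overstrike" resp="DVH">by </del>
<add place="inline" hand="JJ" resp="DVH"><hi rend="italic">via</hi></add>
a subterranean tunnel<add place="facingleaf, 47471b-27v" rend="$I" hand="JJ" resp="DVH">
<del type="S" rend="overstrike" resp="DVH"> lined </del>
<xref doc="shored"><add place="supralinear, 47471b-27v" hand="JJ" resp="DVH">shored </add>
with bedboards</xref>.
<add place="marginleft, 47471b-27v" hand="JJ" resp="DVH">An infamous private ailment
(vario<add place="supralinear, 47471b-27v" hand="JJ" resp="DVH">lo</add>venereal)
had claimed him.</add></add></p>"""


def split_place(value):
    """Split a place value such as 'facingleaf, 47471b-27v' into (position, page)."""
    position, _, page = value.partition(",")
    # 'inline' has no page part, so the page is reported as None.
    return position.strip(), page.strip() or None


root = ET.fromstring(SAMPLE)
# iter() visits the <add> elements in document order, nested ones included.
additions = [split_place(add.get("place")) for add in root.iter("add")]
for position, page in additions:
    print(position, page)
```

Nothing in the markup other than the position keyword separates an immediate revision from a later one, which is the ambiguity described above.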

A second point of focus will be the usability of this transcription: can this XML archive be used to generate automatically all the different visualisations of the manuscript material that textual critics have come to expect in an electronic edition? Or is our encoding so dependent on the way we structured the material that it can no longer be broken down and restructured automatically? With reasonable success, I have been using XSL Transformations to this end. I have transformed the teleologically structured files (section per section) into manuscript-orientated files (page per page) by running three XSLT stylesheets in sequence. I have generated unique IDs in certain tags, generated new elements from attribute values and automatically linked them to the corresponding anchors in the edition.
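
The kind of operation described here (unique IDs, new elements derived from attribute values, links back to the source element) can be sketched as follows. This is Python rather than XSLT, and it is not the stylesheet used for the edition: the function name add_ids_and_locations, the <loc> element and its page and target attributes are all hypothetical, chosen only to make the idea concrete.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a section-oriented file, built from phrases in the sample.
SECTION = """<div type="section">
<p>a subterranean tunnel<add place="facingleaf, 47471b-27v" hand="JJ">shored with bedboards</add>.</p>
<p>(vario<add place="supralinear, 47471b-27v" hand="JJ">lo</add>venereal)</p>
</div>"""


def add_ids_and_locations(root):
    """Give every <add> a unique id and move the page reference into a new <loc> child."""
    # list() freezes the selection before the tree is modified.
    for number, add in enumerate(list(root.iter("add")), start=1):
        add_id = f"add-{number}"
        add.set("id", add_id)
        position, _, page = add.get("place", "").partition(",")
        add.set("place", position.strip())
        if page.strip():
            # Hypothetical element: holds the page and points back to its <add>.
            loc = ET.Element("loc", {"page": page.strip(), "target": "#" + add_id})
            # Keep the original text of the addition directly after the new element.
            loc.tail = add.text
            add.text = None
            add.insert(0, loc)
    return root


root = add_ids_and_locations(ET.fromstring(SECTION))
print(ET.tostring(root, encoding="unicode"))
```

Once the page reference lives in its own element, a later step can select on it directly instead of taking an attribute string apart.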

XSLT is a very powerful tool. It has, however, been developed for and by people who use it to extract information from databases. In the humanities we deal with texts. In a database, information is stored mainly in elements, and there is no advantage in 'locking away' information as an attribute value. Encoded literary texts, on the other hand, are linear and have to remain legible at all times, so any information the editor wishes to add ends up in attributes. This results in very 'heavy' tags, which limit the possibilities of XSLT a great deal. But if you take some specific factors into account while encoding, your archive can still benefit from the unmistakable power of XSLT:

  1. Predefine all attribute values or make an inventory of all values used, because XSLT 1.0 can only test values with basic string functions such as contains() and starts-with(), not with regular expressions![3]
  2. Make tags context-independent so they don't lose all meaning and usability when you extract them from their context or restructure the data using XSLT. E.g. an addition on the facing leaf of page 4 that is encoded simply as <add place="facingleaf"> depends for its absolute location in the manuscript on the tag <pb n="4"/> which precedes it. If you extract all 'facingleaf' additions from their context using XSLT, the acquired data is unusable.
  3. Try not to make the divisions in your document dependent on empty tags like <pb />. XSLT works on whole elements, from start tag to end tag, and an empty tag encloses nothing. For instance, XSLT cannot readily extract everything from <pb n="12" /> up to <pb n="24" /> in a 100-page document.
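
Points 2 and 3 can be made concrete with a small procedural sketch. It is written in Python, not XSLT, and the document below is a hypothetical miniature assembled from phrases of the sample; the function names additions_with_page and text_by_page are likewise assumptions. The first function shows that a bare <add place="facingleaf"> only acquires a page once the preceding <pb/> is taken into account. The second groups text under each <pb/> milestone, assuming that the milestones are direct children of one flat parent element.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature: page breaks are empty milestones, additions carry no page.
DOC = """<text><pb n="4"/>He had left the country
<add place="facingleaf">shored with bedboards</add>
<pb n="5"/>a subterranean tunnel
<add place="facingleaf">had claimed him</add></text>"""


def additions_with_page(root):
    """Walk in document order, remember the last <pb/>, and return (page, place, text) per <add>."""
    current_page = None
    found = []
    for element in root.iter():
        if element.tag == "pb":
            current_page = element.get("n")
        elif element.tag == "add":
            found.append((current_page, element.get("place"), "".join(element.itertext())))
    return found


def text_by_page(root):
    """Group the text of a flat parent element under the <pb/> milestone that precedes it."""
    pages = {}
    current_page = None
    for child in root:
        if child.tag == "pb":
            # The text that follows an empty element is stored as its tail.
            current_page = child.get("n")
            pages[current_page] = child.tail or ""
        elif current_page is not None:
            pages[current_page] += "".join(child.itertext()) + (child.tail or "")
    return pages


root = ET.fromstring(DOC)
located = additions_with_page(root)
pages = text_by_page(root)
print(located)
print(sorted(pages))
```

Both functions depend on remembering state while walking the document in order. Encoding the page directly on each addition, or wrapping each page in an element of its own, removes that dependency.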

Notes

  • 1. First presented in Van Hulle & Vanhoutte (2001).
  • 2. A similar problem is discussed in detail by Birnbaum (2001). Birnbaum presents arguments for element-orientated encoding which are very relevant to the material I'm working on: "elements can provide types of structural control that are unavailable with attributes" (32).
  • 3. "support for regular expressions for matching against any or all of text nodes, attribute values, attribute names, element type names" is announced in the XSLT 1.0 specification as a "Feature under Consideration for Future Versions" <http://www.w3.org/TR/xslt> and will be available in XSLT 2.0.

Literature

  • Birnbaum, David J. (2001), "The relationship between general and specific DTDs: criticizing TEI critical editions", in: Markup Languages: Theory & Practice 3.1 (2001): 17-53.
  • Deane, Seamus (1992), "Introduction", in: James Joyce, Finnegans Wake. Penguin Books, 1992.
  • Sperberg-McQueen, C. M. (1994), "Textual Criticism and the Text Encoding Initiative." Paper presented at MLA '94, December 1994, San Diego <http://www.tei-c.org/Vault/XX/mla94.html>.
  • Van Hulle, Dirk and Edward Vanhoutte (2001), "The 'Guiltless' Notebook". Paper presented at Genetic Joyce Studies. Antwerp: University of Antwerp (UFSIA), 30 March 2001.
  • XSLT 1.0 <http://www.w3.org/TR/xslt>

XHTML author: Edward Vanhoutte
Last revision: 31/07/2002

