Centrum voor Teksteditie en Bronnenstudie (KANTL)
Koningstraat 18
B-9000 Gent
Belgium
email: ctb@kantl.be
tel: +32 (0)9 265 93 50
fax: +32 (0)9 265 93 49

The staff of the CTB report on the progress of their scholarly projects in national and international journals and are frequent speakers at (inter)national colloquia.


Electronic Texts in the Humanities. Principles and Practice.
Susan Hockey. New York, Oxford University Press, 2000. xii + 216 pp.
ISBN 0-19-871194-8 - £50 (cloth), ISBN 0-19-871195-6 - £18.99 (paperback)

Review by Ron Van den Branden

ron.vandenbranden@kantl.be

This book presents itself as the first overview of electronic texts in the humanities. It covers a wide array of issues involved in the acquisition, creation, evaluation, use, publication and analysis of electronically available textual material in the field of the humanities. Hockey accomplishes this with a bibliographical approach, in which principles and theory are illustrated by and/or mediated through an assessment of the relevant research literature. This is done in 10 chapters, spanning 170 pages of text, accompanied by 26 pages of bibliography and a 17-page general index.

In the first chapter Hockey sets out a general motivation for and a clear definition of the field of interest. She defines electronic text in the humanities as "an electronic representation of any textual material which is an object of study for literary, linguistic, historical, or related purposes" (p. 1). This definition emphasizes computer-readable transcription rather than digital images of the texts, since at present only the former "can be searched and manipulated by computer programs in many different ways" (p. 2). Hockey nicely phrases the reason why electronic text deserves a monograph of its own in the principle that "[i]t is much easier and more sensible to put the intelligence into the text in the form of encoding than it is to build sophisticated intelligence into computer programs" (pp. 4-5). A short historical overview illustrates the emancipation of electronic resources from mere support materials for tedious formatting tasks to autonomous research tools with the capacity to greatly expand the range of possibilities for research.

Chapter 2 devotes a lot of space to the acquisition of existing electronic texts. The difficulties one encounters with a first approach, finding usable texts on the internet, reveal one of the most important requirements for electronic texts: the availability of metadata describing their status. Since the web at present still lacks serious metadata provisions, Hockey goes on to describe in some detail existing research collections of different kinds of electronic texts for applications in the humanities (Oxford Text Archive, Thesaurus Linguae Graecae, Dictionary of Old English, ...) and linguistics (Brown Corpus, Lancaster-Oslo/Bergen Corpus, International Corpus of English, ...). This overview is followed by a short discussion of corpus design and its influence on corpus research. The chapter concludes with a description of methods to create electronic texts from scratch, briefly discussing the cons of Optical Character Recognition and of getting texts from typesetting tapes, and the pros of keyboarding.

One of the benefits of keyboarding text is the possibility of adding some interpretative encoding at the same time, a topic to which the entire third chapter is devoted. In her argument, Hockey parallels the developing awareness of the need for structured metadata in electronic resources with the history of encoding systems. She explains how and why typographic markup aimed at generating print output from electronic research materials was soon abandoned for structural markup, starting with codes for easier line citation and developing into full-fledged encoding schemes like COCOA, HTML, SGML, XML and MECS. In this overview, a number of problem areas are touched upon, such as the encoding of non-standard characters and of texts in foreign languages with different writing systems and writing directions. Possible solutions like transliteration and Unicode are pointed out. The main focus of this chapter is on the principles of SGML/XML and on one of the most significant applications in this format, the encoding scheme proposed by the Text Encoding Initiative (TEI). The basics of this scheme are covered in great detail, culminating in an example nearly six pages long (and by then comprehensible); yet Hockey manages to temper her enthusiasm by acknowledging some shortcomings in the encoding of details of analytic bibliography, newspaper texts and spoken texts, as well as a problem inherent in XML-based schemes: overlapping hierarchies.
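As a small illustration of the principle that structural markup puts "the intelligence into the text", the following sketch shows how a very simple program can answer a question such as "which lines does each character speak?" once speeches and lines are explicitly encoded. It is not taken from the book: the sample text is invented, the element names (sp, speaker, l) are merely modelled on TEI conventions, and the fragment is not a valid TEI document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: element names follow TEI conventions
# (sp, speaker, l), but the text is invented and this is not valid TEI.
SAMPLE = """
<div type="scene" n="1">
  <sp who="#A"><speaker>Anna</speaker>
    <l n="1">The night is cold and still.</l>
    <l n="2">No sound comes from the hill.</l>
  </sp>
  <sp who="#B"><speaker>Bram</speaker>
    <l n="3">Then let us wait until dawn.</l>
  </sp>
</div>
"""

def lines_by_speaker(xml_text):
    """Map each speaker name to a list of (line number, line text) pairs."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for sp in root.iter("sp"):
        name = sp.findtext("speaker")
        for line in sp.iter("l"):
            result.setdefault(name, []).append((line.get("n"), line.text))
    return result

speeches = lines_by_speaker(SAMPLE)
for name, lines in speeches.items():
    print(name, lines)
```

Without the markup, the same program would have to guess from layout or punctuation where a speech begins and who is speaking, which is exactly the kind of "sophisticated intelligence" the encoding makes unnecessary.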

Chapter 4 forms a bridge between the theoretical background of the first three chapters and the overview of applications of electronic texts in different research areas in the following chapters. Hockey presents an introduction to the basic methodological toolbox for the analysis of electronic texts offered by concordance and text retrieval software. The most basic function is the derivation of a word frequency list, possibly enriched into a word index with references to each word's occurrences in the text. Concordances provide even more direct contextual information by surrounding the words with the text fragments in which they occur. A further common function is the retrieval of pairs of frequently co-occurring words, presented with some context in a list of collocations. These concepts are illustrated with example outputs. Trivial as these tasks may seem, Hockey points out some problems regarding lemmatisation (apostrophes and hyphens) and alphabetical sorting (spelling peculiarities, non-standard characters). Ultimately, such basic problems with the machine interpretation of natural language again illustrate the pressing need to make ambiguous features explicit through markup.
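The four basic functions described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the behaviour of any concordance package discussed in the book: the sample text is invented, the function names are hypothetical, the tokeniser naively treats internal apostrophes and hyphens as part of a word, and "collocation" is reduced to counting directly adjacent word pairs.

```python
import re
from collections import Counter, defaultdict

TEXT = "The cat sat on the mat. The dog sat on the log."

def tokenize(text):
    # Naive assumption: a word is a run of letters, optionally joined by
    # internal apostrophes or hyphens. Real texts need more careful choices.
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def word_index(tokens):
    """Map each word to the token positions at which it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def kwic(tokens, keyword, width=2):
    """Keyword-in-context rows: (left context, keyword, right context)."""
    rows = []
    for pos, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, pos - width):pos])
            right = " ".join(tokens[pos + 1:pos + 1 + width])
            rows.append((left, tok, right))
    return rows

def adjacent_pairs(tokens):
    """Count directly adjacent word pairs, a crude stand-in for collocations."""
    return Counter(zip(tokens, tokens[1:]))

tokens = tokenize(TEXT)
frequencies = Counter(tokens)   # word frequency list
index = word_index(tokens)      # word index
concordance = kwic(tokens, "sat")
pairs = adjacent_pairs(tokens)

print(frequencies.most_common(3))
print(index["sat"])
for left, key, right in concordance:
    print(f"{left:>10} [{key}] {right}")
print(pairs.most_common(2))
```

Even in this toy version the tokenising rule is a decision with consequences: changing how apostrophes and hyphens are handled changes every count, which is the kind of problem Hockey raises.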

The next five chapters provide an overview of a wide range of applied research using electronic texts. Chapter 5 highlights research in the area of literary analysis, covering the study of a myriad of topics such as gender, character analysis, themes and imagery, names, sound and metre, and drama. Chapter 6 discusses linguistic research using electronic texts. Hockey points out how analysis of bare textual features, measured by type/token ratios, collocations, and multivariate factor analysis, can yield useful insights for lexical and grammar studies, syllabus design and sociolinguistic research. More sophisticated analysis requires more sophisticated texts, enriched with linguistic annotations like lemmatisation or morpho-syntactic codes. This leads Hockey to a discussion of some issues in part-of-speech tagging, for which she takes the historical development of the Brown Corpus and the CLAWS tagging software as a case in point. Chapter 7 evaluates how statistical analysis of vocabulary usage, word and sentence length, and word frequencies of (un)common words has been put to use in stylometry and attribution research. Hockey starts chapter 8 with some terminological clarifications regarding electronic editions, and an overview of tools and techniques for the preparation and publication of scholarly editions, in both traditional paper and electronic formats. The development and uses of two outstanding software packages geared toward the production of collations, TuStep and Collate, are described in detail. A second major part of this chapter is devoted to current practice and theory regarding electronic editions. A description of some HTML editions is followed by an evaluation of various efforts to build a theoretical understanding of electronic editions: the Guidelines for Electronic Scholarly Editions (MLA), Shillingsburg's Principles for Electronic Archives, Scholarly Editions, and Tutorials, and the insights gained from the Model Editions Partnership (MEP). 
Finally, two theoretically well-developed edition projects are discussed: editions by the MEP and Robinson's The Wife of Bath's Prologue. Hockey concludes with her reservations regarding the status of digital images in electronic editions.

Chapter 9 focuses on the development of dictionaries and lexical databases for human use as well as for use within computer-oriented tools. Again, history forms the backbone of Hockey's story. Explaining the evolution from computer-assisted production of traditional dictionaries to that of fully electronic products, she illustrates how electronic texts can feed into better analysis tools and more intelligent retrieval systems for these and other electronic resources. Chapter 10 concludes with Hockey's expectations and desiderata for future developments in the further establishment of scholarship, multi-disciplinary cooperation, creation tools and analysis software.

As this extensive summary suggests, the rich contents of the book make it a valuable resource for a broad audience interested in the development and use of electronic texts. Much effort is made to avoid an over-technical perspective, by starting from clear definitions and focusing on understandable principles. Hockey's continuous stress on the need for a firm theoretical basis underlying research using electronic texts, the need for methodological transparency and the dangers of technological reductionism frames her treatment of the wide range of research and studies within a coherent vision. Furthermore, the interdisciplinary approach allows her to stress the opportunities for mutually inspiring exchanges of methodologies, advances and insights among the different research areas that are discussed. However, some critical notes can be made about the organisation of the materials. Of course, each of the subjects addressed fills whole bookshelves of monographs (and many of those are mentioned and listed in the bibliography). Yet in some places the limited space reduces the abundance of research reviewed to a shallowly annotated list of references (for example, chapter 5 mentions about 30 studies in only 18 pages). The lack of graphical materials (actually, chapter 4 is the only one containing figures, in the form of tables) is sometimes annoying, especially in descriptions of electronic products and their interfaces. Also, the combination of a loose structure and an interdisciplinary approach is not always successful. Often Hockey hooks her narrative to methodological pegs; at other times the reader has to notice a transition to a more thematic organisation. This also causes some methodological and theoretical issues, such as corpus design, to reappear in different places throughout the book, whereas it would perhaps be more suitable to treat them in a less fragmented manner. 
The boundaries between the different research areas in the chapters are not always very clear either. Some topics in the fifth chapter on literary analysis (echoed phrases, genre, gender analysis) and the sixth chapter on linguistic analysis (analysis of lexical features to characterise textual styles) seem to overlap with the seventh chapter on stylometry and attribution studies (in which only the latter subject is highlighted, however). This may result from an imbalance in assumed background knowledge about the techn(olog)ical aspects of electronic texts on the one hand, and the different research areas that make use of them on the other. Sometimes the narrative prevails over clear structuring of and motivation for the items covered. Hockey introduces the technological concepts rigorously and in a very comprehensible manner, yet neglects to introduce the chapters discussing their application in the different research fields with a clear theoretical definition of those fields. This leads me to a (perhaps provocative, but hopefully balanced) conclusion about the book. At its best (in my opinion mainly the first 4 chapters), it provides a broad overview of humanities computing and of the literature in that field. At its worst (chapters 5 to 9), it threatens to prove that a reader does not necessarily need electronic texts to get lost in her quest for relevant information.



© 2004 ALLC, CTB, and Ron Van den Branden

This text is also published in Literary and Linguistic Computing, 19(3) (2004): 245-248.


XHTML author: Edward Vanhoutte
Last revision: 07/07/2004

