Comments to Perry Willett, Indiana University (email: pwillett@indiana.edu)
At the TEI and XML in Digital Libraries Workshop held at the Library of Congress on June 30-July 1, 1998, three working groups were formed. Group 2 was charged with developing a set of recommendations for libraries using the TEI Guidelines in electronic text encoding. Representatives from six libraries met at the Library of Congress on November 12-13, 1998. The Task Force met again at ALA Midwinter (January 1999) to incorporate comments and finalize the draft. The revised recommendations were circulated to the conference working group in May 1999 and presented at the joint annual meeting of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing in June 1999. Version 1.0 was circulated for comments in August 1999.
Our recommendations are for libraries using the TEILite DTD v1.6. There are many different library text digitization projects, for different purposes. With this in mind, the Task Force has attempted to make these recommendations as inclusive as possible by developing a series of encoding levels. These levels are meant to allow for a range of practice, from wholly automated text creation and encoding, to encoding that requires expert content knowledge, analysis, and editing.
Encoding levels 1-4 require no expert knowledge of content. Level 5, in contrast, requires scholarly analysis. Texts at Levels 1-4 can be converted and encoded without the assistance of content experts, and can be enriched with more markup at any time. Recommendations for Levels 1-4 are intended for projects wishing to create encoded electronic text with structural markup, but minimal semantic or content markup. The encoding levels are also cumulative: encoding requirements at each level incorporate the requirements of lower levels.
These recommendations are concerned with the text portion of a TEI-encoded document. While there are modest requirements for including certain information about encoding level in the TEI Header, a separate set of recommendations has been developed to address the relationship of TEI Header contents to MARC-format bibliographic data (see TEI/MARC Best Practices Document from Working Group 1).
Purpose: To create electronic text with the primary purpose of keyword searching and linking to page images. The primary advantage in using the TEILite DTD at this level is that a TEI Header is attached to the text file.
Rationale: The text is subordinate to the page image, and is not intended to stand alone as an electronic text (without page images).
Texts at Level 1 can be created and encoded by fully automated means, using uncorrected OCR of page images ("dirty OCR") or by exporting from existing electronic text files. Only those tags that are necessary to divide the text from the header and facilitate linking to page images are used. Encoding is performed automatically based on artifacts of the OCR or other document creation process (page breaks, for example) and metadata collected during the imaging or preparation process. This encoding is both minimal and reliable, and does not typically require extensive review of each page of each text.
Level 1 texts are not intended to be adequate for textual analysis; they are more likely to be suited to the goals of a preservation unit or mass digitization initiative. Though their encoding is minimal, Level 1 texts are fully valid SGML texts. In addition to taking advantage of the TEI Header, using the TEILite DTD allows Level 1 texts to be compatible with more richly encoded TEILite texts for searching, for example. Further encoding based on document structures or content analysis can be added to a Level 1 text at any time.
Level 1 is most suitable for projects with the following characteristics:
| <DIV1> | TYPE="section" is the default attribute value. |
| <P> | One "container" element per DIV is required. |
| <PB> | This is required in Level 1. Page images can be linked to the text using ID/IDREF or ENTITYREF attributes. Using ENTITYREF has advantages for maintaining large numbers of image files, but would require modifying the TEILite DTD. |
| <FIGURE> | This element is optional at Level 1. The advantage of using <FIGURE> is the ability to record metadata using <FIGDESC>. |
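A minimal sketch of how the text portion of a Level 1 file might look when only the elements in the table above are used. The TEI Header is omitted, the page text is invented placeholder content, and the ID values (p0001, p0002) are hypothetical; how such identifiers resolve to image files is left to the delivery system and is not specified here.

```sgml
<!-- TEI Header omitted; only the text portion is shown -->
<TEXT>
<BODY>
<DIV1 TYPE="section">
<!-- one container paragraph holds the whole division -->
<P>
<PB N="1" ID="p0001">
Uncorrected OCR text of the first page ...
<PB N="2" ID="p0002">
Uncorrected OCR text of the second page ...
</P>
</DIV1>
</BODY>
</TEXT>
```

Because every page break carries its own identifier, a search hit anywhere in the text can be traced back to the nearest preceding <PB> and from there to a page image.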
Purpose: To create electronic text for keyword searching, linking to page images, and identifying simple structural hierarchy to improve navigation.
Rationale: The text is subordinate to the page image, though navigational markers (textual divisions, heads) are captured. The text could stand alone as electronic text (without page images) if the accuracy of its contents is suitable to its intended use and it is not necessary to display low-level typographic or structural information. Level 2 requires a set of elements more granular than those of Level 1, including bibliographic or structural information below the monographic or volume level, but identifying these elements still does not require a specialist.
Texts at Level 2 can be created and encoded by automated means, based on the typographic elements in the electronic file (for example, bold centered text at the top of the page surrounded by whitespace indicates a new chapter head, and thus a new division), but such automated encoding is not likely to be absolutely reliable across a large body of material. Level 2 encoding therefore requires some human intervention to identify each textual division and heading. Level 2 texts do not require any specialist knowledge or manual intervention below the section level.
Level 2 texts can be displayed separately from their page images. Even when displayed with page images, Level 2 encoding of sections and heads provides greater navigational possibilities than Level 1 encoding, and enables searching to be restricted within particular textual divisions (for example, searching for two phrases within the same chapter).
Level 2 is most suitable for projects with the following characteristics:
All elements specified in Level 1 plus the following:
| <FRONT>, <BACK> | Optional |
| <HEAD> | Required if present |
| <DIV1> | TYPE="section" is the default attribute value. It is recommended that the N attribute be included to record the div sequence. |
| <P> | One "container" element per DIV is required. |
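As an illustration of the additions in the table above, the following sketch shows two divisions encoded at Level 2. The headings, page numbers, and ID values are invented for the example, and the TEI Header is again omitted.

```sgml
<!-- TEI Header omitted; only the text portion is shown -->
<TEXT>
<BODY>
<DIV1 TYPE="section" N="1">
<HEAD>Chapter I</HEAD>
<!-- still one container paragraph per division -->
<P>
<PB N="1" ID="p0001">
Text of the first chapter ...
</P>
</DIV1>
<DIV1 TYPE="section" N="2">
<HEAD>Chapter II</HEAD>
<P>
<PB N="15" ID="p0015">
Text of the second chapter ...
</P>
</DIV1>
</BODY>
</TEXT>
```

With each chapter in its own <DIV1>, a search can be restricted to a single division, and the <HEAD> elements can be used to build a table of contents for navigation.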
Purpose: To create text that can stand alone as electronic text and identifies hierarchy and typography without content analysis being of primary importance.
Rationale: Level 3 texts can be created from scratch or by the relatively easy conversion of existing HTML or word-processing documents. Encoding offers the advantage of the TEI Header, interoperability with other TEI collections, and extensibility to higher levels of encoding. Level 3 generally requires some human editing, but the features to be encoded are determined by the appearance of the text and not specialized content analysis.
Level 3 texts identify front and back matter, and all paragraph breaks. The finer granularity of tagging these features, as well as figures, notes, and all changes of typography, allows a range of options for display, delivery, and searching. For example, one has the option of identifying and, therefore, specifying the display characteristics of different typographic styles, and regularizing the display and placement of note text.
Level 3 texts can stand alone as text without page images and, therefore, can be uploaded, downloaded and delivered quickly, and require less storage space than digital collections with page images. However, the simple level of structural analysis and absence of specialized content analysis reflected in Level 3 tagging may make it desirable for some, depending on project priorities, to include page images in order to provide users with a fuller set of resources.
Level 3 is most suitable for projects with the following characteristics:
All elements specified in Levels 1 and 2, plus the following:
| <FRONT>, <BACK> | Required if present. |
| <P> | Required for paragraph breaks in prose; may also be used for stanzas of verse, with <LB> marking line breaks. |
| <FIGURE> | Required to indicate figures other than page images. |
| <HI> | Required to indicate changes in typeface. REND attribute is optional. |
| <NOTE> | All notes must be encoded. It is also recommended that notes that extend beyond one page be combined into one <NOTE> element. Marginal notes without a reference should occur at the beginning of the paragraph to which they refer, with the PLACE attribute set to "margin". |
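The Level 3 elements in the table above might combine as in the following sketch. All text content is invented. The REND value "italic" and the PLACE value "foot" are illustrative choices rather than values this document prescribes, and placing the footnote at its point of reference is only one possible arrangement, assumed here for the example.

```sgml
<DIV1 TYPE="section" N="3">
<HEAD>Chapter III</HEAD>
<!-- marginal note without a reference: placed at the start of its paragraph -->
<P><NOTE PLACE="margin">Arrival in port</NOTE>We reached the harbour at
dawn, where the <HI REND="italic">Aurora</HI> lay at anchor.<NOTE
PLACE="foot" N="1">A footnote that ran across two pages in the source,
combined here into a single element.</NOTE></P>
<!-- a figure other than a page image -->
<FIGURE><FIGDESC>Engraving of a ship at anchor.</FIGDESC></FIGURE>
<P>A second paragraph, marked at its own paragraph break ...</P>
</DIV1>
```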
NOTE ON <NOTE>:
For processing reasons, it may be desirable to move footnotes from their original location in the text. If left at the bottom of a page, a note may become included in another paragraph or section of the encoded text, and thus separated from its reference. There are options for placement of footnotes if they are moved:
Purpose: To create text that can stand alone as electronic text, identifies hierarchy and typography, specifies function of textual and structural elements, and describes the nature of the content and not merely its appearance. This level is not meant to encode or identify all structural, semantic or bibliographic features of the text.
Rationale: Greater description of function and content allows for:
Texts encoded at Level 4 are able to stand alone as part of a library collection, and do not require images in order for them to be read by students, scholars and general readers. This level of TEI encoding allows them to be displayed or printed in a variety of ways suitable for classroom or scholarly use.
Level 4 texts contain tags and attributes that describe content. For example, lines of verse are tagged with <L>; the <P> tag is reserved for true paragraphs. Attributes of the text that contribute to meaning are preserved, such as indentation of lines of verse and typography. These are textual features that are not encoded at lower levels and that allow the text to be used and understood fully independent of images.
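A short sketch of the verse tagging described above, using the TEI Lite <LG> and <L> elements. The lines are invented, and the TYPE values and the REND value "indent" are assumptions chosen for illustration; a project would settle its own vocabulary for such attribute values.

```sgml
<DIV1 TYPE="poem" N="1">
<HEAD>Song</HEAD>
<LG TYPE="stanza">
<L>The first line of the stanza stands at the margin,</L>
<!-- indentation in the source is recorded, not discarded -->
<L REND="indent">the second is indented in the printed source,</L>
<L>the third returns to the margin,</L>
<L REND="indent">and the fourth is indented again.</L>
</LG>
</DIV1>
```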
The ability to stand alone as text means that Level 4 texts can be uploaded, downloaded, and delivered quickly, and require less storage space than collections with page images.
Finally, functionally accurate tagging in Level 4 texts allows them to be searched or displayed in sophisticated ways. For example, a searcher could limit his or her search in a dramatic text to stage directions or to the speeches of a particular character. In a volume of poetry published by subscription, a search could be confined to names that appear in lists, thus limiting a search to names of people who subscribed to a particular volume. This ability to limit searches becomes more significant as textbases become larger, and thus is of great importance to the library community as it attempts to build into the initial design and implementation of textbases features needed to enhance interoperability.
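The two search scenarios above depend on markup along the following lines. This is a hedged sketch with invented content: the speaker names, the subscriber names, and the TYPE values "scene" and "subscribers" are hypothetical and serve only to show where a search could be confined.

```sgml
<!-- dramatic text: speeches and stage directions are distinguished -->
<DIV1 TYPE="scene" N="1">
<STAGE>Enter the Steward, carrying a lamp.</STAGE>
<SP><SPEAKER>Steward</SPEAKER>
<L>Who knocks so late upon my master's door?</L>
</SP>
<SP><SPEAKER>Traveller</SPEAKER>
<L>A friend, who asks no more than fire and bread.</L>
</SP>
<STAGE>Exeunt.</STAGE>
</DIV1>

<!-- subscription list: names inside the list can be searched separately -->
<DIV1 TYPE="subscribers" N="2">
<HEAD>List of Subscribers</HEAD>
<LIST TYPE="subscribers">
<ITEM><NAME>Mrs. A. Example</NAME></ITEM>
<ITEM><NAME>Rev. B. Sample</NAME></ITEM>
</LIST>
</DIV1>
```

A query limited to <STAGE> content, to <SP> elements with a given <SPEAKER>, or to <NAME> elements inside a <LIST> then retrieves only the intended kind of text.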
Level 4 is most suitable for projects with the following characteristics:
In considering a Level 4 TEI digitization project, an academic library should consult with faculty members and collection bibliographers, and ask the following question: Is this collection of texts one that the library would purchase if it were available commercially? If so, the benefits of a Level 4 project are many, for the result is a freely available collection of texts owned and administered by the library community, thus free of licensing restrictions and ongoing access charges.
Level 5 texts are those that require subject knowledge, and encode semantic, linguistic, prosodic or other elements beyond a basic structural level.