Editors:
- LeeEllen Friedland, Library of Congress
- Nancy Kushigian, University of California, Davis
- Christina Powell, University of Michigan
- Natalia Smith, University of North Carolina at Chapel Hill
- Perry Willett, Indiana University
Comments to Perry Willett
Contents
- Introduction
- Background
- General Principles
- Using these Guidelines
- Encoding Levels
- Attribute Values
- Examples
- Bibliography
- Members of the Digital Library Federation Working Group on Encoding Standards
I. Introduction
Lee Ellen's intro goes here.
II. Background
At the TEI and XML in Digital Libraries Workshop, sponsored by the Digital Library Federation (DLF) and held at the Library of Congress on June 30-July 1, 1998, three working groups were formed. Group 2 was charged with developing a set of recommendations for libraries using the TEI Guidelines in electronic text encoding. Representatives from six libraries met at the Library of Congress on November 12-13, 1998. We drafted a Guide to Recommended Practices that will be circulated to Working Group 2 members in May 1999. The Task Force met again at ALA Midwinter (January 1999) to incorporate comments and finalize the draft, and has continued to meet and correspond in order to maintain and improve these guidelines.
The Digital Library Federation has endorsed these guidelines.
III. Using these Guidelines
Our recommendations are for libraries using the TEILite DTD (v1.6). We are aware that library text digitization projects are many and serve different purposes. To be as inclusive as possible, we recommend a series of encoding levels. These levels are meant to allow for a range of practice, from wholly automated text creation and encoding, as practiced by the Library of Congress in its Making of America project, to encoding projects requiring content knowledge and editing.
On the whole, however, we distinguish the projects at encoding levels 1-4 from more scholarly encoding projects at level 5, by not requiring content knowledge beyond a basic level as defined in level 4. SGML and the TEI allow for experts to return to the encoded texts and enrich them with more markup. These recommendations are meant for projects whose goal is to create collections of electronic text with structural markup, and minimal semantic or content markup. Also, these recommendations are meant to be cumulative, so that recommendations are valid in succeeding levels.
These recommendations are concerned with the text block of a TEI document. Some of the recommendations include information to be added to the TEI Header, but recommendations concerning the header itself are made by Working Group 1 in its report, TEI/MARC Best Practices.
Why SGML?
SGML, the Standard Generalized Markup Language, is a set of rules for creating markup languages. HTML is one such markup language that follows the SGML rules. Markup languages following these rules are independent of operating systems, hardware, and application software, and can be used and reused for a variety of purposes. Documents encoded using SGML can be later enhanced.
Documents created following these guidelines cannot be displayed on the WWW with most browsers, so there will not be an immediate way to view them. These SGML files should be thought of as the archival electronic version. It can be relatively simple to convert these files to HTML using search-and-replace or Perl scripts, while retaining the original SGML-encoded files for long-term archival storage.
Why TEI?
The Text Encoding Initiative began in 1987 as an effort to create a standard set of Document Type Definitions for electronic texts in the humanities. Those involved at this early stage recognized that standardization would be necessary for electronic texts to survive changes in operating systems, hardware and software. They developed a rich set of elements, documented in the Text Encoding Initiative Guidelines. These guidelines and their DTDs offer all the advantages of encoding in SGML and provide a full range of possibilities for encoding humanities texts. They also have a number of important features, including:
- TEI Header, for recording metadata of the electronic file and its source;
- Flexibility in enhancing and modifying elements, attributes, and content models.
Many subsequent projects, particularly the Encoded Archival Description (EAD) Guidelines, have adapted design features of the TEI. People who have used the EAD will be familiar with many features and elements of the TEI, including the header, nested structural divisions, and the ability to enhance.
We believe that the TEI DTDs are simply the best match for encoding works of historical, literary and linguistic interest. There is a list of projects using the TEI Guidelines available at the Text Encoding Initiative Consortium website.
Why TEILite?
The TEILite DTD was developed in 1995 as a subset of the most widely used elements in the TEI tagset, with about 150 elements. Although it is only a subset, we have found that the elements included in TEILite are adequate for most projects, even those aiming at Level 4 encoding. At most, we find that we use only 50-60 elements from TEILite. However, beginning projects should always first determine their encoding goals with a document analysis of sample documents. Only then should one decide which set of elements or which DTD is most appropriate. The authors of these guidelines have worked on a number of projects, all using TEILite, and have found it perfectly adequate for the kind of large-scale encoding of electronic text collections undertaken in libraries.
What about XML?
As yet, there is no official XML version of the TEI DTDs. We propose continuing to use the SGML version until the TEI DTDs have been fully converted to XML. The TEI Consortium received a 2-year grant from the National Endowment for the Humanities Division of Preservation and Access in April 2001, to convert all the TEI DTDs to XML.
We do not foresee many difficulties in converting documents created following these guidelines to XML. The only differences will be the case of the element name (since case matters in XML while it does not in SGML), and so-called "empty" elements, such as <PB>. In XML, empty elements require an additional "/" before the closing ">".
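The two differences can be illustrated with a single page break. This is only a sketch: the lowercase element names in the XML fragment are an assumption, since no official XML version of the TEI DTDs exists yet, and the sample text is invented.

```sgml
<!-- SGML (TEILite): names are not case-sensitive; the empty element has no end tag -->
<P>end of page eleven ...
<PB N="12">
start of page twelve ...</P>

<!-- XML (assumed lowercase names): case is significant; the empty element ends with "/" -->
<p>end of page eleven ...
<pb n="12"/>
start of page twelve ...</p>
```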
IV. General recommendations:
- The encoding level (as described in this document) should be recorded in the <editorialDecl>, along with any deviation from the recommendations.
- Electronic text at all encoding levels should begin the transcription from the first word on the first leaf of the work. It may be impractical to transcribe and encode certain features of the text, such as publisher's advertisements or indexes, but if at all possible, they should be included at least as page images. Any omissions should be noted in the <editorialDecl>.
- File naming should follow ISO 9660 conventions: 8-character filenames, 3-character extensions, using A-Z, a-z, 0-9, underscores and hyphens.
- Numbered <DIV>s present advantages to search and indexing software by explicitly communicating the hierarchical level of the section. One anomaly of the TEI Guidelines is that <DIV0> is not available in <FRONT> or <BACK> matter. Therefore, we recommend the use of numbered <DIV>s throughout the electronic text, always beginning with <DIV1>. Texts at all levels should include at least one <DIV1>.
- A page break <PB> should be encoded at the top of the page it begins, and should fall entirely within a <DIV> rather than between <DIV>s.
- Tables--I've forgotten what we said about this???
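As a sketch of how several of these recommendations fit together, the fragments below show an encoding level recorded in the header and numbered divisions whose page breaks fall inside them. The wording of the <editorialDecl> paragraph, the N values, and the sample content are illustrative assumptions; the rest of the header is omitted.

```sgml
<!-- Header fragment: record the encoding level and any omissions -->
<ENCODINGDESC>
  <EDITORIALDECL>
    <P>Encoded at Level 2 of these recommendations.
    The publisher's advertisements at the end of the volume
    are included as page images only.</P>
  </EDITORIALDECL>
</ENCODINGDESC>

<!-- Text fragment: numbering starts at DIV1; each PB sits inside a DIV -->
<TEXT>
  <BODY>
    <DIV1 TYPE="section" N="1">
      <PB N="1">
      <HEAD>Chapter I</HEAD>
      <P>first paragraph of the chapter ...</P>
      <DIV2 TYPE="section" N="1.1">
        <HEAD>A Subsection</HEAD>
        <P>text of the subsection ...
        <PB N="2">
        continues on the next page ...</P>
      </DIV2>
    </DIV1>
  </BODY>
</TEXT>
```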
V. Encoding Levels
- V.1. LEVEL 1: Fully Automated Conversion and Encoding.
Purpose: To create electronic text with the primary purpose of keyword searching and linking to page images. The primary advantage in using the TEILite DTD at this level is that a TEI Header is attached.
Rationale: The text is subordinate to the page image, and is not intended to stand alone as an electronic text (without page images).
Texts at Level 1 can be created and encoded by fully automated means, using uncorrected OCR of page images ("dirty OCR"), or exporting from existing electronic text files. Only those tags that are necessary to divide the text from the header and facilitate linking to page images are used. Encoding is performed automatically based on artifacts of the OCR or other document creation process (page breaks, for example) and metadata collected during the imaging process. Because this encoding is both minimal and reliable, extensive review of each page of each text is not needed.
Level 1 texts are not electronic texts of the sort that would be undertaken by a library to provide textual analysis capabilities; they are more likely to be initiated by a preservation unit or mass digitization initiative. It may even be presumptuous to call them encoded texts. However, Level 1 texts are fully valid in the SGML sense. In addition to taking advantage of the TEI Header, using the TEILite DTD allows Level 1 texts to be searched in concert with more richly encoded TEILite texts, and further encoding based on document structures could be added at a later time.
Level 1 is most suitable for projects with the following characteristics:
- a large volume of material is to be made available online quickly
- digital images of each page are desired
- the materials will stand up to high speed scanning processes, and are candidates for disbinding if necessary
- no manual intervention will be performed in the text creation process
- the material is of interest to a large community of users who are interested in access to reading copies with keyword searching capability
- sophisticated search and display capabilities based on the structure of the text are not necessary
- extensibility is desired; that is, one desires to keep open the option for a higher level of tagging to be added at a later date
Element recommendations for Level 1:
- <DIV1>: TYPE="section" is the default attribute value.
- <P>: One "container" element per DIV is required.
- <PB>: Required in Level 1. Page images can be linked to the text using ID/IDREF or ENTITYREF attributes. Using ENTITYREF has advantages for maintaining large numbers of image files, but would require modifying the TEILite DTD.
- <FIGURE>: Optional at Level 1. The advantage of using <FIGURE> is the ability to record metadata using <FIGDESC>.
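A Level 1 text body might look like the sketch below, which an automated process could generate from OCR output. The ID values and the sample text are hypothetical, and how each <PB> is resolved to its page image (for example through ID/IDREF) is left to the project.

```sgml
<TEXT>
  <BODY>
    <DIV1 TYPE="section">
      <!-- a single container element holds the uncorrected OCR -->
      <P>
        <PB N="1" ID="p0001">
        uncorrected OCR text of the first page ...
        <PB N="2" ID="p0002">
        uncorrected OCR text of the second page ...
      </P>
    </DIV1>
  </BODY>
</TEXT>
```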
- V.2. LEVEL 2: Minimal Encoding
Purpose: To create electronic text with the primary purpose of keyword searching, linking to page images, and identifying simple structural hierarchy to improve navigation.
Rationale: The text is subordinate to the page image, though navigational markers (textual divisions, heads) are captured, and the text could stand alone as electronic text (without page images) if the accuracy of its contents is suitable to its use and if it is not necessary to display low-level typographic or structural information. Level 2 requires a set of elements more granular than those of Level 1, i.e., bibliographic or structural information below the monographic or volume level, but not requiring a specialist to identify.
Texts at Level 2 can be created and encoded by automated means, based on the typographic elements in the electronic file (bold centered text at the top of the page surrounded by whitespace indicates a new chapter head, and thus a new division, for example), but such automated encoding is unlikely to be fully reliable within one volume or across a large body of material. The decision to embark on the second level of encoding therefore requires a certain amount of human intervention at each textual division and heading. Level 2 texts do not require any specialist knowledge, or manual intervention below the section level.
If the quality of the text warrants, Level 2 texts can be displayed separately from their page images. Even within the page image model, encoding sections and heads will provide greater navigational possibilities than Level 1, and searching can be restricted within particular textual divisions (e.g., searching for two phrases within the same chapter).
NOTE ON DIV
Structural divisions within a text can be difficult to identify, and encoding consistency within a work is essential. The strongest clue of a new <DIV1> is within the table of contents. If there is an entry for a section listed in the table of contents, it is strong evidence of a new DIV1. Other strong evidence for a new <DIV1> includes:
Some weaker evidence:
- Blank page followed by a new heading
- Heading followed by drop cap or ornamental letter
- Numbering scheme associated with headings
- Ornamental device
- Marginal numbered headings
Use other types of chunking if the evidence for a new DIV isn't clear: <P>, <LG>.
Level 2 is most suitable for projects with the following characteristics:
- a large volume of material is to be made available online quickly
- digital images of each page are desired
- the materials will stand up to high speed scanning processes, and are candidates for disbinding if necessary
- the material is of interest to a large community of users who are interested in access to reading copies with keyword searching capability
- rudimentary search and display capabilities based on the large structures of the text are desired
- each text will be checked to ensure that divisions and headers are properly identified
- extensibility is desired; that is, one desires to keep open the option for a higher level of tagging to be added at a later date
Element recommendations for Level 2:
- <FRONT>, <BACK>: Optional.
- <HEAD>: Required if present.
- <DIV1>: TYPE="section" is the default attribute value. It is recommended that the N attribute be included to record the div sequence.
- <P>: One "container" element per DIV is required.
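The sketch below shows what a Level 2 text might add to Level 1: front matter, a <HEAD> for each division, and an N attribute recording the division sequence. The headings, page numbers, and the choice to number divisions separately within <FRONT> and <BODY> are assumptions made for illustration.

```sgml
<TEXT>
  <FRONT>
    <DIV1 TYPE="section" N="1">
      <PB N="iii">
      <HEAD>Preface</HEAD>
      <P>text of the preface ...</P>
    </DIV1>
  </FRONT>
  <BODY>
    <DIV1 TYPE="section" N="1">
      <PB N="1">
      <HEAD>Chapter I</HEAD>
      <P>text of the first chapter ...
      <PB N="2">
      text continues on the next page ...</P>
    </DIV1>
    <DIV1 TYPE="section" N="2">
      <PB N="15">
      <HEAD>Chapter II</HEAD>
      <P>text of the second chapter ...</P>
    </DIV1>
  </BODY>
</TEXT>
```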
- V.3. LEVEL 3: Simple Analysis
Purpose: To create text that can stand alone as electronic text and identifies hierarchy and typography without content analysis being of primary importance.
Assumptions: Level 3 texts could be created easily from existing HTML or word processing documents, giving them the advantage of the TEI Header, interoperability with other TEI collections, and extensibility to Level 4 at a later date. This level generally requires some human editing, but the features to be encoded are determined by the appearance of the text and not content knowledge.
Element recommendations for Level 3:
- <FRONT>, <BACK>: Required if present.
- <P>: Required for paragraph breaks in prose; may be used for stanzas using <LB> for line breaks in verse.
- <FIGURE>: Required to indicate figures other than page images.
- <HI>: Required to indicate changes in typeface. REND attribute is optional.
- <NOTE>: All notes must be encoded. It is also recommended that notes that extend beyond one page be combined into one <NOTE> element. Marginal notes, without reference, should occur at the beginning of the paragraph to which they refer, with the value of the PLACE attribute as "margin".
NOTE ON <NOTE>:
For processing reasons, it may be desirable to move footnotes from their original location in the text. If left at the bottom of a page, a note may become included in another paragraph or section of the encoded text, and thus separated from its reference. There are options for placement of footnotes if they are moved:
- Inline. The note is inserted at the point of reference. The note's reference value (its number or symbol) is recorded in an attribute. No <REF> element is needed with this option.
- End-of-Paragraph. <REF> with target attribute occurs at point of reference. <NOTE> with ID attribute occurs within, but at the end of the paragraph in which the reference occurs.
- End-of-Div. Notes are moved to the end of the <DIV>.
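The first two placement options, together with a marginal note, might be encoded as in the sketch below. The ID value, the use of N for the note number, the PLACE value "foot", and the sample text are assumptions made for illustration.

```sgml
<!-- Inline: the note sits at the point of reference; no REF is needed -->
<P>The treaty was signed that winter<NOTE PLACE="foot" N="1">Text of
the footnote.</NOTE> and ratified in the spring.</P>

<!-- End of paragraph: REF points to the ID of the NOTE -->
<P>The treaty was signed that winter<REF TARGET="n1">1</REF> and
ratified in the spring.
<NOTE ID="n1" PLACE="foot" N="1">Text of the footnote.</NOTE></P>

<!-- Marginal note without a reference: placed at the start of its paragraph -->
<P><NOTE PLACE="margin">Ratification</NOTE>The debate over
ratification lasted several weeks ...</P>
```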
- V.4. LEVEL 4: Basic Content Analysis
Purpose: To create text that can stand alone as electronic text, identifies hierarchy and typography, specifies function of textual and structural elements, and describes the nature of the content and not merely its appearance. This level is not meant to encode or identify all structural, semantic or bibliographic features of the text.
Rationale: Greater description of function and content allows for:
- flexibility of display and delivery
- sophisticated searching within specified textual and structural elements
- the broadest range of uses and audiences
Texts encoded at Level 4 are truly "electronic texts," or "electronic books." They are able to stand alone as part of a library collection, and do not require images in order for them to be read by students, scholars and general readers. This level of TEI encoding allows them to be displayed or printed in a variety of ways suitable for classroom or scholarly use.
Here, texts are tagged functionally: tags and attributes are descriptive of content. For example, lines of verse are tagged with <L>, the <P> tag is reserved for true paragraphs. Front and back matter is encoded and tagged. Attributes of the text that contribute to meaning are preserved, such as indentation of lines of verse and typography. These are textual features that are not encoded at lower levels and that allow the text to be used independent of images.
This ability to stand alone as text means that Level 4 texts can be uploaded, downloaded and delivered quickly, and require less storage space than collections with page images. This has been the level at which library projects at Michigan, Indiana, UC-Davis, Virginia, UNC-CH, and other libraries have encoded large collections of texts.
Finally, functionally accurate tagging in Level 4 texts allows them to be searched or displayed in sophisticated ways. For example, a searcher could limit his or her search in a dramatic text to stage directions or to the speech of a particular character. In a volume of poetry published by subscription, a search could be limited simply to names that appear in lists, thus limiting a search to names of people who subscribed to a particular volume. This ability to limit searches becomes more significant as text-bases become larger, and thus is of great importance to the library community as it attempts to build interoperability into its initial design and implementation of text-bases.
Level 4 is most suitable for projects with the following characteristics:
- the users of the texts are distributed over a wide geographic area
- the users of the texts may have limited storage or display capabilities
- sophisticated search and display capabilities are desired
- the collection is of interest to a currently existing, well defined community of users, such as one that would constitute a market for any published text or collection of texts
- the collection is rare and not available to users in print or other electronic formats
- the texts will be used for pedagogical or scholarly purposes, not just as reading copies
- extensibility is desired; that is, one desires to keep open the option for Level 5 tagging to be added by the scholarly community at a later date
In considering such a Level 4 TEI digitization project, an academic library should consult with faculty members and collection bibliographers, and ask the following question: Is this collection of texts one that the library should purchase if it were available commercially? If so, the benefits of a Level 4 project are many, for the result is a freely available collection of texts owned and administered by the library community, thus free of licensing restrictions and ongoing access charges.
- General Level 4 Recommendations:
- V.4.1. Emphasized text should be encoded as <FOREIGN>, <TITLE>, <EMPH>, as appropriate. Any ambiguous emphasized text should be encoded as <HI>.
- V.4.2. It is recommended that the <SIC> element be used to indicate typographic errors, with corrections noted as the value of the CORR attribute.
- V.4.3. <TITLEPAGE> should include the verso if present, divided by <PB N="verso">. Tables of contents, errata, subscription lists, and "other titles by the same author" should each be included in a separate numbered DIV, as a <LIST> with <ITEM>s. Frontispieces should be encoded as a <FIGURE>, within a separate numbered <DIV> and <P>.
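Recommendations V.4.1 and V.4.2 might be applied as in the sketch below. The sentences are invented for illustration; the choice of element for each emphasized phrase is the encoder's judgment, with <HI> as the fallback when the reason for the emphasis is unclear.

```sgml
<P>She had read <TITLE>Paradise Lost</TITLE> twice, and declared with
a certain <FOREIGN>sang-froid</FOREIGN> that she would
<EMPH>never</EMPH> read it again. The sign over the door said only
<HI>Enquire Within</HI>.</P>

<!-- A typographic error kept as printed, with the correction in CORR -->
<P>The letter was <SIC CORR="received">recieved</SIC> on Tuesday.</P>
```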
- Level 4 Prose:
- V.4.4. Letters that occur within the text body provide some challenges. It is recommended that quoted letters that occur as part of a text (and not collections of letters themselves) be encoded within <q><text><body><div1 type="letter">, with <opener>, <dateline>, <salute>, <signed>, <closer> included as appropriate.
- V.4.5. Quotations that do not occur inline, but are set off typographically in some way, should be encoded as <q>.
- V.4.6. Notes are to be encoded as described in Level 3.
- V.4.7. Use <argument>, <opener>, <epigraph>, <closer>, <trailer>, <add>, <del>, and <unclear> as appropriate.
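A letter quoted within a prose text, as described in V.4.4, might be encoded as in the sketch below. The letter itself is invented for illustration.

```sgml
<P>The next morning he found this letter on the table:</P>
<Q>
  <TEXT>
    <BODY>
      <DIV1 TYPE="letter">
        <OPENER>
          <DATELINE>London, 3 May 1848</DATELINE>
          <SALUTE>My dear Sir,</SALUTE>
        </OPENER>
        <P>I write in haste to say that I cannot come ...</P>
        <CLOSER>
          <SALUTE>Yours faithfully,</SALUTE>
          <SIGNED>J. Smith</SIGNED>
        </CLOSER>
      </DIV1>
    </BODY>
  </TEXT>
</Q>
```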
- Level 4 Drama:
- V.4.8. Cast lists should be encoded as <LIST>s, with <ITEM>s.
- V.4.9. Speeches are encoded as <SP>, with speakers identified within <SPEAKER> elements.
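The two drama recommendations might look like the sketch below. The TYPE values "castlist" and "act" are hypothetical examples of project-defined values, and the speeches are placeholders.

```sgml
<!-- Cast list as a LIST with ITEMs -->
<DIV1 TYPE="castlist">
  <HEAD>Dramatis Personae</HEAD>
  <LIST>
    <ITEM>Hamlet, Prince of Denmark</ITEM>
    <ITEM>Horatio, friend to Hamlet</ITEM>
  </LIST>
</DIV1>

<!-- Speeches, each with its speaker identified -->
<DIV1 TYPE="act" N="1">
  <SP>
    <SPEAKER>Horatio</SPEAKER>
    <P>text of the first speech ...</P>
  </SP>
  <SP>
    <SPEAKER>Hamlet</SPEAKER>
    <P>text of the reply ...</P>
  </SP>
</DIV1>
```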
- Level 4 Verse:
- V.4.10 All verse, even poems without separate stanzas or verse paragraphs, should be contained within a line group element <LG>. This will assist with automated processing and retrieval.
- V.4.11 It is common to see informal divisions within poems, noted by a string of asterisks or periods. These should be encoded as <MILESTONE>s with attribute values of UNIT="typography" and N= indicating the character used and its occurrence, e.g., <MILESTONE UNIT="typography" N="******">.
- V.4.12 It is strongly recommended that indentation of lines of verse be recorded using the REND attribute of <L>.
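Taken together, the verse recommendations might produce markup like the sketch below. The TYPE values "poem" and "stanza" are hypothetical project-defined values, and the lines are placeholders.

```sgml
<DIV1 TYPE="poem">
  <HEAD>Title of the Poem</HEAD>
  <LG TYPE="stanza">
    <L>first line of the stanza,</L>
    <L REND="indent(1)">second line, indented one tab stop,</L>
    <L>third line,</L>
    <L REND="indent(1)">fourth line, indented one tab stop.</L>
  </LG>
  <!-- informal division printed as a row of six asterisks -->
  <MILESTONE UNIT="typography" N="******">
  <LG TYPE="stanza">
    <L>first line after the row of asterisks ...</L>
  </LG>
</DIV1>
```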
- Level 4 Front and Back Matter:
- V.4.13 It is recommended that all prefaces, tables of contents, afterwords, appendices, endnotes and apparatus be encoded. For publisher's advertisements, indexes, and glossaries or other front or back matter that isn't considered of primary importance to the text, there are three options:
- Fully transcribe and encode
- Link to page images (may include an unencoded transcription)
- Omit, with the omission noted in <editorialDecl>
- V.5. LEVEL 5: Scholarly Encoding Projects
Level 5 texts are those that require subject knowledge, and encode semantic, linguistic, prosodic, or other elements beyond basic structural elements.
VI. Attribute Values
- TYPE
Constructing a list of acceptable attribute values for TYPE that could find wide agreement is impossible. Instead, it is recommended that projects document the TYPE attribute values used in their texts as part of their documentation, and that this list be made available to people using the texts. See ABC for Book Collectors by John Carter (7th edition, New Castle, DE: Oak Knoll Books, 1995) for a list of standard names and definitions of bibliographic features of printed books. For those elements where TYPE is not required, such as <HEAD> and <TITLE>, use TYPE for subtitles and additional titles, but not main titles.
- REND
The difficulty with REND attributes arises when more than one rendition feature needs to be recorded. With this in mind, we have adapted the concept of rendition ladders developed at the Brown Women Writers Project <http://www.wwp.brown.edu>. This concept allows strings of rendition features to be included as one REND value. Rendition ladders consist of categories of renditions, with further defined values included in parentheses.
REND should only be used to override a default value. For instance, if all text encoded as <HI> is defined as being rendered in italics, there is no reason to encode text as <HI REND="font(italics)">.
Combining attributes would result in a tag such as this:
<L REND="font(italics)align(right)">
- FONT
italics, bold, fsc (full and smallcaps), smallcap, underlined, gothic
- ALIGN
right, left, center, block
- INDENT
Values in parentheses should indicate the number of tabstops to be indented, e.g., <L REND="indent(1)">
- LANG
Use ISO 639-2 three-character language codes.
VII. Examples
VIII. Bibliography
- Guidelines for Electronic Text Encoding and Interchange (TEI Guidelines). Edited by C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994. 2 vols. Online version at the University of Michigan Humanities Text Initiative.
- The Text Encoding Initiative Consortium
- TEI and XML in Digital Libraries Workshop
- Digital Library Federation
- TEI/MARC Best Practices
- Robinson, Peter. The Transcription of Primary Text Sources Using SGML. Oxford: Office for Humanities Communication, 1994.
IX. Participants:
- LeeEllen Friedland, Library of Congress
- Nancy Kushigian, University of California, Davis
- Christina Powell, University of Michigan
- David Seaman, University of Virginia
- Natalia Smith, University of North Carolina at Chapel Hill
- Perry Willett, Indiana University