TEI Text Encoding in Libraries

Guidelines for Best Encoding Practices

Version 2.0

August 1, 2001

Editors:

Comments to Perry Willett


Contents

  1. Introduction
  2. Background
  3. General Principles
  4. Using these Guidelines
  5. Encoding Levels
    1. Level 1: Fully Automated Conversion and Encoding
    2. Level 2: Minimal Encoding
    3. Level 3: Simple Analysis
    4. Level 4: Basic Content Analysis
    5. Level 5: Scholarly Encoding Projects
  6. Attribute Values
  7. Examples
  8. Bibliography
  9. Members of the Digital Library Federation Working Group on Encoding Standards

I. Introduction

Lee Ellen's intro goes here.



II. Background

At the TEI and XML in Digital Libraries Workshop, sponsored by the Digital Library Federation (DLF) and held at the Library of Congress on June 30-July 1, 1998, three working groups were formed. Group 2 was charged with developing a set of recommendations for libraries using the TEI Guidelines in electronic text encoding. Representatives from six libraries met at the Library of Congress on November 12-13, 1998. We drafted a Guide to Recommended Practices that will be circulated to Working Group 2 members in May 1999. The Task Force met again at ALA mid-winter (January 1999) to incorporate comments and finalize the draft, and has continued to meet and correspond in order to maintain and improve these guidelines.

The Digital Library Federation has endorsed these guidelines.



III. Using these Guidelines

Our recommendations are for libraries using the TEILite DTD (v1.6). We are aware that library text digitization projects are many and varied, and that they serve different purposes. We drafted these recommendations to be as inclusive as possible by recommending a series of encoding levels. These levels are meant to allow for a range of practice, from wholly automated text creation and encoding, as practiced by the Library of Congress in its Making of America project, to encoding projects requiring content knowledge and editing.

On the whole, however, we distinguish the projects at encoding levels 1-4 from more scholarly encoding projects at level 5, by not requiring content knowledge beyond a basic level as defined in level 4. SGML and the TEI allow for experts to return to the encoded texts and enrich them with more markup. These recommendations are meant for projects whose goal is to create collections of electronic text with structural markup, and minimal semantic or content markup. These recommendations are also meant to be cumulative, so that the recommendations made at one level remain valid at all succeeding levels.

These recommendations are concerned with the text block of a TEI document. Some of the recommendations will include information to be added to the TEI Header, but Working Group I will make recommendations concerning the header, in its report TEI/MARC Best Practices.

Why SGML?

SGML, the Standard Generalized Markup Language, is a set of rules for creating markup languages. HTML is one such markup language that follows the SGML rules. Markup languages following these rules are independent of operating systems, hardware, and application software, and can be used and reused for a variety of purposes. Documents encoded using SGML can be later enhanced.

Documents created following these guidelines cannot be displayed on the WWW with most browsers, so there will not be an immediate way to view them. These SGML files should be thought of as the archival electronic version. It can be relatively simple to convert these files to HTML using search-and-replace or Perl scripts, while retaining the original SGML-encoded files for long-term archival storage.
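
As a rough illustration of the search-and-replace approach, the following Python sketch rewrites a few TEILite tags as HTML. The tag mapping (<HI> to <i>, <HEAD> to <h2>, <PB> shown as a horizontal rule) and the function name sgml_to_html are assumptions made for this example only; they are not part of these recommendations, and a real project would define its own mapping.

```python
import re

# Hypothetical mapping from TEILite element names to HTML element names.
# These pairs are illustrative assumptions, not a recommended stylesheet.
TAG_MAP = {"HI": "i", "P": "p", "HEAD": "h2"}


def sgml_to_html(text):
    """Rewrite a handful of TEILite tags as HTML by search-and-replace."""
    # Page breaks carry no content; show each one as a horizontal rule.
    text = re.sub(r"<PB\b[^>]*>", "<hr>", text, flags=re.IGNORECASE)
    for tei, html in TAG_MAP.items():
        # Opening tags: drop the TEI attributes, which HTML would not understand.
        text = re.sub(r"<%s\b[^>]*>" % tei, "<%s>" % html, text,
                      flags=re.IGNORECASE)
        # Closing tags.
        text = re.sub(r"</%s\s*>" % tei, "</%s>" % html, text,
                      flags=re.IGNORECASE)
    return text


print(sgml_to_html('<P>A <HI REND="font(bold)">word</HI>.</P><PB N="2">'))
```

The conversion is one-way and discards attributes, which is why the SGML-encoded file, not the HTML derived from it, should remain the archival copy.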

Why TEI?

The Text Encoding Initiative began in 1987 as an effort to create a standard set of Document Type Definitions for electronic texts in the humanities. Those involved at this early stage recognized that standardization would be necessary for electronic texts to survive changes in operating systems, hardware and software. They developed a rich set of elements, documented in the Text Encoding Initiative Guidelines. These guidelines and their DTDs contain all the advantages for encoding provided by SGML, as well as providing a full range of possibilities for encoding humanities texts. They also contain a number of superior and vastly important features, including:

Many subsequent projects, particularly the Encoded Archival Description (EAD) Guidelines, have adapted design features of the TEI. People who have used the EAD will be familiar with many features and elements of the TEI, including the header, nested structural divisions, and the ability to enhance.

We believe that the TEI DTDs are simply the best match for encoding works of historical, literary and linguistic interest. There is a list of projects using the TEI Guidelines available at the Text Encoding Initiative Consortium website.

Why TEILite?

The TEILite DTD was developed in 1995 as a subset of the most widely used elements in the TEI tagset, with about 150 elements. Although it is only a subset, we have found that the elements included in TEILite are adequate for most projects, even those aiming at Level 4 encoding. At most, we find that we use only 50-60 elements from TEILite. However, beginning projects should always first determine their encoding goals with a document analysis of sample documents. Only then should one decide which set of elements or which DTD is most appropriate. The authors of these guidelines have worked on a number of projects, all using TEILite, and have found it perfectly adequate for the kind of large-scale encoded electronic text collections created in libraries.

What about XML?

As yet, there is no official XML version of the TEI DTDs. We propose continuing to use the SGML version until the TEI DTDs have been fully converted to XML. The TEI Consortium received a 2-year grant from the National Endowment for the Humanities Division of Preservation and Access in April 2001, to convert all the TEI DTDs to XML.

We do not foresee many difficulties in converting documents created following these guidelines to XML. The only differences will be the case of the element name (since case matters in XML while it does not in SGML), and so-called "empty" elements, such as <PB>. In XML, empty elements require an additional "/" after the element name.
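
The empty-element change can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a conversion tool: <PB> is the only empty element named in the text, so the inclusion of <LB> and <MILESTONE> in the list, and the function name close_empty_elements, are assumptions. The sketch leaves the case of element names untouched, since the correct case depends on the XML DTD that is eventually published.

```python
import re

# Empty elements to close. <PB> is the example named in the text; <LB> and
# <MILESTONE> are included here only as an assumption for illustration.
EMPTY_ELEMENTS = ("PB", "LB", "MILESTONE")


def close_empty_elements(text):
    """Add the trailing "/" that XML requires on empty elements."""
    names = "|".join(EMPTY_ELEMENTS)
    # Match the tag of an empty element that is not already closed with "/",
    # keeping its attributes intact.
    pattern = re.compile(r"<(%s)\b([^>]*?)\s*(?<!/)>" % names, re.IGNORECASE)
    return pattern.sub(r"<\1\2/>", text)


print(close_empty_elements('<P>text<PB N="12">more<PB></P>'))
```

Tags that are already closed are left alone, so the function can be run more than once over the same file without doubling the "/".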



IV. General recommendations:

  1. The encoding level (as described in this document) should be recorded in the <editorialDecl>, along with any deviation from the recommendations.
  2. Electronic text at all encoding levels should begin the transcription from the first word on the first leaf of the work. It may be impractical to transcribe and encode certain features of the text, such as publisher's advertisements or indexes, but if at all possible, they should be included at least as page images. Any omissions should be noted in the <editorialDecl>.
  3. File naming should follow ISO 9660 conventions: 8-character filenames, 3-character extensions, using A-Z, a-z, 0-9, underscores and hyphens.
  4. Numbered <DIV>s present advantages to search and indexing software by explicitly communicating the hierarchical level of the section. One anomaly of the TEI Guidelines is that <DIV0> is not available in <FRONT> or <BACK> matter. Therefore, we recommend the use of numbered <DIV>s throughout the electronic text, always beginning with <DIV1>. Texts at all levels should include at least one <DIV1>.
  5. Page breaks <PB> should occur at the top of the page, and entirely within any DIV.
  6. Tables--I've forgotten what we said about this???
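
The file naming convention in recommendation 3 lends itself to an automated check. The Python sketch below tests a name against the rule as stated there. It assumes that "8-character" and "3-character" are upper limits rather than exact lengths, and the function name is_valid_filename is hypothetical.

```python
import re

# The convention stated in recommendation 3, read as upper limits: a name of
# up to 8 characters, a dot, and an extension of up to 3 characters, using
# only A-Z, a-z, 0-9, underscores and hyphens.
FILENAME_RE = re.compile(r"[A-Za-z0-9_-]{1,8}\.[A-Za-z0-9_-]{1,3}")


def is_valid_filename(name):
    """Return True if the whole name follows the stated convention."""
    return FILENAME_RE.fullmatch(name) is not None


for candidate in ("moby_01.sgm", "leaves-of-grass.sgml", "poem 1.sgm"):
    print(candidate, is_valid_filename(candidate))
```

A check of this kind is easiest to run when files are first named, before the names are recorded elsewhere.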



V. Encoding Levels


VI. Attribute Values

  1. TYPE
    Constructing a list of acceptable attribute values for TYPE that could find wide agreement is impossible. Instead, it is recommended that projects document the TYPE attribute values used in their texts as part of their documentation, and that this list be made available to people using the texts. See ABC for Book Collectors by John Carter (7th edition, New Castle, DE: Oak Knoll Books, 1995) for a list of standard names and definitions of bibliographic features of printed books. For those elements where TYPE is not required, such as <HEAD> and <TITLE>, use TYPE for subtitles and additional titles, but not for main titles.

  2. REND
    The difficulty with REND attributes occurs when it is desired to record more than one rendition feature. With this in mind, we have adapted a concept developed at the Brown Women Writers Project <http://www.wwp.brown.edu>, of rendition ladders. This concept allows for strings of rendition features to be included as one REND value. Rendition ladders consist of categories of renditions, with further defined values included in parentheses.


    REND should only be used to override a default value. For instance, if all text encoded as <HI> is defined as being rendered in italics, there is no reason to encode text as

    <HI REND="font(italics)">

    Combining attributes would result in a tag with attributes such as this:

    <L REND="font(italics)align(right)">

  3. FONT
    italics, bold, fsc (full and smallcaps), smallcap, underlined, gothic

  4. ALIGN
    right, left, center, block

  5. INDENT
    Values in parentheses should indicate the number of tabstops to be indented, e.g., <L REND="indent(1)">

  6. LANG
    Use ISO 639-2 three-character language codes.
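
To show how a rendition ladder might be read by software, the following Python sketch splits a REND value into its categories and checks the FONT, ALIGN and INDENT values listed above. It is an illustration only: the function name parse_rend is hypothetical, and the sketch assumes that categories are written in lower case with no spaces between them, as in the examples in this section.

```python
import re

# Values listed in this section for the FONT and ALIGN categories.
ALLOWED = {
    "font": {"italics", "bold", "fsc", "smallcap", "underlined", "gothic"},
    "align": {"right", "left", "center", "block"},
}

# One category with its value in parentheses, e.g. font(italics).
RUNG_RE = re.compile(r"([a-z]+)\(([^()]*)\)")


def parse_rend(value):
    """Split a ladder such as "font(italics)align(right)" into a dict."""
    # The whole string must be a run of category(value) pairs.
    if not re.fullmatch(r"(?:[a-z]+\([^()]*\))+", value):
        raise ValueError("malformed rendition ladder: %r" % value)
    result = {}
    for category, setting in RUNG_RE.findall(value):
        if category in ALLOWED and setting not in ALLOWED[category]:
            raise ValueError("unknown %s value: %r" % (category, setting))
        # INDENT takes a number of tabstops.
        if category == "indent" and not setting.isdigit():
            raise ValueError("indent must be a number: %r" % setting)
        result[category] = setting
    return result


print(parse_rend("font(italics)align(right)"))
```

Categories not listed in this section are passed through unchecked, so a project can extend the ladder with its own documented categories.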



VII. Examples


VIII. Bibliography


IX. Participants:
