TEI Text Encoding in Libraries

Guidelines for Best Encoding Practices

Version 1.0 (July 30, 1999)

Comments to Perry Willett, Indiana University (email: pwillett@indiana.edu)


  1. Introduction
  2. Participants
  3. Background
  4. General Recommendations
  5. Encoding Levels
    1. Level 1: Fully Automated Conversion and Encoding
    2. Level 2: Minimal Encoding
    3. Level 3: Simple Analysis
    4. Level 4: Basic Content Analysis
    5. Level 5: Scholarly Encoding Projects
  6. Attribute Values

I. Introduction

At the TEI and XML in Digital Libraries Workshop held at the Library of Congress on June 30-July 1, 1998, three working groups were formed. Group 2 was charged with developing a set of recommendations for libraries using the TEI Guidelines in electronic text encoding. Representatives from six libraries met at the Library of Congress on November 12-13, 1998. The Task Force met again at ALA Midwinter (January 1999) to incorporate comments and finalize the draft. The revised recommendations were circulated to the conference working group in May 1999 and presented at the joint annual meeting of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing in June 1999. Version 1.0 was circulated for comments in August 1999.


II. Participants


III. Background

Our recommendations are for libraries using the TEILite DTD v1.6. Library text digitization projects vary widely in purpose. With this in mind, the Task Force has attempted to make these recommendations as inclusive as possible by developing a series of encoding levels. These levels are meant to allow for a range of practice, from wholly automated text creation and encoding to encoding that requires expert content knowledge, analysis, and editing.

Encoding Levels 1-4 require no expert knowledge of content; Level 5, in contrast, requires scholarly analysis. Texts at Levels 1-4 can be converted and encoded without the assistance of content experts and can be enriched with more markup at any time. Recommendations for Levels 1-4 are intended for projects wishing to create encoded electronic text with structural markup but minimal semantic or content markup. The encoding levels are cumulative: the requirements at each level incorporate the requirements of all lower levels.

These recommendations are concerned with the text portion of a TEI-encoded document. While there are modest requirements for recording information about the encoding level in the TEI Header, a separate set of recommendations has been developed to address the relationship of TEI Header contents to MARC-format bibliographic data (see the TEI/MARC Best Practices document from Working Group 1).


IV. General Recommendations

  1. The encoding level (as described in this document) should be recorded in the <editorialDecl>, along with an explanation of any deviation from the recommendations.
  2. Electronic text at all levels of encoding should begin with the transcription of the first word on the first leaf of the original work. It may be impractical or undesirable to transcribe and encode certain features of the text, such as publisher's advertisements or indexes, but if at all possible, they should be included as links to page images. Any omissions of material found in the original work should be noted in the <editorialDecl> in the TEI Header.
  3. File naming should follow ISO 9660 conventions: filenames of at most 8 characters with extensions of at most 3 characters, using only A-Z, a-z, 0-9, underscores, and hyphens.
  4. Numbered <DIV>s present advantages to search and indexing software by explicitly communicating the hierarchical level of the section described. One anomaly of the TEI Guidelines is that <DIV0> is not available in <FRONT> or <BACK> matter. Therefore, we recommend the use of numbered <DIV>s throughout the electronic text, always beginning with <DIV1>. Texts at all levels should include at least one <DIV1>.
  5. Page breaks (<PB>) should be encoded at the top of the page and should fall entirely within a <DIV>, never between <DIV>s.
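
Recommendations 1, 4, and 5 above can be illustrated with a minimal sketch of a TEI document. It is not taken from the Task Force's materials: the title, the wording of the <editorialDecl> paragraph, the choice of Level 2, the TYPE values, and the page numbers are all hypothetical placeholders, and the surrounding header elements are only the minimum the TEI Header is assumed to require. Element-name casing follows this document; an XML DTD is case-sensitive and would require the names exactly as it declares them. The sketch is written as well-formed XML, so the empty <PB> carries a trailing slash, which an SGML system would omit.

```xml
<TEI.2>
  <teiHeader>
    <fileDesc>
      <!-- All header content below is placeholder text, not real data -->
      <titleStmt><title>Sample Title (hypothetical)</title></titleStmt>
      <publicationStmt><p>Hypothetical publication statement.</p></publicationStmt>
      <sourceDesc><p>Hypothetical description of the print source.</p></sourceDesc>
    </fileDesc>
    <encodingDesc>
      <!-- Recommendation 1: record the encoding level and any deviations -->
      <editorialDecl>
        <p>Encoded at Level 2 of the TEI Text Encoding in Libraries
           guidelines, Version 1.0. Publisher's advertisements were
           not transcribed (hypothetical deviation note).</p>
      </editorialDecl>
    </encodingDesc>
  </teiHeader>
  <TEXT>
    <FRONT>
      <!-- Recommendation 4: numbered divisions start at DIV1, even in front matter -->
      <DIV1 TYPE="preface">
        <!-- Recommendation 5: page break at the top of the page, inside the DIV -->
        <PB N="iii"/>
        <HEAD>Preface</HEAD>
        <P>Text of the preface.</P>
      </DIV1>
    </FRONT>
    <BODY>
      <DIV1 TYPE="chapter">
        <PB N="1"/>
        <HEAD>Chapter I</HEAD>
        <P>Text of the first chapter.</P>
      </DIV1>
    </BODY>
  </TEXT>
</TEI.2>
```

Placing each <PB> as the first child of its <DIV1> keeps the page break inside the division it belongs to, so software that splits the text by division does not orphan the page marker.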


V. Encoding Levels

VI. Attribute Values

  1. TYPE
    Constructing a list of acceptable attribute values for TYPE that could find wide agreement is impossible. Instead, it is recommended that projects describe the TYPE attribute values used in their texts in the project documentation and that this list be made available to people using the texts. See ABC for Book Collectors by John Carter (7th edition, New Castle, DE: Oak Knoll Books, 1995) for a list of standard names and definitions of bibliographic features of printed books. For those elements where TYPE is not required, such as <HEAD> and <TITLE>, use the attribute values for subtitles and additional titles, but not main titles.

  2. REND
    The REND attribute becomes difficult to use when more than one rendition feature must be recorded. With this in mind, it is recommended that projects employ the following adaptation of "rendition ladders", a concept developed at the Brown Women Writers Project <http://www.wwp.brown.edu>. This concept allows strings of rendition features to be included as one REND value. Rendition ladders consist of categories of renditions, with further defined values included in parentheses.
    REND should only be used to override a default value. For instance, if all text encoded as <HI> is defined as being rendered in italics, there is no reason to encode text as

    <HI REND="font(italics)">

    Combining rendition features would result in a tag such as this:

    <L REND="font(italics)align(right)">

  3. FONT
    italics, bold, fsc (full and smallcaps), smallcap, underlined, gothic

  4. ALIGN
    right, left, center, block

  5. INDENT
    Values in parentheses should indicate the number of tab stops to be indented, e.g., <L REND="indent(1)">

  6. LANG
    Use ISO 639-2 three-character language codes.
