Victorian Women Writers Project

A (very brief) Introduction to SGML

(Thanks to David Seaman of the University of Virginia for permission to copy from his The Electronic Text Center Introduction to TEI and Guide to Document Preparation.)


The texts in the Victorian Women Writers Project collection of electronic texts are tagged using Standard Generalized Markup Language (SGML), a system for describing structural divisions in text (title-page, chapter, scene, stanza, etc.), typographical elements (changes in typeface, special characters, etc.), and other textual features (grammatical structure, location of illustrations, variant forms, etc.).

SGML tags consist of ASCII data only; they are not proprietary to a particular computer program. This sets them apart from, say, the codes in a WordPerfect document, which belong to, and are meaningful only within, the WordPerfect program. And while a WordPerfect code defines something by its visual appearance (a word is italicized), SGML is designed to describe the class of information to which the word or phrase belongs. Italics can be used for a variety of purposes, and most SGML tagsets can clearly distinguish an emphatic word from a book's title or a chapter heading.

By recording the structure of a text, such tags allow one to use an SGML search program to constrain searches to particular elements: one cannot limit a search to a single chapter in a novel if there are no markers in the text for chapter divisions; one cannot view a quotation from a play in the context of a scene if the scenes are not delimited.
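The idea of a search constrained to one structural element can be sketched in a few lines of Python. This is only a toy illustration, not the project's search software: the sample text, the search function, and the regular expression are all assumptions made for the example, and it presumes chapters are marked as non-nested <div type="Chapter" n=...> elements. A real SGML search program would parse the document against its DTD rather than pattern-match on the tags.

```python
import re

# Hypothetical two-chapter sample, tagged with <div> elements.
SAMPLE = """<div type="Chapter" n=1>
<head>The fianc&eacute;e</head>
<p>She waited by the garden gate.</p>
</div>
<div type="Chapter" n=2>
<head>The letter</head>
<p>The garden was empty when the letter came.</p>
</div>"""

# Matches one non-nested <div ... n=N> ... </div>.
# SGML allows the attribute value to be quoted or unquoted.
DIV = re.compile(r'<div\b[^>]*\bn="?(\w+)"?[^>]*>(.*?)</div>', re.DOTALL)


def search(text, word, chapter=None):
    """Return (chapter number, line) pairs for lines containing word.

    If chapter is given, only that <div> is searched.
    """
    hits = []
    for match in DIV.finditer(text):
        number, body = match.group(1), match.group(2)
        if chapter is not None and number != str(chapter):
            continue
        for line in body.splitlines():
            if word.lower() in line.lower():
                hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    print(search(SAMPLE, "garden"))             # searches every chapter
    print(search(SAMPLE, "garden", chapter=2))  # searches chapter 2 only
```

Without the <div> markers there would be nothing for the chapter argument to select on, which is the point made above.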

A chapter whose title should appear in italics could be tagged like this:
<div type="Chapter" n=1>
<head rend="italics">The fianc&eacute;e </head>
</div>

Features to notice:

More on Entity References

Since the SGML system represents text through a limited number of ASCII characters, any character that falls outside this ASCII group must be represented with a special character tag, known as an entity reference. Each entity reference consists of a brief descriptive term with an ampersand (&) at the beginning and a semicolon (;) at the end. So, the character "à" appears as &agrave;. Note that, since the ampersand marks the beginning of each entity reference, an ampersand that appears in the text requires its own entity: &amp;. Note also that the character used to separate words is generally an "em dash" and is encoded as &mdash;:

And for our sakes--ours--the freed ones,
<l>And for our sakes&mdash;ours&mdash;the freed ones,</l>

The character that joins words is generally a hyphen:

'Tis the triumph-time of grace!
<l>&rsquo;Tis the triumph&hyphen;time of grace!</l>

Note also that apostrophes are encoded as right single quotation marks: &rsquo;

Here is a list of common entity references. A fuller list can be found in Appendix C of Electronic Manuscript Preparation and Markup: American National Standard for Electronic Manuscript Preparation and Markup (available in LETRS at Z283.E43 E428 1991).

Character	Name				Entity ref.
á		a acute		 		&aacute;
à		a grave 			&agrave;
â		a circumflex			&acirc;
æ 		ae ligature (lower case)	&aelig;
Æ		AE ligature (upper case)	&AElig;
&		ampersand			&amp;
:		colon				&colon;
é		e acute				&eacute;
è		e grave				&egrave;
ê		e circumflex			&ecirc;
-		hyphen				&hyphen;
“		left double quotation		&ldquo;
‘		left single quotation		&lsquo;
--		em dash				&mdash;
”		right double quotation		&rdquo;
’		right single quotation		&rsquo;
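
The table can be applied mechanically. The following Python sketch is one possible way to do so and is not part of the project's own tooling; the function name encode_entities and the ENTITIES dictionary are invented for the example. It assumes, following the verse examples above, that a typed "--" stands for an em dash and that a straight apostrophe stands for a right single quotation mark. Straight double quotation marks are left untouched, since the script cannot tell an opening mark from a closing one.

```python
# Character-to-entity map taken from the table of common entity references.
ENTITIES = {
    "&": "&amp;",
    "\u00e1": "&aacute;",  # a acute
    "\u00e0": "&agrave;",  # a grave
    "\u00e2": "&acirc;",   # a circumflex
    "\u00e6": "&aelig;",   # ae ligature (lower case)
    "\u00c6": "&AElig;",   # AE ligature (upper case)
    ":": "&colon;",
    "\u00e9": "&eacute;",  # e acute
    "\u00e8": "&egrave;",  # e grave
    "\u00ea": "&ecirc;",   # e circumflex
    "-": "&hyphen;",
    "\u201c": "&ldquo;",   # left double quotation
    "\u2018": "&lsquo;",   # left single quotation
    "\u2014": "&mdash;",   # em dash
    "\u201d": "&rdquo;",   # right double quotation
    "\u2019": "&rsquo;",   # right single quotation
}


def encode_entities(text):
    """Replace each character listed in ENTITIES with its entity reference."""
    # Assumption: a typed "--" means an em dash and a straight apostrophe
    # means a right single quotation mark, as in the verse examples.
    text = text.replace("--", "\u2014").replace("'", "\u2019")
    # Working one character at a time means the ampersands that begin the
    # generated entity references are never themselves re-encoded.
    return "".join(ENTITIES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(encode_entities("And for our sakes--ours--the freed ones,"))
    print(encode_entities("'Tis the triumph-time of grace!"))
```

The script handles only the text of a line; the surrounding tags, such as <l> and </l>, would still be added separately.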
