Golder Encoding Conventions
Introduction
The ‘Golder Project’ involves publishing William Golder’s poetry in electronic form, together with contextual material that will assist a reader to interpret this early New Zealand poetry.
There are four volumes of poetry, whose short titles are: NZ Minstrelsy (1852); Pigeons’ Parliament (1854); NZ Survey (1867); and Philosophy of Love (1871).
The electronic version of these volumes is in the form of XML files created in accordance with the TEI guidelines (TEI, 2003).
These specifications have been developed through the initial work on NZ Minstrelsy and then revised in the light of issues raised by NZ Survey. They are intended to provide a consistent framework for the encoding of all the volumes.
Because typographical practice has changed since these texts were published, the punctuation has been modified to conform with modern practice. The notable change is the exclusion of a thin space before each semicolon, colon, exclamation mark or question mark, as well as after opening quotation marks. Where the original text has an obvious typographical error this is marked up in the electronic text.
Mark-up Standards
General
The <TEI.2 ...> tag specifies the main language for the text (TEI, 2003: para 3.5): <TEI.2 lang="EN">
Document Structure
Each volume is divided into three major blocks, 'front', 'body', and 'back'.
The 'front' contains the material from all pages in front of the poetry. Typically, this includes the title page(s), preface, and contents. Some volumes have other 'front' material, for example, NZ Survey has a dedication and a prospectus.
The 'back' contains the material from the pages after the poetry. For example, NZ Minstrelsy has a list of subscribers, a list of errata, and a prospectus.
Major Divisions
Each major block is divided into 'divisions'. For example:
Title Page: <titlePage type="full" id=""></titlePage>
Other major divisions: <div1 type="preface" id=""></div1>
The content of the 'id' attributes is described in 'ID attributes' below.
The front and back matter each have several major divisions, such as 'preface' or 'contents'. The body has one or more major divisions, depending on the structure of the volume. For example, the body of NZ Minstrelsy has two major divisions, identified as 'New Zealand Minstrelsy' and 'Appendix', whereas the body of NZ Survey has a single major division.
Subdivisions (Poems, Cantos)
Within the 'body', each poem or song is placed in a separate subdivision using the <div2> tag. For example:
- poems that don’t have a ‘tune’:
<div2 type="poem" n="" id=""></div2>
- poems that have a ‘tune’:
<div2 type="song" n="" id=""></div2>
Some poems are divided into cantos, for example, the title poem in NZ Survey. These subdivisions are identified with <div3> tags: <div3 type="canto" n="1" id=""></div3>
In general, the major divisions of the ‘front’ and ‘back’ blocks are not subdivided unless they contain poetry.
Line Groups (Stanzas)
The <lg> tag identifies stanzas and line groups within stanzas.
<lg type="stanza" n=""></lg>
<lg type="chorus" n=""></lg>
When a stanza consists of identifiable blocks these are marked up as ‘nested’ line groups. In the following example, the mark-up has been simplified to show the line group structure:
<lg type="chorus" n="1">
<l n="1">Let's go a bushranging, thou fairest of lassies ;</l>
<l n="2">Let's go a bushranging, and visit each scene,</l>
<l n="3">Whose beauties unchanging which nought o'er surpasses,</l>
<l n="4">While clad in mantles of gay evergreen.</l>
</lg>
<lg type="stanza" n="2">
<lg>
<l n="5">The morning delights us, all nature invites us,</l>
<l n="6">To taste her enjoyments wherever we rove ;</l>
<l n="7">Then, come, let us wander where streamlets meander,</l>
<l n="8">Or through the dark forest, or pine shady grove.</l>
</lg>
<lg>
<l n="9"><seg>Chorus—</seg>Let's go a bushranging, &c.</l>
</lg>
</lg>
(Golder, 1852: 'A Bushranging')
Lines
Each line of poetry is identified using the line tag, <l>.
A ‘line’ is considered to begin at the first letter or punctuation mark, not at the page margin. This ensures that search facilities will return only the text of the poetry itself.
Example: <l n="1">Through Hutt's vale the Erratonga</l>
(Golder, 1852, 'Erratonga')
Some poems have text which is not properly part of the poetry. To ensure that the search facilities can remove this additional text, it is identified using the seg element, for example (mark-up simplified):
<l n="9"><seg>Chorus—</seg>Let's go a bushranging, &c.</l>
(Golder, 1852, 'A Bushranging')
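As a rough illustration of how a search facility might drop the seg content, the following Python sketch uses the standard-library ElementTree parser. The function name searchable_text is hypothetical, and the sketch only handles seg elements that are direct children of the line. Note that in the XML source the ampersand in '&c.' has to be written as an entity, as described under 'Reserved Characters'.

```python
import xml.etree.ElementTree as ET

def searchable_text(line_xml):
    """Return the text of an <l> element, leaving out any direct <seg> child."""
    line = ET.fromstring(line_xml)
    parts = [line.text or ""]
    for child in line:
        if child.tag != "seg":
            # Keep the text of every other child element (abbr, orig, ...).
            parts.append("".join(child.itertext()))
        # Text that follows a child element belongs to the line itself.
        parts.append(child.tail or "")
    return "".join(parts)

# The example line above; \u2014 is the dash after 'Chorus'.
sample = (
    '<l n="9"><seg>Chorus\u2014</seg>'
    "Let's go a bushranging, &amp;c.</l>"
)
print(searchable_text(sample))  # Let's go a bushranging, &c.
```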
Page and Line breaks
Page breaks
Page break tags are inserted wherever there is a page break in the original text. When a new poem begins on the next page, the <pb> tag is placed between the end of the previous division and the beginning of the next:
</div2>
<pb id="pgGOLMIN027" n="23"/>
<div2 type="song" n="15" id="">
(Golder, 1852: pg. 23)
The page break number and id refer to the page following the <pb> tag.
<pb> tags may be inserted anywhere in the XML file. If the page break occurs within a poem, the tag is inserted between the appropriate lines and/or line groups, for example:
<l n="16">The night-bird croaks its song.</l>
<pb id="pgGOLMIN017" n="13"/>
<l n="17">The clearing, filled with golden grain,</l>
(Golder, 1852: 'The Bushman’s Harvest Home')
Line Breaks and Hyphenation
Many output formats do not use the original line breaks and hyphenation. To keep the option of providing outputs which do, line breaks and hyphenation are marked up so that style sheets can be programmed to either respect or ignore the original layout. This is achieved using the orig element. Example:
employment, when I used to sit in my lonely bush<orig reg=" "
rend="line_break"><lb/></orig>
cottage musing over the fire in the long winter <orig reg="evenings"
rend="line_break">even-<lb/>ings</orig>. As the composing of the
several pieces then<orig reg=" " rend="line_break"><lb/></orig>
(Golder, 1852, 'Preface')
To distinguish between this and other uses of the <orig> tag (see below), the rend attribute is used to instruct style sheets that these tags contain information about line breaks and hyphenation.
Similarly, original line breaks in poetry are identified as follows:
<l n="1">Thwack, thwack, bounds the flail now on ev'ry thrashing<orig
reg=" " rend="line_break"><lb/></orig>floor,</l>
(Golder, 1852, 'The Thrashing Floor')
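The choice between respecting and ignoring the original layout could be sketched as follows. This is a Python stand-in for what a style sheet would do, not the project's actual style sheet; the function name render, the keep_layout flag, and the <p> wrapper around the sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def render(elem, keep_layout):
    """Flatten an element to plain text, honouring or ignoring original line breaks."""
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "orig" and child.get("rend") == "line_break" and not keep_layout:
            # Ignore the original layout: substitute the regularised form.
            out.append(child.get("reg", ""))
        elif child.tag == "lb":
            # Only emit a newline when the original layout is wanted.
            out.append("\n" if keep_layout else "")
        else:
            out.append(render(child, keep_layout))
        out.append(child.tail or "")
    return "".join(out)

# Part of the 'Preface' example above, wrapped in a hypothetical <p>.
sample = (
    '<p>lonely bush<orig reg=" " rend="line_break"><lb/></orig>'
    'cottage musing over the fire in the long winter '
    '<orig reg="evenings" rend="line_break">even-<lb/>ings</orig>.</p>'
)
root = ET.fromstring(sample)
print(render(root, keep_layout=False))
print(render(root, keep_layout=True))
```

With keep_layout set to False the hyphenated word is rejoined as 'evenings'; with True the output breaks after 'bush' and after 'even-', as in the printed page.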
Epigraphs
In the Golder texts, epigraphs are usually quotations from the work of other poets. These are marked up as follows:
<epigraph rend="lindent(1.5)">
<lg>
<l><seg>“</seg>Ah ! who can tell how hard it is to climb</l>
<l>The steep, where fame’s proud temple shines afar.<seg>”</seg></l>
</lg>
<ab><hi rend="right"><name type="person">B<hi rend="smallcaps">EATTIE.</hi></name></hi></ab>
</epigraph>
(Golder, 1852: Appendix pg. iii)
Note that the quotation marks are distinguished from the text of the poetry itself using seg elements.
ID Attributes
The global id attribute is used for several purposes, including identifying the location of footnotes and figure references and providing points to which hyperlinks can take the reader.
To satisfy the requirement for id attribute values to be unique, the following system is used throughout the Golder etexts.
The first part of each id attribute value consists of a prefix followed by the code for the text. The following types of id attributes are used in NZ Minstrelsy:
Page break | pgGOLMIN.. |
Footnote | fnGOLMIN.. |
Location of footnote | fntgGOLMIN.. |
A numerical suffix completes the id attribute value. For example, a page break in NZ Minstrelsy is marked up like this:
<pb id="pgGOLMIN027" n="23"/>
(Golder, 1852: pg. 23)
ID attribute values for NZ Survey use 'GOLNZS'. Values for the other two volumes are not yet defined.
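The naming scheme can be expressed as a small helper. This is a sketch only: the function names are hypothetical, and the suffix widths (three digits for page breaks, two for footnotes) are inferred from the examples in this document rather than stated as a rule. The page break prefix 'pg' follows the mark-up examples.

```python
def make_id(prefix, text_code, number, width):
    """Build an id value: prefix + text code + zero-padded numerical suffix."""
    return f"{prefix}{text_code}{number:0{width}d}"

def duplicate_ids(ids):
    """Return the id values that occur more than once, in sorted order."""
    seen, dupes = set(), set()
    for value in ids:
        if value in seen:
            dupes.add(value)
        else:
            seen.add(value)
    return sorted(dupes)

print(make_id("pg", "GOLMIN", 27, 3))    # pgGOLMIN027
print(make_id("fn", "GOLMIN", 2, 2))     # fnGOLMIN02
print(make_id("fntg", "GOLNZS", 2, 2))   # fntgGOLNZS02
```

Because the text code is part of every value, the same suffix can be reused in NZ Minstrelsy and NZ Survey without the values clashing; duplicate_ids can be run over all id values collected from the files to confirm this.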
Footnotes & Figure References
The TEI poetry tags place severe limitations on the type of material which can appear within a block of poetry. Nineteenth-century typesetters faced no such limitations, and could mix poetry and prose at will. For example, in NZ Minstrelsy, some pages contain footnotes which from an etext perspective appear in the middle of a poem.
Because of the difficulties of mixing poetry and prose in an XML file, items such as footnotes and illustrations (including scanned images of the original text) are separated from the poetry, and placed in a separate division at the end of the electronic text.
Figure References
Figure references are linked to the page breaks so that the style sheet can insert a thumbnail of the figure at the top of each ‘page’ of the output file.
To do this, each figure reference is placed inside a note block which is linked to the appropriate page break. Example:
<note type="illustration" target="pgGOLMIN027">
<figure entity="GOLMIN027">
<figDesc>"New Zealand Minstrelsy": Page 23.</figDesc>
</figure></note>
The target attribute matches the id attribute of the corresponding page break.
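The link between a figure note and its page break might be resolved along these lines. The function name figures_by_page and the <text> wrapper are assumptions for the sake of a self-contained example; a real style sheet would work on the full file.

```python
import xml.etree.ElementTree as ET

def figures_by_page(root):
    """Map each page-break id to the entity of the figure whose note targets it."""
    page_ids = {pb.get("id") for pb in root.iter("pb")}
    result = {}
    for note in root.iter("note"):
        if note.get("type") != "illustration":
            continue
        figure = note.find("figure")
        target = note.get("target")
        # Only keep notes that point at a page break which actually exists.
        if figure is not None and target in page_ids:
            result[target] = figure.get("entity")
    return result

# The page break and figure note from the examples above, in a hypothetical wrapper.
sample = """<text>
<pb id="pgGOLMIN027" n="23"/>
<note type="illustration" target="pgGOLMIN027">
<figure entity="GOLMIN027">
<figDesc>"New Zealand Minstrelsy": Page 23.</figDesc>
</figure></note>
</text>"""
print(figures_by_page(ET.fromstring(sample)))  # {'pgGOLMIN027': 'GOLMIN027'}
```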
Footnotes
Footnotes are inserted at the end of the division to which they belong. For example:
<note type="footnote" target="fntgGOLMIN02" id="fnGOLMIN02">
* Alluding to... in former years.
</note>
</div2>
(Golder, 1852: 'Stanzas, Written while on the voyage...') (text has been abbreviated)
The footnote target attribute value matches the id attribute of an <anchor> tag showing where the footnote should be inserted in the output file. Example: <anchor id="fntgGOLMIN02"/>
Where the text has a reference to a footnote, a hyperlink is provided.
The reference from the text to the footnote is marked using the <ref> tag. The target attribute of the <ref> matches the id attribute of the appropriate footnote. Example:
<l n="73">Oh happy plan!<ref type="footnote"
target="fnGOLMIN02">*</ref>—ingenuously devised!—</l>
(Golder, 1852: 'Stanzas, Written while on the voyage...')
Footnote and <ref> id attribute values are numbered sequentially from 01, in the order in which the footnotes appear in the original text. Where possible, the numeric suffixes of the <ref> and <note> id attributes are equivalent.
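The two links described in this section (reference to footnote, footnote to anchor) lend themselves to an automated consistency check. The following is a minimal sketch, assuming the whole division is available as one element; the function name footnote_problems is hypothetical, and the position of the <anchor> in the sample is an assumption, since the examples above do not show it in context.

```python
import xml.etree.ElementTree as ET

def footnote_problems(root):
    """List broken links between <ref>, footnote <note> and <anchor> elements."""
    anchors = {a.get("id") for a in root.iter("anchor")}
    notes = {n.get("id"): n for n in root.iter("note") if n.get("type") == "footnote"}
    problems = []
    for ref in root.iter("ref"):
        if ref.get("type") == "footnote" and ref.get("target") not in notes:
            problems.append(f"ref target {ref.get('target')} has no matching note")
    for note_id, note in notes.items():
        if note.get("target") not in anchors:
            problems.append(f"note {note_id} target {note.get('target')} has no matching anchor")
    return problems

# Built from the examples above; the text of the line and note is abbreviated.
sample = """<div2 type="poem" n="" id="">
<l n="73">Oh happy plan!<ref type="footnote" target="fnGOLMIN02">*</ref></l>
<anchor id="fntgGOLMIN02"/>
<note type="footnote" target="fntgGOLMIN02" id="fnGOLMIN02">* Alluding to... in former years.</note>
</div2>"""
print(footnote_problems(ET.fromstring(sample)))  # []
```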
Words and Phrases
Regularised Spelling
There are at least two ways in which regularised spelling can be used. First, it allows a word search to return words with abbreviated or archaic spellings. Second, it makes it possible to provide a ‘glossed’ version of the text for readers unfamiliar with nineteenth-century English, or with the dialect used by the author.
A variety of methods are used to regularise spelling, depending on the word involved. 'Abbreviations' and 'Linguistically Distinct' words are covered in separate sections below.
The abbr and orig tags are used in preference to their respective mirrors, expan and reg. This ensures that if the XML tags are removed, the etext remains faithful to the original.
When a word has unusual or obsolete spelling (with respect to ?Oxford, 2001?), and is not otherwise marked up, the <orig reg="regularised spelling"></orig> construction is used:
<orig reg="enterprise">enterprize</orig>
(Golder, 1852: 'Come to the Bush')
If the original text was capitalised, the regularised spelling is capitalised so that if the regularised spelling is substituted for the original in a specific output text, the capitalisation is retained.
<abbr expan="To escape" type="elision">T'escape</abbr>
<abbr expan="to enjoy" type="elision">t'enjoy</abbr>
Sometimes, a composite mark up is necessary, for example:
<orig reg="showed"><abbr expan="shewed"
type="elision">shew'd</abbr></orig>
(Golder, 1867: 'New Zealand Survey')
Archaic pronouns such as 'thee', 'thy', 'ye', etc. are not marked up.
When the marked word is a possessive, the entire word is marked up, for example:
<orig reg="civilisation’s">civilization’s</orig>
(Golder, 1867: 'New Zealand Survey')
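The effect of preferring orig and abbr over reg and expan can be seen by flattening the mark-up in two ways. The sketch below is an illustration only: the function name flatten is hypothetical, and the <ab> wrapper simply groups three of the examples from this section so that they can be parsed together; it is not a line from the poems.

```python
import xml.etree.ElementTree as ET

def flatten(elem, regularise=False):
    """Strip the tags; optionally substitute reg/expan values for the original text."""
    out = [elem.text or ""]
    for child in elem:
        if regularise and child.tag == "orig" and child.get("reg") is not None:
            # The outermost regularisation wins, as in the composite example.
            out.append(child.get("reg"))
        elif regularise and child.tag == "abbr" and child.get("expan") is not None:
            out.append(child.get("expan"))
        else:
            out.append(flatten(child, regularise))
        out.append(child.tail or "")
    return "".join(out)

sample = (
    '<ab><orig reg="enterprise">enterprize</orig> '
    '<orig reg="showed"><abbr expan="shewed" type="elision">shew\'d</abbr></orig> '
    '<abbr expan="To escape" type="elision">T\'escape</abbr></ab>'
)
root = ET.fromstring(sample)
print(flatten(root))                   # enterprize shew'd T'escape
print(flatten(root, regularise=True))  # enterprise showed To escape
```

With the tags removed and no substitution, the output is exactly the original wording, which is the property the convention is meant to guarantee.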
Abbreviations
William Golder’s poetry contains a large number of elisions. These are marked up as in the following examples:
<abbr expan="beneath" type="elision">’neath</abbr>
<abbr expan="it is" type="elision">’tis</abbr>
<abbr expan="passed" type="elision">pass’d</abbr>
<abbr expan="dolorous" type="elision">dol’rous</abbr>
<abbr expan="to engulf" type="elision">t’engulf</abbr>
(Golder, 1852: 'Stanzas Extemporaneously written on a stormy night... ')
Elisions which involve modern personal pronouns and auxiliary verbs (he’ll; I’ll) and modern elisions (can’t, that’s) are not marked up.
One of the ways that abbreviation mark up is used is for word searching. When two words have been elided, the mark up always includes both words, irrespective of whether there is a space between them in the original text or not. This ensures that the search device can identify them as separate words. For example:
<abbr expan="The effect" type="elision">Th’ effect</abbr>
(Golder, 1852: 'A Desperate Case.')
<abbr expan="to engulf" type="elision">t’engulf</abbr>
(Golder, 1852: 'Stanzas Extemporaneously written on a stormy night... ')
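A search index built from the expan attribute would then see the elided words separately. The following sketch assumes a hypothetical function index_terms and an <ab> wrapper that groups the two examples above; lower-casing the terms is also an assumption about how the search device normalises words.

```python
import xml.etree.ElementTree as ET

def index_terms(root):
    """Collect the separate, lower-cased words from the expansion of each elision."""
    terms = []
    for abbr in root.iter("abbr"):
        if abbr.get("type") == "elision":
            terms.extend(abbr.get("expan", "").lower().split())
    return terms

# \u2019 is the apostrophe character used in the Golder etexts.
sample = (
    '<ab>'
    '<abbr expan="The effect" type="elision">Th\u2019 effect</abbr> '
    '<abbr expan="to engulf" type="elision">t\u2019engulf</abbr>'
    '</ab>'
)
print(index_terms(ET.fromstring(sample)))  # ['the', 'effect', 'to', 'engulf']
```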
Spelling and Typesetting errors and quirks
In general, unusual spellings are assumed to be intentional, and are marked up with the orig element.
When it is reasonably certain that the text has an error, the <sic> tag is used. Examples:
<orig reg="compelling">compeling</orig>
(Golder, 1867: 'Preface')
But:
<sic corr="scribbling" cert="low">scribling</sic>
(Golder, 1867: 'Preface')
<sic corr="message" cert="high">messsge</sic>
(Golder, 1867: 'New Zealand Survey, Canto Fifth')
The cert attribute records the degree of certainty. In Golder etexts, the possible values are low, high, and certain; certain indicates that the poet has published a correction.
<sic corr="as" cert="certain">so</sic>
(Golder, 1852: 'Sweet Home'. Correction in Golder, 1852: 'Errata')
Where the word is sometimes spelt correctly, and occasionally incorrectly according to modern conventions, then it is marked <sic corr="" cert="high">.
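One way an output routine might use the cert attribute is to apply only those corrections that meet a chosen level of certainty. The sketch below assumes the ordering low, high, certain and a threshold parameter called minimum; neither is prescribed by these conventions. The <ab> wrapper groups the three examples above and is not a line from the text.

```python
import xml.etree.ElementTree as ET

# Assumed ordering of the three cert values used in the Golder etexts.
CERT_ORDER = {"low": 0, "high": 1, "certain": 2}

def apply_corrections(elem, minimum="high"):
    """Flatten to text, replacing <sic> content by corr when cert is at least minimum."""
    out = [elem.text or ""]
    for child in elem:
        level = CERT_ORDER.get(child.get("cert"), -1)
        if child.tag == "sic" and child.get("corr") and level >= CERT_ORDER[minimum]:
            out.append(child.get("corr"))
        else:
            # Below the threshold (or no correction given): keep the original reading.
            out.append(apply_corrections(child, minimum))
        out.append(child.tail or "")
    return "".join(out)

sample = (
    '<ab><sic corr="scribbling" cert="low">scribling</sic> '
    '<sic corr="message" cert="high">messsge</sic> '
    '<sic corr="as" cert="certain">so</sic></ab>'
)
root = ET.fromstring(sample)
print(apply_corrections(root, minimum="high"))     # scribling message as
print(apply_corrections(root, minimum="certain"))  # scribling messsge as
```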
Linguistically Distinct English Words and Phrases
The <distinct> tag is used for English words that are 'linguistically distinct'. This includes archaic words and words from the Scots dialect, but excludes words from languages such as Maori or Latin (see below).
The <distinct> tag has no attribute for providing a gloss in modern NZ English. Where a gloss or regularised equivalent can be found, it is tagged using the <orig> tag. This means that linguistically distinct features are usually tagged with nested <distinct> and <orig> tags. Example:
<distinct type="Scots"><orig reg="from">frae</orig></distinct>
(Golder, 1852: 'Donald's Return')
<distinct type="archaic"><orig reg="gladly">fainly</orig></distinct>
(Golder, 1852: 'The Christian's March')
Obsolete spellings such as 'pourtray', 'engulph', 'shew' are not considered linguistically distinct, since they are merely old spellings of current English words.
Values for the type attribute in the Golder edition are: archaic; dialect; literary; Scots. Where possible, 'archaic' and 'Scots' are preferred, since glosses from dictionaries giving 'literary' or 'dialect' forms may less accurately reflect the poet's native vocabulary.
Languages other than English
Words from languages other than English are preferably identified using the 'lang' attribute which may be applied to any tag. In cases where the electronic text will not provide an English gloss, the <foreign> tag is available. For example:
<foreign lang="LA">ad hoc</foreign>
But the <orig> tag is preferred, as follows:
<orig reg="tummy" lang="MI">puku</orig>
The lang attribute is global and may be applied to any tag; however, its value must match the id attribute of one of the languages declared in the TEI header (see TEI, 2003: 5.4.2):
<langUsage>
<language id="EN">English</language>
<language id="ES">Spanish</language>
<language id="MI">Maori</language>
<language id="LA">Latin</language>
</langUsage>
Language id attributes follow the codes defined in ISO 639.
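The requirement that every lang value be declared in the header can be checked mechanically. In this sketch the function name undeclared_languages is hypothetical, and the sample flattens the real structure of a TEI file: the <langUsage> block is placed directly inside <TEI.2> instead of inside the header, purely to keep the example short.

```python
import xml.etree.ElementTree as ET

def undeclared_languages(root):
    """Return lang values used in the file that have no matching <language id>."""
    declared = {lang.get("id") for lang in root.iter("language")}
    used = {elem.get("lang") for elem in root.iter() if elem.get("lang")}
    return sorted(used - declared)

sample = """<TEI.2 lang="EN">
<langUsage>
<language id="EN">English</language>
<language id="ES">Spanish</language>
<language id="MI">Maori</language>
<language id="LA">Latin</language>
</langUsage>
<foreign lang="LA">ad hoc</foreign>
<orig reg="tummy" lang="MI">puku</orig>
</TEI.2>"""
print(undeclared_languages(ET.fromstring(sample)))  # []
```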
Proper Nouns
<name type="" reg=""></name>
Values for the type attribute in the Golder edition are: bibl; country; org; person; place; ship.
Where appropriate, the language is also identified, using the codes declared in the TEI header (see 'Languages other than English' above).
Hence:
<name type="place" lang="MI" reg="Heretaunga">Erratonga</name>
<name type="place" lang="LA" reg="Scotland">Scotia</name>
<name type="place">Criterion Hotel</name>
(Golder, 1867: pg. 77)
Characters
Certain characters are specified using entity references. There are two reasons for this. First, some characters such as '&' are reserved for use by the computer system, and cannot appear explicitly in the body of the text. Second, some characters are not readily available on regular computer keyboards.
Reserved Characters
These are the ampersand and angle brackets.
The codes are:
&amp; | & | ampersand |
&lt; | < | left angle bracket |
&gt; | > | right angle bracket |
Single and double quotation marks are also reserved; however, these characters are not used in the body of the Golder etexts (see below).
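When text is prepared programmatically, the three reserved characters can be converted with the Python standard library instead of by hand. This is offered as a convenience sketch, not as part of the project's workflow; escape and unescape are the real functions from xml.sax.saxutils, and by default they handle only the ampersand and the two angle brackets.

```python
from xml.sax.saxutils import escape, unescape

raw = "Let's go a bushranging, &c."
encoded = escape(raw)   # the ampersand becomes an entity reference
print(encoded)          # Let's go a bushranging, &amp;c.

# Angle brackets are converted in the same call.
print(escape("a < b > c"))  # a &lt; b &gt; c

# unescape reverses the conversion, for example when producing plain text.
print(unescape(encoded) == raw)  # True
```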
Characters not available on regular keyboards
Quotation Marks and Apostrophes
The quotation marks available on the regular computer keyboard are ambiguous because the same character is used for opening and closing a quotation. The Golder etexts use the following unambiguous characters:
‘ | ‘ | single turned comma, left single quotation mark |
’ | ’ | apostrophe, right single quotation mark |
“ | “ | left double quotation mark |
” | ” | right double quotation mark |
Example
<l>The wisdom of its nature we’ll discuss</l>
(Golder, 1871: The Philosophy of Love, Canto First)
EM Dash
The Golder texts make copious use of the em dash, that is, a dash which is one em in width. The regular computer keyboard does not have such a character, and so it is specified using an entity reference.
<l n="15">Might he yet return? Ah! never!—</l>
(Golder, 1852, "The Penitent's Prayer")
Appendix — Interim mark-up
NZ Minstrelsy contains some interim mark-up.
Indentation
Dummy spaces inserted in front of the line of poetry simulate proper indentation. These are marked as padding characters. Example:
<l n="4"><seg rend="padding"> </seg>Clothes with grandeur both its sides ;</l>
These will need to be replaced with proper formatting facilities before significant interactivity is introduced.
Space between tags
Most HTML systems assume that white space between tags is irrelevant, and thus, where two adjacent words have been tagged, the system tends to lose the space between them. To solve this problem, non-breaking space characters have been inserted between adjacent tags. When the style sheets have been properly programmed, these characters may be removed. Example:
<l>Him, who <abbr expan="never" type="elision">ne’er</abbr> <abbr expan="listened" type="elision">listen’d</abbr> to the voice of praise,</l>
Kevin Cudby
For the Golder Editorial Group
21 January 2004
Commentary
These specifications differ from those employed in other NZETC texts in the following ways:
- Values of id attributes should be unique. In NZ Survey, the attribute values are unique within the text, but are not distinguished from other texts. To ensure consistent linking within and between Golder texts, the naming convention described above incorporates the text code (e.g. 'GOLMIN') into the id attribute value.
- In previous NZETC documents, original line breaks are marked up using the <orig> tag. We follow the same practice but, because the <orig> tag is also required for regularisation of spelling which will support search facilities and other critical apparatus, we have had to introduce a 'rend' attribute to allow style sheets to treat <orig> tags with line break information independently of others. We have investigated alternatives to this method, but these do not appear to be feasible in XML.
- The NZETC usually puts scanned image thumbnails below the appropriate page breaks; however, with the TEI poetry tag set, this became complex. Rather than attempt to document all the rules for inserting figure references, it seems simpler to put them in their own section and have the style sheet move them into the appropriate locations. Similar issues apply to footnotes.
- Existing etexts have a mixture of keyboard quotation marks and the preferred 'rounded' marks specified above. Techbooks are using the rounded double quotation marks but the 'keyboard' single marks.
Exceptions
- Some elisions include highlighted text, and so it is difficult to mark up these elisions as specified above. Some testing may be required to confirm the final mark-up standard for these. This first arose with Philosophy of Love.
References
- Abrams, 1981
- Abrams, M.H.: A Glossary of Literary Terms (4th edition). Japan: CBS Publishing Japan, 1981.
- Golder, 1838
- Golder, William: Recreations for Solitary Hours, consisting of Poems, Songs and Tales, with Notes. Glasgow: George Gallie; Edinburgh: W. Oliphant & Son; London: Simpkins, Marshall & Co; Dublin: J. Robertson; 1838. A microfilm copy is held at the Alexander Turnbull Library, Wellington.
- Golder, 1852
- Golder, William: The New Zealand Minstrelsy: Containing Songs and Poems on Colonial Subjects. Wellington, NZ: R. Stokes and W. Lyon, 1852. The digitised copy is from the Hocken Library, Dunedin.
- Golder, 1854
- Golder, William: The Pigeons' Parliament; A Poem of the Year 1845. In Four Cantos with Notes. To which is added, Thoughts on the Wairarapa, and Other Stanzas. Wellington, NZ: W. Lyon, 1854. The digitised copy is from the Alexander Turnbull Library, Wellington.
- Golder, 1867
- Golder, William: The New Zealand Survey: A Poem in Five Cantoes. With Notes Illustrative of New Zealand Progress and Future Prospects. Also the Crystal Palace of 1851; A Poem in Two Cantoes. With other Poems and Lyrics. Wellington, NZ: J. Stoddard and Co, 1867. The digitised copy is from the Alexander Turnbull Library, Wellington.
- Golder, 1871
- Golder, William: The Philosophy of Love. [A Plea in Defence of Virtue and Truth!] A Poem in Six Cantos, with Other Poems. Wellington, NZ: W. Golder, 1871. The digitised copy is from the Alexander Turnbull Library, Wellington.
- Southward, 1980
- Southward, J.: Practical Printing: A Handbook of the Art of Typography. New York, NY: Garland Publishing, Inc., 1980. Facsimile edition, originally published 1882.
- TEI, 2003
- TEI P4 - Guidelines for Electronic Text Encoding and Interchange (XML-compatible edition). Edited by Sperberg-McQueen, C.M.; & Burnard, L. XML conversion by Bauman, S.; Burnard, L.; DeRose, S.; & Rahtz, S. TEI Consortium, 2003: http://www.tei-c.org.uk/P4X/index.html. Visited 23 November 2003.
- Unicode, 2003
- Unicode Home Page. Mountain View, CA: Unicode, Inc.; 2003. http://www.unicode.org/. Visited 26 November 2003.
- Unicode, 2003a
- The Unicode Consortium. The Unicode Standard, Version 4.0.0, defined by: The Unicode Standard, Version 4.0. Boston, MA: Addison-Wesley; 2003.
- Unicode, 2003b
- Online version of Unicode, 2003a. http://www.unicode.org/. Visited 26 November 2003.
- XML, 1.0
- Bray, T.; Paoli, J.; Sperberg-McQueen, C. M.; & Maler, E.: Extensible Markup Language (XML) 1.0 (Second Edition) W3C Working Draft. Cambridge, MA: World Wide Web Consortium, 2003. http://www.w3.org/TR/WD-html40-970708/cover.html, visited 20 Dec 2003.