The Legal Māori Corpus: texts printed before 1910
A Brief Introduction to the Legal Māori Corpus
The pre-1910 section of the Legal Māori Corpus is now available to researchers via the NZETC website.
This document provides a brief introduction to this part of the Legal Māori Corpus, and an overview of the whole corpus set in the context of its parent project, the Legal Māori Project. A more detailed description will be made available here at a later date.
The Legal Māori Project
The Legal Māori Project sets out to collect, describe and analyse the legal domain as it is expressed in te reo Māori, from the time of colonisation to the present day. The current phase focuses on printed written texts from the time of the earliest written records. A later aim is to collect and study spoken legal texts and other types of written texts in Māori.
The current project phase (2008-2012) was funded by FRST and Victoria University. It sets out to produce a dictionary of legal Māori terms for Western legal concepts, based on a large corpus of printed written texts in Māori. Along the way to the dictionary, the project has produced the Legal Māori Archive, the Legal Māori Lexicon and the Legal Māori Corpus.
He Pātaka Kupu Ture — The Legal Māori Archive
In June 2009 He Pātaka Kupu Ture — the Legal Māori Archive was launched. This is a large collection of legal texts in Māori, digitised and publicly available via the New Zealand Electronic Text Collection: http://nzetc.victoria.ac.nz/tm/scholarly/tei-corpus-legalMaori.html. A description of that part of this project can be found there.
The Legal Māori Lexicon
The lexicon is a list of Māori legal terms derived through several intersecting processes. It is derived partly from the pilot section of the Legal Māori Corpus, and partly from expert reading of legal texts in Māori and of entries in existing general dictionaries and wordlists of Māori. The Legal Māori Lexicon takes the form of a wordlist in Māori, with English glosses. Each of the terms in the lexicon will be researched using the 8-million-word Legal Māori Corpus. More information on the derivation of the lexicon is available, as is the lexicon itself.
The Legal Māori Dictionary
The Legal Māori Dictionary will be completed in manuscript form by early 2012. It is to be a bilingual dictionary, with headwords in Māori, glosses in English, and examples of use for each word and word sense in Māori, taken from the Legal Māori Corpus. It will contain legal Māori terms and also more general ‘legal language’, that is, words that are typical of legal language and occur more frequently there than in the language in general. These words are rather like an equivalent, for the legal domain of the Māori language, of the Academic Word List of English.
The Legal Māori Corpus (LMC)
The Legal Māori Corpus (LMC) has been designed and compiled to provide evidence of the use of Māori terms for Western legal concepts. It was designed as a large lexicographic corpus (approximately 8 million words of legal Māori texts) to provide information to underpin the writing of entries for the Legal Māori Dictionary.
The Legal Māori Corpus is essentially complete as of 30 June 2010. As of this date, the texts in the corpus that were printed before 1910 can be released. These are the texts that are now clear of Crown copyright. The remaining texts that can be cleared of copyright and confidentiality restrictions will be made available in early 2012.
The contents of the Legal Māori Corpus Pre-1910 Corpus collection
The LMC Pre-1910 Corpus collection is now publicly available for research purposes.
It should be noted that the LMC Pre-1910 Corpus collection is not identical to the contents of the Legal Māori Archive: the sets of texts in the two repositories do not match completely. In addition, the LMC Pre-1910 Corpus collection is presented as plain text (.txt) files, so that researchers can easily download the corpus and search the whole collection, or parts of it, for their own purposes.
The Legal Māori Corpus as a whole
The texts making up the LMC have been compiled from the earliest printed legal texts up to those published in the present day. The LMC Pre-1910 Corpus collection makes up approximately 66.8% of the total corpus. More detailed information about each of the pre-1910 texts is available in the accompanying spreadsheet.
The following tables show the size and contents of the LMC and the LMC Pre-1910 Corpus collection as at 30 June 2010. It can be seen from the first table that the whole LMC contains approximately 8 million words of running text in Māori.
|LMC full collection||Hyphens separate words||Hyphens do NOT separate words|
|tokens (running words) in text||8,141,323||8,091,256|
|tokens used for word list||7,895,422||7,847,380|
|types (distinct words)||58,315||64,021|
The figures in these tables were generated by WordSmith Tools 5.0 (developed by Mike Scott) using the ‘English Removed’ set of texts, that is, those with all English words removed. The program was set to ignore items between angle brackets, <*>, with a tag span of 4000.
The discrepancy between the number of tokens in the texts and the tokens used for the wordlist in the table above represents the number of numerals that have been removed in processing.
The number of types or distinct words identified in the table is much larger than the number of actual distinct ‘real words’ of Māori, as the figures here include all the spelling variants, orthographic inconsistencies and digitisation errors.
The two columns present different counts, depending on whether the WordList tool in WordSmith Tools 5.0 is set to break words at hyphens or not.
The settings selected when running data through the tools in WordSmith affect the data output in various ways.
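As a rough check on the note about numerals, the difference between the two token rows of the full-collection table can be computed directly. The sketch below only restates the published figures; the dictionary keys and the function name are illustrative labels, not part of WordSmith Tools or of the corpus files.

```python
# Token counts copied from the "LMC full collection" table above.
# The key names are illustrative labels chosen for this sketch.
FULL_COLLECTION = {
    "hyphens separate words": {
        "tokens_in_text": 8_141_323,
        "tokens_for_word_list": 7_895_422,
    },
    "hyphens do not separate words": {
        "tokens_in_text": 8_091_256,
        "tokens_for_word_list": 7_847_380,
    },
}


def numerals_removed(counts):
    """Return tokens in text minus tokens kept for the word list."""
    return counts["tokens_in_text"] - counts["tokens_for_word_list"]


for setting, counts in FULL_COLLECTION.items():
    print(f"{setting}: {numerals_removed(counts):,} numerals removed")
```

On these figures the difference is a little under a quarter of a million tokens under either hyphen setting.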
The LMC Pre-1910 Corpus collection
The following table shows the size of the LMC Pre-1910 Corpus collection. These are the files now publicly available for research purposes. There are approximately 5.24 million words of running text in this collection. This represents 66.8% of the total corpus.
|LMC Pre-1910 Corpus collection||Hyphens separate words||Hyphens do NOT separate words|
|tokens (running words) in text||5,412,850||5,380,979|
|tokens used for word list||5,274,755||5,243,760|
|types (distinct words)||40,250||43,347|
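The 66.8% share quoted for the pre-1910 collection can be reproduced from the two tables. The source does not say which row the percentage was calculated from; the sketch below assumes it is the ‘tokens used for word list’ row, which gives 66.8%, and shows the ‘tokens in text’ row for comparison. The constant and function names are illustrative.

```python
# Figures copied from the two tables above (hyphens do NOT separate words).
PRE_1910 = {"tokens_in_text": 5_380_979, "tokens_for_word_list": 5_243_760}
FULL = {"tokens_in_text": 8_091_256, "tokens_for_word_list": 7_847_380}


def share_percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Assumption: the published share is based on the word-list token counts.
word_list_share = share_percent(
    PRE_1910["tokens_for_word_list"], FULL["tokens_for_word_list"]
)
running_text_share = share_percent(
    PRE_1910["tokens_in_text"], FULL["tokens_in_text"]
)

print(f"share of word-list tokens: {word_list_share}%")
print(f"share of running-text tokens: {running_text_share}%")
```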
The two sets of figures in the table above show that the number of tokens in a corpus, and the number of types or distinct words, depend on what the user decides counts as a word. For our data runs, we have used the setting in WordSmith Tools 5.0 that retains hyphens within words when counting. This means that the word type TAHI-MANO-WARU-RAU-WHITU-TEKAU-MA-IWA is counted as a single word type, and each occurrence of this string is counted as a single token. If the setting were changed to allow hyphens to break words, the string would instead be counted as eight different word types; the frequency of each of the component word types (tahi, mano, waru, rau, whitu, tekau, ma, iwa) would increase by one, and the total number of tokens in the corpus would increase accordingly.
Te Taura Whiri i te Reo Māori Orthographic Conventions set out conventions for the use of hyphens in contemporary texts, but the standards for the use of hyphens in written Māori have varied over time. Retaining hyphens when generating our figures better represents the words as written in the original texts. It is an easy matter to run the LMC through a program like WordSmith Tools 5.0 set to break words at hyphens if desired.
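The effect of the hyphen setting can be illustrated with a minimal sketch. This is not WordSmith's tokeniser: the two regular expressions and the `tokenize` function are assumptions made for illustration, and they treat any run of letters as a word.

```python
import re

# Hypothetical patterns for this sketch; WordSmith Tools applies its own rules.
# [^\W\d_] matches any letter, including vowels with macrons.
KEEP_HYPHENS = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
SPLIT_AT_HYPHENS = re.compile(r"[^\W\d_]+")


def tokenize(text, hyphens_break_words):
    """Return lower-cased word tokens under the chosen hyphen setting."""
    pattern = SPLIT_AT_HYPHENS if hyphens_break_words else KEEP_HYPHENS
    return [token.lower() for token in pattern.findall(text)]


sample = "TAHI-MANO-WARU-RAU-WHITU-TEKAU-MA-IWA"

kept = tokenize(sample, hyphens_break_words=False)
split = tokenize(sample, hyphens_break_words=True)

print(len(kept), kept)    # one token, one type
print(len(split), split)  # eight tokens, eight types
```

With hyphens retained the string contributes one token; with hyphens breaking words it contributes eight, which is why the token totals in the ‘Hyphens separate words’ column are higher.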
The LMC Pre-1910 Corpus collection folders
Two versions of the pre-1910 set of corpus texts are available: one set includes all the English words interspersed in the Māori text, and the other has had all the English items removed. All the files are presented in plain text (.txt) format.
The ‘English Removed’ set was used to generate the word lists and the tables showing the size of the corpus.
The ‘English Retained’ set is better suited to searching for concordance data, that is, examples of a target word in the contexts in which it appears in the corpus.
The ‘English Removed by Text Category’ set: the LMC Pre-1910 Corpus collection is also available in a series of folders which sort the 424 files according to their text category. The amount of text in each category is shown in the following table. More details of the text categories and their contents can be found on the Legal Māori Archive site. The six text categories are:
|LMC Pre-1910 Corpus||Files||tokens in text (total words, numerals)||tokens used in list (total words)||types (distinct words)||numbers removed|
There is mark-up present in all the files, as shown in the table below. The items between angle brackets form the mark-up. You will notice that the mark-up in the version that retains the English is more detailed than that in the Māori only version. Whether the mark-up is visible in data output from WordSmith Tools depends on the settings selected by the user. Even when mark-up is not obvious, for example in concordance data, it can be seen by viewing the files from which each concordance line is derived.
|Legal Māori Corpus Excerpt from 1907Kah showing mark-up|
|1907Kah.txt, English and mark-up retained||<p xml:lang="mi">1 He hoko (3-277) 17 o nga ra o Oketopa, 1907 Matarau Nama 2, Wahanga 3 Ruira Hauteepa ki a <span xml:lang="en">Agnes Lydia Williams</span>.</p> <p xml:lang="mi">2 He mokete (3-277) 14 o nga ra o Hepe-tema, 1907 Tahoka 2, 4, me 5, me Taruheru <span xml:lang="en">F, G, H</span>, me <span xml:lang="en">L</span> Mahaki Paraone raua ko Rawhiti Paraone ki a <span xml:lang="en">Christopher Pearson Davies</span>.</p>|
|1907KahM.txt, Māori only version: all English removed, mark-up retained||<p> 1 He hoko (3-277) 17 o nga ra o Oketopa, 1907 Matarau Nama 2, Wahanga 3 Ruira Hauteepa ki a . </p> <p> 2 He mokete (3-277) 14 o nga ra o Hepe-tema, 1907 Tahoka 2, 4, me 5, me Taruheru , me Mahaki Paraone raua ko Rawhiti Paraone ki a . </p>|
|Legal Māori Corpus Excerpt from 1907Kah with mark-up removed|
|1907Kah.txt, English retained, mark-up removed||1 He hoko (3-277) 17 o nga ra o Oketopa, 1907 Matarau Nama 2, Wahanga 3 Ruira Hauteepa ki a Agnes Lydia Williams. 2 He mokete (3-277) 14 o nga ra o Hepe-tema, 1907 Tahoka 2, 4, me 5, me Taruheru F, G, H, me L Mahaki Paraone raua ko Rawhiti Paraone ki a Christopher Pearson Davies.|
|1907KahM.txt, Māori only version: all English removed, all mark-up removed||1 He hoko (3-277) 17 o nga ra o Oketopa, 1907 Matarau Nama 2, Wahanga 3 Ruira Hauteepa ki a . 2 He mokete (3-277) 14 o nga ra o Hepe-tema, 1907 Tahoka 2, 4, me 5, me Taruheru me Mahaki Paraone raua ko Rawhiti Paraone ki a .|
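Researchers working outside WordSmith Tools may want to strip the mark-up, or the English spans, themselves. The following is a minimal sketch using the first line of the 1907Kah excerpt above. It assumes that English is always wrapped in `<span xml:lang="en">` elements, as in the excerpt; the function names are illustrative, and the project's own processing may have differed (for example in how it handles whitespace).

```python
import re

# Assumption: English items are wrapped in <span xml:lang="en">...</span>,
# as in the 1907Kah excerpt. Other files may use further mark-up.
ENGLISH_SPAN = re.compile(r'<span xml:lang="en">.*?</span>', re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def remove_english(text):
    """Delete English spans together with their contents."""
    return ENGLISH_SPAN.sub("", text)


def strip_markup(text):
    """Delete every tag between angle brackets, keeping the text inside."""
    return ANY_TAG.sub("", text)


line = (
    '<p xml:lang="mi">1 He hoko (3-277) 17 o nga ra o Oketopa, 1907 '
    "Matarau Nama 2, Wahanga 3 Ruira Hauteepa ki a "
    '<span xml:lang="en">Agnes Lydia Williams</span>.</p>'
)

english_retained = strip_markup(line)
maori_only = strip_markup(remove_english(line))

print(english_retained)
print(maori_only)
```

The English spans are removed before the tags are stripped, because once the tags are gone there is nothing left to mark which words are English.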
Features of the files: orthographic inconsistency and error
Users of the corpus collection will notice a degree of orthographic inconsistency in the way individual words are rendered. Some of this inconsistency is present in the original texts, and some has been introduced by the digitisation process, which included re-keying some texts and using OCR software for others.
Some of the inconsistency can be labelled as ‘acceptable variant forms’ and some is error. A fuller description of the inconsistencies will be available here shortly.
Accessing and acknowledging use of the Legal Māori Corpus Pre-1910 Corpus collection