Technology

Introduction

XML and TEI are the document mark-up standards which underpin the work of the NZETC. Information on TEI can be found through the Text Encoding Initiative. Other key technologies used at the NZETC include:

  • XTM. XTM (XML Topic Maps) is the framework into which the topics in our texts are harvested.
  • EATs. The Entity Authority Toolset (EATs) is the toolset that we use to create entities (or topics). EATs also allows us to express relationships between entities. At present we have five entity types: people, organisations, works, places and ships.
  • Apache Cocoon. Cocoon is the XML publishing framework that we use to publish this website.
  • Apache Tomcat. Tomcat is a Java servlet container that runs Cocoon.
  • Apache Solr. Solr is the platform we use to allow faceted searching of the NZETC collection.

More information is given below.

XML Topic Maps

Books, images, and collections are navigable through a dynamically-generated semantic framework, which represents the first release of a large-scale XML Topic Map (XTM) site in New Zealand. Users are able to move around the resources on the site tracking topics of interest rather than merely browsing the material linearly or through text searching. In a topic map, web-based resources are grouped around items called "topics", each of which represents some subject of interest.

Topics in a topic map are linked together with hyperlinks called "associations". There can be different types of association in a topic map, representing the different kinds of relationship in the real world. For instance, in the NZETC topic map, the topic which represents a particular person may be linked to a topic which represents a chapter of a book which mentions that person. This association is labelled to indicate that it represents a "mention". Similarly, the same person's topic might be linked to a particular photograph topic, via a "depiction" association.
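The topic and association model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the XTM format or the NZETC implementation; the class `TopicMap`, its methods, and the identifiers `person-1`, `chapter-3` and `photo-7` are all invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TopicMap:
    """A toy topic map: named topics plus typed, two-way associations."""
    topics: dict = field(default_factory=dict)        # topic id -> display name
    associations: list = field(default_factory=list)  # (type, topic id, topic id)

    def add_topic(self, topic_id, name):
        self.topics[topic_id] = name

    def associate(self, assoc_type, a, b):
        # Both ends must already exist as topics.
        if a not in self.topics or b not in self.topics:
            raise KeyError("both topics must exist before they are associated")
        self.associations.append((assoc_type, a, b))

    def related(self, topic_id):
        """Group the topics linked to topic_id by association type."""
        result = defaultdict(list)
        for assoc_type, a, b in self.associations:
            if a == topic_id:
                result[assoc_type].append(b)
            elif b == topic_id:
                result[assoc_type].append(a)
        return dict(result)


# Hypothetical topics: a person, a chapter that mentions them, a photo of them.
tm = TopicMap()
tm.add_topic("person-1", "A person")
tm.add_topic("chapter-3", "Chapter 3 of a book")
tm.add_topic("photo-7", "A photograph")
tm.associate("mention", "person-1", "chapter-3")
tm.associate("depiction", "person-1", "photo-7")

person_links = tm.related("person-1")
```

Because associations are stored once and read from either end, a page for the photograph can list the person it depicts just as easily as the person's page can list the photograph.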

To construct our topic map, we use XSLT stylesheets to extract metadata from each of our XML text files and express it in the XTM format. In this way we automatically create hundreds of topic maps that describe our texts. We also harvest information about the entities contained in EATs. Finally, we merge the harvested topic maps together to create a unified topic map which describes our entire website.
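The harvest-then-merge workflow can be sketched as follows. This is a stand-in under several assumptions: Python's `xml.etree.ElementTree` is used in place of XSLT, plain dictionaries are used in place of XTM, and the merge rule (topics with the same identifier collapse into one) is a simplification of what a topic map engine does. The sample documents, the use of `persName` with a `ref` attribute to mark people, and the function names `harvest` and `merge` are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Two invented TEI documents that mention the same (hypothetical) person.
SAMPLE_A = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>First Text</title></titleStmt></fileDesc></teiHeader>
  <text><body><p><persName ref="#person-1">A. Person</persName> arrived.</p></body></text>
</TEI>"""

SAMPLE_B = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Second Text</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>A letter from <persName ref="#person-1">Person, A.</persName></p></body></text>
</TEI>"""


def harvest(text_id, tei_xml):
    """Build a small topic map (topics + associations) from one TEI document."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI)
    topics = {text_id: title}
    associations = set()
    for pers in root.iterfind(".//tei:persName[@ref]", TEI):
        person_id = pers.get("ref").lstrip("#")
        topics[person_id] = "".join(pers.itertext()).strip()
        associations.add(("mention", person_id, text_id))
    return {"topics": topics, "associations": associations}


def merge(*topic_maps):
    """Combine per-text topic maps into one unified map."""
    merged = {"topics": {}, "associations": set()}
    for tm in topic_maps:
        for topic_id, name in tm["topics"].items():
            # Topics sharing an identifier collapse into one; the first name seen wins.
            merged["topics"].setdefault(topic_id, name)
        merged["associations"] |= tm["associations"]
    return merged


merged = merge(harvest("text-a", SAMPLE_A), harvest("text-b", SAMPLE_B))
```

After the merge, the single `person-1` topic carries a "mention" association to each text, which is what lets one page for that person link to everything harvested about them.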

Each page on the website represents one of these topics, along with any associated topics. We use the open source TM4J Topic Map engine for merging and querying our topic map.

The Topic Map framework for the NZETC website was presented at the launch of the new information architecture on 5 May 2005. PowerPoint slides from the presentation are available. Papers on the NZETC technical infrastructure are available through the Victoria University ResearchArchive.

Apache Cocoon and Tomcat

We use an XML publishing framework called Apache Cocoon to publish the NZETC website.

Cocoon is a Java servlet, so it can be deployed on a wide variety of systems. We run Cocoon inside the Apache Tomcat servlet container (the official reference implementation of the Java Servlet specification), using JVM version 1.6 from Sun Microsystems.

Cocoon offers a flexible environment based on the separation of concerns between content, logic and style. Cocoon can deliver documents in a variety of formats, including HTML, PDF, RTF, SVG, JPEG and PNG, as well as any XML-based format. We use Cocoon to transform our XML texts into readable documents using XSLT stylesheets.

Cocoon can perform these transformations on demand, that is, when a request is received from a web browser. Each request is handled by reading the appropriate XML document or documents and processing the XML data in a succession of stages, applying logical transformations first and presentational transformations second. Each stage is distinct and can be managed by a different person. Our web designer can edit the look of the site, the web developer can edit the structure of the site, and the text editors can edit the content of the site (the e-texts), all independently of each other.
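The staged processing described above can be illustrated with a small Python sketch. It does not use Cocoon or XSLT; each stage is an ordinary function over an XML tree, and the source document, element names (`page`, `title`, `para`) and function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# An invented source text standing in for one of the e-texts.
SOURCE = "<text><head>Chapter 1</head><p>First paragraph.</p><p>Second paragraph.</p></text>"


def generate(xml_string):
    """Read the source document into a tree (the content)."""
    return ET.fromstring(xml_string)


def logical_stage(doc):
    """Decide what goes on the page: a title plus the paragraphs."""
    page = ET.Element("page")
    ET.SubElement(page, "title").text = doc.findtext("head")
    for p in doc.iterfind("p"):
        ET.SubElement(page, "para").text = p.text
    return page


def presentational_stage(page):
    """Decide how it looks: map the logical page onto HTML elements."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = page.findtext("title")
    for para in page.iterfind("para"):
        ET.SubElement(body, "p").text = para.text
    return html


def serialize(html):
    """Turn the final tree into the text sent to the browser."""
    return ET.tostring(html, encoding="unicode", method="html")


def handle_request(xml_string, stages=(logical_stage, presentational_stage)):
    """Run the document through each stage in order, then serialize it."""
    doc = generate(xml_string)
    for stage in stages:
        doc = stage(doc)
    return serialize(doc)


output = handle_request(SOURCE)
```

Since each stage only sees the output of the one before it, changing `presentational_stage` alters the look of every page without touching the source text or the logical structure.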

Apache Solr

We use Solr for faceted searching of our collection. Solr is a Java-based search engine and runs inside our Tomcat servlet container.
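As a rough illustration of what a faceted search request looks like, the sketch below builds a Solr query URL using Solr's standard `q`, `rows`, `wt`, `facet`, `facet.field` and `facet.mincount` parameters. The host, port, path and the field names `entity_type` and `author` are assumptions made for the example and do not describe the NZETC index.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that asks for facet counts on the given fields."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),          # ask for a JSON response
        ("facet", "true"),       # switch faceting on
        ("facet.mincount", 1),   # leave out facet values with no matches
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and field names.
url = facet_query_url(
    "http://localhost:8080/solr/select",
    "ship",
    ["entity_type", "author"],
)
```

The response to such a request contains, alongside the matching documents, a count of matches for each value of each facet field, which a search page can present as filters for narrowing the results.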