The Anglo-Norman On-Line Hub

What is XML and what use is it?  

Some answers from a Humanities perspective
 


This piece, originally intended for an undergraduate audience, dates from the year 2000. Given the pace at which
XML technologies advance, the six years that have since gone by are a very long time, and aside from a few minor corrections and additions, I have not tried to bring the material up to date. In the intervening time, XML has stopped being merely a buzzword and become a core technology wherever digitised text is created, maintained and exchanged. But for all that, many Humanities scholars whose research involves digitised materials have embraced XML only to the extent of regarding it as an incantation that they have to put on their grant applications: after that they leave everything to IT advisers who may not properly grasp the scholarly needs of the projects they are attempting to support (and why indeed should they?). There is vastly more use of XML in mainstream Humanities research projects than when this item was first written, but the full potential of the growing family of XML technologies for aiding and expanding such projects is still insufficiently grasped by too many scholars. For as long as that remains so, there may be some point in keeping this elementary overview available.

Michael Beddow


It's probably superfluous to say that XML stands for eXtensible Markup Language, though the information is worth repeating as a way in to our topic. We'll start with Markup,  then move on to explore what sort of Language XML is, examining the eXtensibility that makes it crucially different from HTML, the language in which most WWW pages are currently written. The overriding purpose is to explain how that extensibility works, and why it's such a Good Thing for information handling in general and Humanities scholarship in particular.

The origins of Markup

Markup in its earliest forms was designed to regulate the process of going from an author's manuscript (hand or typewritten) to the printed page. Most academics regard that sort of thing as the business of copy-editors and compositors, the vital but mundane stuff publishers look after on an author's behalf. But in recent years, markup has acquired new possibilities and functions which bring it into the core domain of any scholarly activity that studies or creates texts. Scholars who remain reluctant to tangle with markup are depriving themselves of an important tool for carrying out and disseminating their work. They are also depriving themselves of the ability to do their own publishing and so either earn more revenue themselves from their work or, better still, make it freely available to the taxpayers who probably funded it in the first place.1

The notion of markup is much older than computers. It came about as a result of the introduction of printing via movable type in the fifteenth century. Before then, texts were individually produced in hand-written single copies. What the scribe wrote was what the reader read. But when the hand-written (or, later, type-written) text was handed over to a printer to be set in type for production and distribution to readers in printed form, control over the appearance of the finished product was handed over at the same time. So writers began adding to their actual text instructions to the printer as to how they wanted it to look on the page. As publishing developed into an industry in itself, placing an additional layer between writers and their readers, this marking up of texts was taken over by publishers’ editors. It grew increasingly formalised and sophisticated, able to specify very precisely all sorts of things that were not apparent from the text itself, and which were now regarded as the domain of the publisher rather than the author or the printer, such as what typeface to use, what size of type was required where, how big the margins should be, where and in what form the page numbers should appear and so on.


Machines and markup: the concept of tagging

Originally, this markup was read by the printer (who in the early days of book production was also the publisher and often the bookseller as well), who then manually set up the type in the required way. Then in the 1960s, long before computers were commonly used elsewhere for processing text, the printing industry began to use computer-controlled typesetting machines. It was now these machines that read and acted upon the markup in order to set the type, and so markup had to be adapted into a form that computers could easily read.

A human reader looking at a typescript sees a piece of paper bearing words that represent the text, with spaces between the lines and margins at the edges. Markup information was traditionally written by hand, generally in a distinctive colour of ink, in these spaces and margins. There the human compositor could see and act on it, without any danger of getting it mixed up with the text itself. But a computer doesn’t see text as words on paper. It just sees a continuous stream of digits in which letters of the alphabet and other textual information are encoded, and it has to be told which of these digits represent text and which of them are markup information. Various ways of doing this have been tried, but the one now most widely adopted involves "tagging" the text.


The essence of "tagging" is that sequences of characters containing markup information are separated off from the text itself by being placed between a pair of special characters. The special characters most commonly used to identify tags are the opening and closing angle brackets < and >. When the computer reading the text comes to a < it realises that what follows is markup, not the text itself, and it treats everything it reads from that point on as markup information, until it sees a >. At that point it knows the markup information has ended and that what follows is again normal text. To take an artificial example just to illustrate the principle (real typesetter codes are less perspicuous), if the typesetting computer finds

This is an <italics on>important<italics off> concept.

it will render "This is an" into typeset letters for transfer to the printed page, but when it sees the < it will look ahead to find the next >. Then it takes whatever comes between the < and the > as an instruction and acts on it, instead of treating it as text to be printed in its own right. Here the typesetting machine would switch to italic type and carry on rendering the word "important" in italics until it hits the next <, when it would read the instruction to turn off italics and act on it, and so on.
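The scanning procedure just described can be sketched in a few lines of Python. This is only an illustration of the principle, not a model of any real typesetting system; the function name split_markup and the labels "text" and "tag" are invented for the example.

```python
def split_markup(stream):
    """Split a character stream into ("text", ...) and ("tag", ...) pieces."""
    pieces = []
    position = 0
    while position < len(stream):
        start = stream.find("<", position)
        if start == -1:
            # No further "<": the rest of the stream is ordinary text.
            pieces.append(("text", stream[position:]))
            break
        if start > position:
            # Everything before the "<" is text to be printed.
            pieces.append(("text", stream[position:start]))
        end = stream.find(">", start)
        if end == -1:
            raise ValueError("tag opened with '<' but never closed with '>'")
        # Whatever lies between "<" and ">" is an instruction, not text.
        pieces.append(("tag", stream[start + 1:end]))
        position = end + 1
    return pieces


pieces = split_markup("This is an <italics on>important<italics off> concept.")
for kind, content in pieces:
    print(kind, repr(content))
```

A typesetting program built on this would print the "text" pieces and act on the "tag" pieces, here switching italics on before "important" and off again after it.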


This example illustrates an important advantage of markup using tags. Although it is designed to be understandable by machines, it can also (if the tag names are well-chosen) be fairly easy for human readers to see and understand (and therefore write and check). It represents the best compromise that has so far been found between the different ways computers on the one hand and humans on the other go about interpreting textual information.

 
'Presentational' versus 'structural' markup

As people became used to marking up texts via such tags for processing by typesetting computers, they reflected on the fact that the visual ("presentational") features they were telling the typesetting machine to include were there because they reflected aspects of the text’s structure. This led to the idea that instead of the tags telling the typesetting machine directly what to do ("procedural markup"), they could just as easily say what the structural role of the tagged text is ("structural markup"), and that this would be an important gain for anyone who wanted to use computers for more than simply converting text into type. This was the crucial thing that moved markup from the domain of the publisher's back-room into the forefront of scholarly activity. This is an essential point, and since many academics seem so far to have failed to grasp its implications for their work, it is worth belabouring it by a more extended, though still somewhat artificially simplified, example.

Let's imagine we have a typescript like this.

 
Tedium   
    
A novel 

by

A.N. Other 
      
Chapter 1 

A Day at the Zoo 
    
It was a day like any other. Except it was not. It was strangely 
different.
"Is the sky always green in Worthing?" asked Ronona.
"I’m not too sure, dear," her mother replied, wondering secretly
whether Nigel had remembered to grease the mangle.

[...]

      

Chapter 2 
 
What happened next 

For a long time, nobody said anything.
Then the lights went back on.

[...]
             


The publisher’s editor, possibly in consultation with a designer, may decide the printed version ought to look like this:



TEDIUM
A novel

by A.N. Other

Chapter 1

A Day at the Zoo


It was a day like any other. Except it was not. It was strangely different. “Is the sky always green in Worthing?” asked Ronona.

“I’m not too sure, dear,” her mother replied, wondering secretly whether Nigel had remembered to grease the mangle.[...]


Chapter 2

What happened next


For a long time, nobody said anything.
Then the lights went back on. [...]

 


Let’s see first how such a text might be prepared for a typesetting machine using the older style of "procedural" or "presentational" markup. The compositor keys in the author’s original text, and inserts tags that control how it will look when printed, following hand-written markup indicators placed on the typescript by the publisher's editor. (Note that here we introduce the convention that a tag beginning with a slash, like </this>, "turns off" the effect of a tag like <this>. In a real-world example the markup would be more complex and use terser tag names.)


<Times-Roman>
<style = "bold" size = "16pt" case="allcaps">
Tedium</style>
<style = "italic" size = "12pt">
A novel</style>
<style = "normal">
by</style>
<style = "bold" size = "12pt">
A.N. Other</style>
<style = "bold" size = "14pt">Chapter 1</style>
<style = "bold" size = "12pt" >A Day at the Zoo</style>
<style = "normal" size = "11pt">
It was a day like any other. Except it was not. It was strangely different.
"Is the sky always green in Worthing?" asked Ronona.
"I’m not too sure, dear," her mother replied, wondering secretly whether Nigel had remembered to grease the mangle.
</style>

[...]

<style = "bold" size = "14pt">Chapter 2</style>
<style = "bold" size ="12pt">
What Happened Next</style>
<style = "normal" size ="11pt">
For a long time, nobody said anything.
Then the lights went back on.
</style>

[...]

</Times-Roman>
 


You will see that the markup here consists of tags that specify what typeface is to be used, and in what style and size. The typesetting computer will obey these (and other, similar) instructions and produce the desired appearance of the text on the printed page. But to a human reader, the tags provide no information other than a lot of facts about the technicalities of printing, and it is easy to see why authors, scholarly or not, would regard this type of markup as none of their business and simply not want to know about it.
But if we pause for a moment and ask ourselves why "Tedium" is printed in 16 point bold capital letters and centered, we know that it is because it is the title of the novel. Similarly, we know that "A Day at the Zoo" is printed in 12 point bold type and centered because it is the title of a chapter. The main body text, by contrast, is in a smaller, plain type, left-aligned and (probably) right-justified, precisely because it is body text. So these typographical differences are actually markers that signal to the reader where the different parts of the text are and what respective roles they play: they indicate, in other words, the text's structure in a way that points towards its meaning, matters that are of obvious authorial and possibly also scholarly concern. So we could, as an alternative, mark up the text with a quite different set of tags which were designed to make its structure explicit, like this:


<book>
  <title>Tedium</title>
  <genre>A novel</genre>
  <byline>by <authorname>A.N. Other</authorname></byline>
  <chapter number="1">
    <heading>Chapter 1</heading>
    <title>A Day at the Zoo</title>
    <para>It was a day like any other. Except it was not. It was strangely different.</para>
    <para><quote>Is the sky always green in Worthing?</quote> asked Ronona.</para>
    <para><quote>I’m not too sure, dear</quote>, her mother replied, wondering secretly whether Nigel had remembered to grease the mangle.</para>

    [...]

  </chapter>
  <chapter number="2">
    <heading>Chapter 2</heading>
    <title>What Happened Next</title>
    <para>For a long time, nobody said anything.</para>
    <para>Then the lights went back on.</para>

    [...]

  </chapter>

  [...]

</book>

 


Linking structure to presentation: style sheets

So we've now told the typesetting computer what the structural significance of the various parts of the texts is, though apparently at the cost of depriving it of what it originally needed to know, namely how we want the text to be laid out and rendered. But re-introducing such rendering instructions while preserving all the structural information can be done very simply. As well as giving the computer the marked-up text, we supply it with a set of instructions about how to render each of the components that the markup identifies. These instructions, mapping structural features to presentational ones, are generally known as a "style sheet".
Our style sheet might include entries like

title = Times-Roman 16 point bold all caps centered
para = Times-Roman 11/13 point normal align-left right-justify

and so on. Equipped with such a style sheet, our typesetting computer can now take the structural markup and produce exactly the same typography as it did from the earlier, presentational markup. It reads the name of each structural tag, looks up the desired presentational style in the style sheet, and applies the style to the text between the opening and closing tags in question.
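As a rough sketch of that lookup, the following Python fragment treats the style sheet as a simple table from element names to rendering instructions. The names STYLE_SHEET, DOCUMENT and render are invented for the illustration, the document is a cut-down version of the example above, and no real style sheet language works quite this simply.

```python
import xml.etree.ElementTree as ET

# Hypothetical style sheet: element name -> presentational instructions.
STYLE_SHEET = {
    "title": "Times-Roman 16 point bold all caps centered",
    "para": "Times-Roman 11/13 point normal align-left right-justify",
}

# Cut-down structural markup, with no presentational information in it.
DOCUMENT = """<book>
  <title>Tedium</title>
  <para>It was a day like any other.</para>
</book>"""


def render(xml_text, style_sheet):
    """Pair each styled element's text with the style looked up for its name."""
    instructions = []
    for element in ET.fromstring(xml_text).iter():
        style = style_sheet.get(element.tag)
        if style is not None:
            instructions.append((style, (element.text or "").strip()))
    return instructions


for style, text in render(DOCUMENT, STYLE_SHEET):
    print(f"[{style}] {text}")
```

Passing a different table to render changes every instruction it produces, while DOCUMENT itself stays exactly as it was.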

 
Separating structure and presentation: the benefits

What advantage does that have? First, purely from the publisher's point of view, separating structural information from direct typesetting instructions means it is very easy to change the appearance of this text, and any like it. Imagine this was one of a large series of novels, and the publishers decide that for a new edition they want a completely new look, with different layout and typefaces. With the earlier, procedural type of markup, a huge number of tags would need to be altered in the files that are used to generate each book in the series, a time-consuming and error-prone operation. With structural markup, however, all that would need altering is the one style sheet that translates structure into typesetting instructions. The actual files used to produce the books themselves could remain untouched.
But there is a more important advantage to structural markup, of relatively little interest to publishers but of vital significance to scholars who want to use computers, not merely to swell the flood of information available, but to master that flood and open up new dimensions of access and understanding to their areas of study.  Structural markup allows a computer to "know" what the various parts of the text are, as well as how we want them to look. And it allows us to make completely explicit to ourselves, as well as to our computers, what the components of our text are and in what relationship they stand to one another. As soon as we start to use tag notation to mark up structure, we have given ourselves a new tool, with both descriptive and analytical uses, known as an "element".  An element consists of an opening and a closing tag, along with everything that comes in between (which may be just text, or also include "child" elements). If we make it a rule (as XML does) that a document must have one and one only "root"  element within which the entire content of the document is contained, we are ensuring that our document will be modelled as a single outer container that holds its entire content in smaller included containers ("elements") that have names and relationships to one another which reflect their role and significance within the document.
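The container model can be made visible with a short Python sketch using the standard library's xml.etree.ElementTree module. The document is a cut-down version of the earlier example, and the function name outline is invented for the illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<book>
  <title>Tedium</title>
  <chapter number="1">
    <heading>Chapter 1</heading>
    <title>A Day at the Zoo</title>
    <para>It was a day like any other.</para>
  </chapter>
</book>"""


def outline(element, depth=0):
    """Return one indented line per element, children nested under parents."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
print("\n".join(outline(root)))

# A document with two outermost elements breaks the one-root rule,
# so the parser refuses it.
try:
    ET.fromstring("<book/><book/>")
    two_roots_accepted = True
except ET.ParseError:
    two_roots_accepted = False
print("two roots accepted:", two_roots_accepted)
```

The indented listing shows book as the single outer container, with chapter nested inside it and heading, title and para nested inside chapter.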

Human readers looking at a book draw on a complex, though generally unarticulated, repertoire of cultural knowledge that tells them which parts are its title, its chapter headings, its paragraphs, its footnotes and its index. But the computer, unless we give it extra help, sees only the stream of digits representing symbols and white space. With procedural markup alone, all the computer can "know" is what size and style of type we want and where we wish to put it on the page, which means it is powerless to help us analyse the text in any useful way. Once we supply structural markup, however, the computer can recognise, say, chapter titles for what they are by examining the element names in the opening and closing tags. So if we ask the computer to create a table of contents for us, it can scan the elements, identify the chapter headings and create a list of contents automatically. Or if we ask which words the author has used in which chapters and how often, the computer can tell us, and even manage to leave words within headings out of consideration so as not to falsify the tally. If you compare the two examples of markup above, you will notice a number of other useful pieces of information about the text that are visible in the element names but absent altogether from the presentational tagging, even in this very simple markup scheme. Of course if we had the time, motivation and resources to tag every single word with its part of speech, as has been done for the 4,000-odd texts in the British National Corpus (resulting in over 1.6 million tags), the amount of linguistic information the computer could extract for us would be enormous. The "meta-information" we add to the text need not be linguistic, however.
We can mark up our documents so that editorial variants, manuscript abbreviations or lacunae, and commentary of whatever nature or complexity we find appropriate, are embedded in our documents in a form that computers can recognise for what it is and handle appropriately to our needs.
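Both tasks mentioned above, building a table of contents and counting words chapter by chapter while ignoring headings, can be sketched in Python. The document is a shortened version of the "Tedium" example. The function names are invented for the illustration, and the word pattern is a deliberately crude assumption that would need refining for real texts.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOCUMENT = """<book>
  <title>Tedium</title>
  <chapter number="1">
    <heading>Chapter 1</heading>
    <title>A Day at the Zoo</title>
    <para>It was a day like any other. Except it was not.</para>
  </chapter>
  <chapter number="2">
    <heading>Chapter 2</heading>
    <title>What Happened Next</title>
    <para>For a long time, nobody said anything.</para>
    <para>Then the lights went back on.</para>
  </chapter>
</book>"""

book = ET.fromstring(DOCUMENT)


def table_of_contents(book):
    """List (chapter number, chapter title) pairs in document order."""
    return [(chapter.get("number"), chapter.findtext("title"))
            for chapter in book.findall("chapter")]


def word_counts(chapter):
    """Count words in the chapter's paragraphs only, skipping headings and titles."""
    counts = Counter()
    for para in chapter.findall("para"):
        # itertext() also collects text inside child elements such as <quote>.
        text = "".join(para.itertext()).lower()
        counts.update(re.findall(r"[a-z']+", text))
    return counts


print(table_of_contents(book))
first_chapter = word_counts(book.find("chapter"))
print(first_chapter["day"], first_chapter["zoo"])
```

Because the count looks only inside para elements, "zoo" from the chapter title never enters the tally for the first chapter. With presentational markup alone the program would have no dependable way of telling a title from body text.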

 


Markup and information overload

Structural markup began to displace presentational markup in the publishing industry in the course of the 1970s, though the source of its attraction to publishers was mainly that it allowed them to change typesetting hardware or software without needing to alter the markup of existing files or train their editorial staff in new markup command systems. All that needed altering when new equipment was installed were the style sheets that mapped structural markup to system commands. But for a long time, the broader potential of structural markup went, if not wholly unnoticed, then at least largely unexploited.
Even before the Internet was devised, and later boosted by the WWW into explosive growth, computers were allowing more and more text to be created and stored and accessed, yet at the same time they were making it next to impossible for human beings to keep track of all the material that was being produced, let alone locate what they wanted to know. "Information overload" had arrived. People realised that the only way to master the deluge of information that computers were delivering was to enable computers to manage data intelligently, as well as just store and deliver it. To do that computers had to "know" more about the text they captured, stored and transmitted. They had to be made able to search and index that text in a way that was informed by "meta information" about its origins, status, scope, content and context. And beyond that, it was important to make sure that information and knowledge, once freed from the physical restrictions of older media into digital form, should not then be imprisoned again inside one particular computer system or software package, but should be accessible to anyone who needed it, and was able to locate it, anywhere on earth and at any time, present or future. Markup inevitably became, not just a technique used by printers and publishers, but the key to the efficient storage, location, retrieval, analysis and evaluation of information from the endlessly growing mass of machine-readable text. This was only made possible because of groundwork performed by the inventors of SGML, the Standard Generalised Markup Language, and the painstaking and ingenious efforts to harness that language to scholarly purposes performed by the TEI, the Text Encoding Initiative.
But what turned that groundwork into an information revolution with massive implications for scholarship was the World Wide Web, which brought home to a global constituency the power and potential of markup via tags and the kinds of access to and exchange of data which it made possible. 

 


Markup and data interchange

In the 1950s, when businesses first began storing information on computers, no-one imagined there was any need to worry about whether one computer could read another’s data. (When IBM planned their first commercial mainframes, they thought the total world demand would be for around half a dozen at the most.) Nor did anybody think about the need to store text in any language other than US English. The consequences of what now seems, with hindsight, an early lack of imagination are with us still. Most people agree it would be a good thing if text in any human language could be stored on any computer and be straightforwardly transferred to and read on any other, including computers yet to be designed and built. The reality is that different manufacturers and software houses have developed incompatible ways of storing and transferring data, and until very recently they thought they had a vested interest in keeping things that way: if your precious files can only be read using the particular program or computer you used to create them, you will think twice before changing to a competitor’s product, even if it seems to be better value or more efficient. What forced industry leaders into taking data compatibility and transferability seriously in practice as well as rhetoric was the emergence of the World Wide Web.
The WWW grew out of the wish of a group of scientists to have a simple way of seamlessly sharing textual information (graphics and sound came only later) which resided on a number of different computers having little in common apart from all being connected to the Internet. To do this they had to work out ways of representing text that made sense to a variety of computers and which could be transmitted without loss or corruption across all sorts of network transport systems. Their solution had two components. Drawing on SGML, they devised HTML, a HyperText Markup Language, which defined a set of elements and rules for using them within a document that allowed any sort of computer (and any sufficiently informed human being) to understand the basic structure of those documents and be able to retrieve other documents, possibly at other locations, to which they referred via "hyperlinks". And building upon the core TCP/IP protocols that drive the Internet, they devised HTTP, a HyperText Transfer Protocol that defined a communications standard, applicable on all platforms, that allowed such links to be followed across interconnected networks to fetch the documents concerned, even where they were being transferred between computers of very different type and crossing intervening networks that used all sorts of local standards.

The success of the WWW showed users that the incompatibilities between computer systems which manufacturers had often claimed were inevitable, or conducive to competitive innovation, or both,  were in fact needless barriers that given the will could be relatively easily overcome to the general benefit: and it showed hardware and software manufacturers that data interchangeability, far from damaging their commercial interests, actually fostered a huge increase in the use of computers and so in the sales of their products.

 


Markup and extensibility

As the WWW began to catch on, people outside the immediate circle who first developed it wanted their documents to contain more than plain text with simple links. They wanted multiple fonts, complex layouts, colour, sound and animations and even full-motion TV-quality video. And various firms set about making their own proprietary modifications to HTML to allow these features to be incorporated. The result was that, for a time, the WWW looked like fragmenting again into mutually incompatible groups, destroying the vision of universal data interchange it had so recently called into being. The desire to avoid this happening, without putting obstacles in the way of further innovation and evolution of the WWW’s capabilities, led to the creation of XML and the widespread will to adopt it as the basis for the Web in succession to HTML.

 
XML versus HTML: language and  meta-language

HTML is purely and simply a language for marking up documents so that they can become part of the World Wide Web: to achieve that, it offers a combination of a defined vocabulary with a specific syntax. Both vocabulary and syntax are laid down by an authoritative, though hardly authoritarian, body, the World Wide Web Consortium (W3C), in freely available documents. To use HTML for its specified purpose (and it has no other) you have to learn its vocabulary and respect its syntax rules (or else employ a "user friendly" system that encapsulates the requisite knowledge and makes sure you obey the rules, even if it frees you from the need to be fully aware of what they are). Hence the problems that arose when people wanted to enhance the appearance of their Web sites and expand their offerings by expressing things in HTML for which the existing language lacked either the vocabulary or the grammar. Because HTML has a predetermined set of element names with predefined meanings which can only be used in sequences or combinations specified in the language standard, if you want to do something which the language doesn't currently allow for you have to opt for one of three possibilities, all equally undesirable and none of them in any case feasible unless you have the resources of AOL/Netscape, Microsoft or the like behind you:

  • lobby other interested parties and then the W3C standards committees to accept new elements or attributes of your devising into the language, a tiresome and expensive bureaucratic process which takes so long that even if you do eventually get your suggestions added to the HTML standard, you’ve probably long since forgotten what you wanted to use them for;
  • misuse existing elements or attributes by deciding to make them do something different from what they are meant to and distributing a browser that treats them according to the new meaning you have given them: this makes your documents behave more or less oddly when viewed using a browser that interprets them in the originally intended way;
  • invent new elements or attributes of your own and distribute a browser that knows about your inventions, meaning that your documents will be unreadable in other browsers.

Now XML, like HTML, is a language, and one specified by the very same W3C at that. But unlike HTML, it is not in itself a language for marking up documents to achieve a pre-defined purpose. Rather (like SGML itself) it is a language in which other markup languages can be defined, or in more jargon-intensive terms, it is a meta-language. XML has its own, very strict and firmly defined, rules, just as HTML has. But provided you obey those rules, you can use XML to define your own vocabulary and (within certain limits) your own syntax to suit your own purpose, whatever that may be, so you can use it to create element and attribute names, and rules for how and where they may occur in your documents, which do exactly what you want in the way you want to. Provided you stick to the ground rules, you can make your element names be and mean what you like and determine as strictly or as permissively as you wish the combinations, sequences and contexts in which they may be used.

This decisive difference between a fixed language and a meta-language that enables extensible user-specified languages to be created in a systematic way often gets lost, or surfaces in a way that confuses rather than enlightens people, because XML and HTML can apparently "look the same". Suppose we have two documents, one marked up in HTML, the other in XML. Both might contain the line

<title>King of England</title>

If you find that in an HTML document which obeys the rules of that language, you can immediately know quite a lot about what the <title> element means and what its place is in the overall pattern of the document. It designates the title of an HTML page, that is to say, the text most browsers show in the outer border of their window, or which a search engine may use to label a page that matches a query. You also know that there should be only one part of the document tagged in this way, since the HTML rules allow for only a single <title> element per document; and you would know something about the location of the line within the document, because the HTML syntax rules specify that the element named <title>, if present at all, must be contained within the <head> element of an HTML document, not within its <body>. So, if someone misunderstood the meaning of the <title> element in HTML and tried to place it in the main part of their page to indicate a title that was supposed to display prominently at the head of their body text, they would not see what they expected (unless their browser was very lax indeed in its interpretation of the HTML rules). 

But if exactly the same line occurs in an XML file, none of that necessarily applies. With nothing but the line alone to go on, there is no way of saying what <title> means or what it tells us about the nature or function of the text it encloses. It could indeed be meant to play the same part, possibly even with the same context constraints, as its HTML namesake. But it could with equal likelihood indicate the title of, say, the biography of a monarch in a bibliographical listing; or it could be designating, not the title of either a page or a book, but one of a number of offices held by a given person, so that we might find it along with <title>Elector of Hanover</title> in a text dealing with King George I. That doesn't of course signify that an element named  title can "mean anything at all" in XML: it merely emphasises that what it means (and where it can and cannot occur) in a particular XML document  is a matter that was decided by whoever devised the specific "application" of XML which the document embodies and specified its "vocabulary" 2. To understand the meaning and role of the element, we have to know more about the "application" of XML that the document exemplifies.
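A small Python sketch may help to show how much depends on context. Both documents below are invented for the illustration (the vocabularies, like the names BIBLIOGRAPHY, OFFICES and titles_with_context, are assumptions and belong to no real XML application). The same title line occurs in each, but the element that contains it tells us how to read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary 1: a bibliographical listing.
BIBLIOGRAPHY = """<bibliography>
  <book><title>King of England</title></book>
</bibliography>"""

# Hypothetical vocabulary 2: offices held by a person.
OFFICES = """<person name="George I">
  <title>King of England</title>
  <title>Elector of Hanover</title>
</person>"""


def titles_with_context(xml_text):
    """Return (parent element name, title text) for every title element."""
    root = ET.fromstring(xml_text)
    found = []
    for parent in root.iter():
        for child in parent:
            if child.tag == "title":
                found.append((parent.tag, child.text))
    return found


print(titles_with_context(BIBLIOGRAPHY))
print(titles_with_context(OFFICES))
```

In the first document the title belongs to a book, in the second to a person. Nothing in the element name itself distinguishes the two uses, and only knowledge of the vocabulary settles which is meant.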

 
Explaining and constraining XML markup: the DTD

It is perfectly possible, with short documents using a simple markup scheme, to rely on the XML markup itself to make its own meaning and structural principles plain. Supposing the immediate context of the line we considered earlier was something like this:

 
<book ISBN="1-234-56789-2">
  <title>King of England</title>
  <subtitle>A Monarch's Destiny</subtitle>
  <author>Geraint Marvin</author>
  <publisher>Royalist Books</publisher>
  <price currency="USD">39.90</price>
</book>
 


it would be rather churlish to claim that the designer of this particular XML vocabulary had left the meaning of the element names unclear, or to affect puzzlement at why the other elements were all contained within the element named book. This is the informal sense of what is meant when XML data is termed "self-describing", though more rigorous definitions of that term are complex and in places problematic.

But the freedom which XML gives to devise and name elements as we wish can lead to another hazard, alongside possible unclarity of meaning: it can allow us to be inconsistent in our own use of our chosen vocabulary, creating problems when we want to access our information and exchange it with others. The ideal here is to exploit the power which XML's flexibility gives us (letting us mark up our texts to suit our perceptions of their structure and meaning) while also keeping ourselves in line and our data in order (by ensuring that having once defined a set of element names and rules for their use, we subsequently employ them consistently and correctly in terms of the principles we have ourselves laid down).  The technique (currently at any rate) for constraining ourselves within such self-imposed bounds is the use of a DTD, or Document Type Definition.

A DTD defines in a rigorous and abstract way the language we have designed or chosen to express the meaning and structure of our documents. In a strict notation (unfortunately rather hard for the uninitiated to understand or write), it specifies what our elements are called, what attributes they can or must have (there are examples of attributes above, where ISBN is an attribute of the element book and currency is an attribute of the element price, and each of them has been assigned a specific value) and where, in what numbers and in what order, they are allowed to appear. Having marked up our document, we can then run a program called a "validating parser" which will read the DTD and check that our document conforms to it in every particular. If it does, we say that our document is "valid in terms of its DTD" or more simply, but rather misleadingly, "valid". (Misleadingly, because a document that doesn't have a DTD and therefore cannot be validated, is not by that token "invalid" or necessarily in any way defective.  The only "invalid" documents are those that possess a DTD and have failed a test of conformity to it.) 
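
As a rough illustration of that notation, here is one way a DTD for the small book record shown earlier might be written, placed inside the document itself as an "internal subset". The particular rules are assumptions made for the sake of the example (that a subtitle is optional, that there may be several authors, that only three currencies are permitted); a real project would settle such matters for itself:

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!-- A book contains these children in this order;
       "?" marks an optional element, "+" one or more -->
  <!ELEMENT book (title, subtitle?, author+, publisher, price)>
  <!-- Every book must carry an ISBN attribute -->
  <!ATTLIST book ISBN CDATA #REQUIRED>
  <!-- Each of these elements contains plain text only -->
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT subtitle (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!-- The currency must be one of the listed values (an assumed list) -->
  <!ATTLIST price currency (USD | GBP | EUR) #REQUIRED>
]>
<book ISBN="1-234-56789-2">
  <title>King of England</title>
  <subtitle>A Monarch's Destiny</subtitle>
  <author>Geraint Marvin</author>
  <publisher>Royalist Books</publisher>
  <price currency="USD">39.90</price>
</book>
```

A validating parser given this file should accept it as it stands, but should report an error if, say, the price element were moved above the author element or the ISBN attribute were left out.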

 


DTDs,  bespoke and prêt à porter

Designing a DTD is not for the faint-hearted or inexperienced. The Humanities constituency for whom this paper is written will probably have needs either so simple that a DTD is not necessary, or so complex that outside assistance is required to obtain or devise one. Fortunately, such assistance is at hand from various sources, of which the one most likely to be useful and relevant to Humanities scholars is the TEI (Text Encoding Initiative).
It is impossible to give a brief yet just account of the TEI that would cover both the worth of its achievements since its foundation and, more significantly, the value of all the additional benefits it could bring to Humanities scholarship if only the people who badly need its work had a better understanding of  what they need, and how the TEI's offerings could help. A look at the TEI's own sites does not immediately bring all the enlightenment one might expect, partly because, like all grant-dependent bodies in the subtly corrupting "Bidding Culture" that poisons modern Academia, it is tempted to emphasise its own ambitions and achievements and understate its limitations, since candour or even diffidence in such matters can bring insolvency in its wake. And not all the many projects with Websites exhibiting work that ostensibly follows TEI guidelines are as informative as they initially appear, because all too often the examples of markup shown reveal an inadequate understanding of what following TEI recommendations actually entails. Sometimes it seems that "TEI-compliance" has been claimed simply to achieve the award of a grant, and has been certified by assessors who know even less about the matter than the applicants. That being said, no-one who wants to unlock the potential of XML markup for their work in Humanities scholarship can or should avoid a thorough engagement with the materials and advice offered by the TEI, and that means exploring those materials and seeking that advice first-hand, not relying on the guidance of a local "Humanities Computing Adviser" who may know next to nothing about XML and have a vested interest in keeping that ignorance well hidden. (Hint: anybody who really knows about XML must have a very strong reason for tolerating the meagre salary and dubious status of the average Humanities Computing Adviser when XML expertise is in such huge demand in commerce and industry, and even in some parts of the Academy.) 
The TEI makes freely available an XML version of the "TEILite" DTD (though the accompanying documentation has not yet been adapted to the XML version, and the examples given are occasionally XML-incompatible in ways a novice would not readily notice) as well as an ingenious, if temperamental, interactive mechanism for constructing a purpose-built XML DTD for documents where the markup possibilities offered by the "lite" version are inadequate, drawing on the huge repertoire of the full TEI tagset. An extensive revision of that tagset's documentation to ensure that it takes full and explicit account of the differing requirements of SGML and XML has now been completed and published as version P4 of the TEI Guidelines. It is freely available in HTML form, while a PDF version is available to TEI subscribers and members, and hard-copy versions may be purchased: particulars are on the TEI sites. Sebastian Rahtz of Oxford University provides, and regularly updates, a powerful and flexible, though somewhat under-documented, set of XSLT 3 style sheets for transforming documents marked up in the XML version of TEI-Lite into HTML for display on the WWW or from a CD-ROM, and anyone with sufficient understanding of XSLT should have no difficulty in adapting this material to handle portions of the full TEI tagset that are not included in the TEILite offering.
But there is something else, of inestimable worth, that the TEI offers. Through the documentation of its tagset proper, the published papers tracing the history of that tagset's evolution, and the regular contributions made by TEI practitioners to Internet discussion lists, it makes available a huge repertoire of hard-won experience, careful reflection and informed debate covering the many and various problems that surface (and re-surface) when trying to mark up all sorts and conditions of documents for a huge variety of scholarly purposes.
That repository of wisdom and experience should itself suffice to make the TEI's materials the vade mecum of anyone using XML in Humanities scholarship, even if in due course they conclude that the specific approaches and solutions suggested by the TEI do not in the end meet their encoding needs.4

 


So what can XML do for Humanities scholars?

This site, and still less this page, has no ambition to give a comprehensive answer to that huge question. This particular page was designed simply to give a basic introduction to what XML is and why it matters in general terms, though from a Humanities scholarship perspective. Beyond that, the remainder of this site is in itself evidence of what XML and its related technologies have done and are doing for one particular group of scholars. The possibilities truly are endless, but, as ever, there's work still to be done...

Michael Beddow

 


NOTES

 


1
This is, understandably enough, something publishers DON'T WANT YOU TO KNOW. And they've so far managed quite cleverly to stop you knowing it by pre-emptively engaging some of the best academics in the text management field as their "electronic publishing consultants". These consultants supply publishing firms with the necessary know-how (often acquired at public expense) to publish on-line, and also make sure that ordinary academics swallow the dogma that electronic publishing is really just as difficult and resource-consuming as paper publication and distribution and so definitely a matter for expensive "experts" only. Cui bono? one wonders, especially since after the first flush of enthusiasm many scholarly publishing houses have become less and less interested in electronic means of distribution.

 


2
  The terms "XML application" and "XML vocabulary" can give rise to unfortunate misunderstandings among those not familiar with markup jargon. Most people know the usage of the word "application" to indicate a particular type of computer program, such as a "word-processing application", or to distinguish between "applications software", the sort of program a user runs to do a specific job, and "systems software" which applications call on, transparently to the user (except when they have to reinstall Windows yet again), to handle common tasks such as file storage and screen display etc. This leads to the misconception that an "XML application" must be a program designed to do something to or with XML data. In fact "an XML application" means "the results of applying the rules of XML so as to produce a markup language for a document", and in an attempt to keep that meaning clear and distinct, XML documentation refers to programs that do things to or with XML as "XML processors". Possibly as a side-effect of ensuing confusions, the term "XML vocabulary" has come to be used within the XML terminological domain synonymously with "XML application" (so that we find references to an "XML vocabulary" for the automobile industry, meaning a set of XML tags and rules for their use specially designed to meet the needs of that particular branch of manufacturing). Though not as intrinsically misleading to newcomers as "XML application", the term "XML vocabulary" rather understates the fact that such "vocabularies" of necessity also specify the syntax of the language concerned as well as its lexis.

 
 


3  XSLT is an initially somewhat difficult but extremely powerful language that allows any XML document to be transformed either into another variant of XML, or into HTML for display as a WWW page. Some day, standard WWW browsers will be able to handle documents marked up in XML "as is". Internet Explorer 6, Mozilla (and Netscape, now defunct at Version 7.1) have made significant advances, though in different ways, in this direction. But for the time being, we have to convert our XML documents into HTML before users with current browsers can view them; and, for reasons explained later in this note, we shall probably always want to perform some manipulations on our canonical XML documents before delivering them to end users, even if those users eventually all have browsers that do not require conversion to HTML. Bob DuCharme's XSLT Quickly was the first book on XSLT to combine general accessibility with scrupulous accuracy, and it remains strongly recommendable even after the appearance of Jeni Tennison's Beginning XSLT, which manages to be both a lucid introduction and (despite its title) a handbook of advanced techniques. Unfortunately, new copies are in very short supply, though on-line dealers can supply second-hand copies. No-one who becomes seriously involved with XSLT will be able to survive without Michael Kay's superb XSLT Programmer's Reference; but although a model of concise clarity in its own terms, it is likely to frighten raw beginners off for life unless they come to it via a less rebarbative introduction.

This is not to overlook the possibility of exclusively targeting recent browsers and sending them "pure" XML documents in conjunction with a CSS style sheet. Impressive samples of this technique are provided by George L Dillon at Washington State, a pioneer of delivering TEI-conformant XML direct to CSS-compliant browsers (as a replacement for the earlier system of delivering SGML to proprietary client systems). [Sebastian Rahtz makes a CSS stylesheet by George L Dillon, suitable for viewing documents with TEI-lite markup in an appropriate browser, available alongside his own XSLT materials on the Oxford site, or at the TEI home site in Virginia.] The edition of Sir Gawain and the Green Knight at Washington State, for instance, viewed in a suitable browser on a system with adequate font support, shows how beautifully and clearly the main text of the poem can be rendered straight from the TEI markup. But it also shows some of the limitations of the XML + CSS technique compared to delivery involving XSLT transformations. First there is the point that older browsers cannot access this material at all (and this is a problem for academic texts precisely because the inertia and dogmatism of campus computing services world-wide mean that universities are always bastions of browser obsolescence), although this is admittedly an advance on delivery from SGML markup, where no standard browsers could be used "out of the box". Secondly, the top part of the document, which displays most of the content of the TEIHeader section, illustrates the main limitation of CSS compared to XSLT: it cannot by itself control the ordering or selection of material to be displayed (at best it can style specified elements as invisible, but it cannot trim them down or alter their location relative to other parts of the document). XSLT could have moved much of this information out of the immediate gaze of the reader, without making it inaccessible.
Again, and particularly relevant perhaps to the display of a poem or play, CSS cannot itself perform the sort of calculations that are needed to display marginal line numbers in a variable or user-configurable way. The lines are indeed numbered in the markup, but CSS offers no easy method of converting that markup into unobtrusive numbering by increments of, say, five or ten before displaying the text. Finally, CSS lacks the capability which XSLT possesses of converting linking information present in the XML into links that current browsers can follow (unless html markup is post-edited into the XML itself at such points, as has been done on the Washington State site), or of adding links to a text as it is rendered (in order, for instance, to create a hyperlinked table of contents derived from headings within the document and prepend it to the text). To benefit from the full power of XML related technologies, CSS is best applied to complete the visual rendering process after XSLT has transformed the original XML.
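
By way of illustration, a minimal XSLT 1.0 sketch of the line-numbering case might look like this. It assumes that each verse line is marked up as an l element carrying its number in an n attribute, as TEI-style markup commonly does; the parameter name step and the class names are invented for the example, and nothing here is taken from the Washington State or Oxford materials:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Interval between displayed numbers; can be overridden
       when the transformation is run (hypothetical name) -->
  <xsl:param name="step" select="5"/>

  <!-- Render each verse line, showing its number only
       when it is a multiple of $step -->
  <xsl:template match="l">
    <div class="line">
      <span class="linenumber">
        <xsl:if test="@n mod $step = 0">
          <xsl:value-of select="@n"/>
        </xsl:if>
      </span>
      <xsl:apply-templates/>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

A CSS rule for the linenumber class could then place the numbers in the margin, which is the division of labour suggested above: XSLT decides what is output, and CSS decides how it looks.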

A third possibility has recently been opened up by the ready availability of browsers that have reasonably full implementations of the W3C XML DOM, allowing extensive client-side manipulation of the displayed HTML in real time using standards-compliant scripting, without any browser or platform-specific programming being needed. (Though IE6 retains many proprietary features, it is not hugely difficult to write DOM manipulation routines that work with both IE6 and with more fully standards-conformant browsers.) This means that aspects of TEI-conformant markup that have up to now been impossible to render in freely-available browsers (e.g. markup of thematic annotations that cut across the basic encoding hierarchy of the document) can, via XSLT transformations and real-time DOM manipulation, be visualised and explored interactively using standard browsers and current technologies. The major obstacle to this potentially revolutionary advance in Humanities computing is the allegiance of many campus computer centres and "experts" in WWW authorship to the now seriously inadequate 4.7 series of Netscape browsers. Netscape itself, long prior to its demise, did not take this view, as witness the fact that Netscape 6 made no attempt whatever to retain backward compatibility with Netscape 4.7 for precisely those features and bugs in NS4.7 which its academic partisans claim it is essential to support. A further threat to the full exploitation of these technologies comes from the carelessness and complacency that have filled the world with Internet-connected systems vulnerable to remote attack and the mindless destructiveness of those who are bent on exploiting this vulnerability. As a (pretty ineffectual) defence against these evils, some individuals and institutions are disabling the ability of their browsers to execute programs, and so making full interactivity difficult to achieve.

 
  
The Anglo-Norman On-Line Hub Funded by AHRC