What is XML and what use is it? Some answers from a Humanities perspective

Michael Beddow

The origins of Markup

Markup in its earliest forms was designed to regulate the process of going from an author's manuscript (hand or typewritten) to the printed page. Most academics regard that sort of thing as the business of copy-editors and compositors, the vital but mundane stuff publishers look after on an author's behalf. But in recent years, markup has acquired new possibilities and functions which bring it into the core domain of any scholarly activity that studies or creates texts. Scholars who remain reluctant to tangle with markup are depriving themselves of an important tool for carrying out and disseminating their work. They are also depriving themselves of the ability to do their own publishing and so either earn more revenue themselves from their work or, better still, make it freely available to the taxpayers who probably funded it in the first place.1

The notion of markup is much older than computers. It came about as a result of the introduction of printing via movable type in the fifteenth century. Before then, texts were individually produced in hand-written single copies. What the scribe wrote was what the reader read. But when the hand-written (or, later, type-written) text was handed over to a printer to be set in type for production and distribution to readers in printed form, control over the appearance of the finished product was handed over at the same time. So writers began adding to their actual text instructions to the printer as to how they wanted it to look on the page. As publishing developed into an industry in itself, placing an additional layer between writers and their readers, this marking up of texts was taken over by publishers’ editors. It grew increasingly formalised and sophisticated, able to specify very precisely all sorts of things that were not apparent from the text itself, and which were now regarded as the domain of the publisher rather than the author or the printer, such as what typeface to use, what size of type was required where, how big the margins should be, where and in what form the page numbers should appear and so on.
Originally, this markup was read by the printer (who in the early days of book production was also the publisher and often the bookseller as well) who then manually set up the type in the required way. Then in the 1960s, long before computers were commonly used elsewhere for processing text, the printing industry began to use computer-controlled typesetting machines. It was now these machines that read and acted upon the markup in order to set the type, and so markup had to be adapted into a form that computers could easily read.

A human reader looking at a typescript sees a piece of paper bearing words that represent the text, with spaces between the lines and margins at the edges. Markup information was traditionally written by hand, generally in a distinctive colour of ink, in these spaces and margins. There the human compositor could see and act on it, without any danger of getting it mixed up with the text itself. But a computer doesn’t see text as words on paper. It just sees a continuous stream of digits in which letters of the alphabet and other textual information are encoded, and it has to be told which of these digits represent text and which of them are markup information. Various ways of doing this have been tried, but the one now most widely adopted involves "tagging" the text.
|
This is an <italics on>important<italics off> concept.

Given input like this, the typesetting machine will render "This is an" into typeset letters for transfer to the printed page, but when it sees the < it will look ahead to find the next >. Then it takes whatever comes between the < and the > as an instruction and acts on it, instead of treating it as text to be printed in its own right. Here the typesetting machine would switch to italic type and carry on rendering the word "important" in italics until it hits the next <, when it would read the instruction to turn off italics and act on it, and so on.
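The scanning behaviour just described can be sketched in a few lines of Python. This is only an illustration of the idea, not a reconstruction of any real typesetting system: the function name `typeset` and the representation of output as (text, italic) pairs are assumptions made for the example, and only the two instructions from the sample line are recognised.

```python
def typeset(stream):
    """Split a tagged stream into (text, is_italic) runs, as a toy typesetter might."""
    runs = []        # collected (text, is_italic) pairs
    italic = False   # current type style
    pos = 0
    while pos < len(stream):
        lt = stream.find("<", pos)
        if lt == -1:
            # No more tags: everything left is text to be printed.
            runs.append((stream[pos:], italic))
            break
        if lt > pos:
            # Text in front of the tag is rendered in the current style.
            runs.append((stream[pos:lt], italic))
        gt = stream.find(">", lt)  # look ahead for the matching >
        if gt == -1:
            raise ValueError("unterminated tag")
        instruction = stream[lt + 1:gt]
        if instruction == "italics on":
            italic = True
        elif instruction == "italics off":
            italic = False
        else:
            raise ValueError(f"unknown instruction: {instruction}")
        pos = gt + 1
    return runs


runs = typeset("This is an <italics on>important<italics off> concept.")
for text, is_italic in runs:
    print("italic" if is_italic else "roman ", repr(text))
```

Nothing between a < and a > ever reaches the output as text; it only changes the state in which the following text is rendered.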
|
As people became used to marking up texts via such tags for processing by typesetting computers, they reflected on the fact that the visual ("presentational") features they were telling the typesetting machine to include were there because they reflected aspects of the text’s structure. This led to the idea that instead of the tags telling the typesetting machine directly what to do ("procedural markup"), they could just as easily say what the structural role of the tagged text is ("structural markup"), and that this would be an important gain for anyone who wanted to use computers for more than simply converting text into type. This was the crucial thing that moved markup from the domain of the publisher's back-room into the forefront of scholarly activity.

This is an essential point, and since many academics seem so far to have failed to grasp its implications for their work, it is worth belabouring it by a more extended, though still somewhat artificially simplified, example. Let's imagine we have a typescript like this.
Tedium
A novel
by
A.N. Other
Chapter 1
A Day at the Zoo
It was a day like any other. Except it was not. It was strangely
different.
"Is the sky always green in Worthing?" asked Ronona.
"I’m not too sure, dear," her mother replied, wondering secretly
whether Nigel had remembered to grease the mangle.
[...]
Chapter 2
What happened next
For a long time, nobody said anything.
Then the lights went back on.
[...]
|
|
The publisher’s editor, possibly in consultation with a designer, may decide the printed version ought to look like this: |
|
by A.N. Other A Day at the Zoo |
|
“I’m not too sure, dear,” her mother replied, wondering secretly whether Nigel had remembered to grease the mangle.[...]
|
|
Chapter 2
What happened next
|
|
|
|
Let’s see first how such a text might be prepared for a typesetting machine using the older style of "procedural" or "presentational" markup. The compositor keys in the author’s original text, and inserts tags that control how it will look when printed, following hand-written markup indicators placed on the typescript by the publisher's editor. (Note that here we introduce the convention that a tag beginning with a slash, like </this>, "turns off" the effect of a tag like <this>. In a real-world example the markup would be more complex and use terser tag names.)
|
[...]
<style = "bold" size = "14pt">Chapter 2</style>
<style = "bold" size = "12pt">What Happened Next</style>
<style = "normal" size = "11pt">
For a long time, nobody said anything.
Then the lights went back on.
</style>
[...]
</Times-Roman>
|
|
|
<title>Tedium</title> |
|
<para> [...] |
|
<chapter number = "2">
<title>What Happened Next</title>
|
|
|
|
</chapter> </book> |
|
title = Times-Roman 16 point bold all caps and so on. Equipped with such a stylesheet, our typesetting computer can now take the structural markup and produce exactly the same typography as it did from the earlier, presentational markup. It reads the name of each structural tag, looks up the desired presentational style in the stylesheet, and applies the style to the text between the opening and closing tags in question.
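The lookup step can be illustrated with a short Python sketch. It is an assumption-laden toy: the stylesheet is a plain dictionary, only the `title` entry echoes the rule quoted above, and the element name `chapter-title` and the other style strings are invented for the example so that a chapter heading can be told apart from the book title by name alone.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet: structural element name -> presentational style.
# Only the "title" rule comes from the text; the rest are invented.
STYLESHEET = {
    "title": "Times-Roman 16pt bold all-caps",
    "chapter-title": "Times-Roman 14pt bold",
    "para": "Times-Roman 11pt normal",
}

DOCUMENT = """
<book>
  <title>Tedium</title>
  <chapter number="2">
    <chapter-title>What Happened Next</chapter-title>
    <para>For a long time, nobody said anything.</para>
    <para>Then the lights went back on.</para>
  </chapter>
</book>
"""


def render(xml_text, stylesheet):
    """Return (style, text) pairs for every element that directly contains text."""
    output = []
    for element in ET.fromstring(xml_text).iter():
        text = (element.text or "").strip()
        if text:
            # Look up the presentational style by the element's structural name.
            output.append((stylesheet[element.tag], text))
    return output


typeset_pages = render(DOCUMENT, STYLESHEET)
for style, text in typeset_pages:
    print(f"[{style}] {text}")
```

Passing a different dictionary to `render` changes every style in the output while `DOCUMENT` itself stays exactly as it was.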
|
|
Separating structure and presentation: the benefits

What advantage does that have?
First, purely from the publisher's point of view, separating structural
information from direct typesetting instructions means it is very easy to
change the appearance of this text, and any like it. Imagine this was one
of a large series of novels, and the publishers decide that for a new
edition they want a completely new look, with different layout and
typefaces. With the earlier, procedural type of markup, a huge number of
tags would need to be altered in the files that are used to generate each
book in the series, a time-consuming and error-prone operation. With
structural markup, however, all that would need altering is the one style
sheet that translates structure into typesetting instructions. The actual
files used to produce the books themselves could remain untouched.
Human readers looking at a book draw on a complex, though generally unarticulated, repertoire of cultural knowledge that tells them which parts are its title, its chapter headings, its paragraphs, its footnotes and its index. But the computer, unless we give it extra help, sees only the stream of digits representing symbols and white space. With procedural markup alone, all the computer can "know" is what size and style of type we want and where we wish to put it on the page, which means it is powerless to help us analyse the text in any useful way.

Once we supply structural markup, however, the computer can recognise, say, chapter titles for what they are by examining the element names in the opening and closing tags. So if we ask the computer to create a table of contents for us, it can scan the elements, identify the chapter headings and create a list of contents automatically. Or if we ask which words the author has used in which chapters and how often, the computer can tell us, and even manage to leave words within headings out of consideration so as not to falsify the tally. If you compare the two examples of markup above, you will notice a number of other useful pieces of information about the text that are visible in the element names but absent altogether from the presentational tagging, even in this very simple markup scheme.

Of course if we had the time, motivation and resources to tag every single word with its part of speech, as has been done for the 4,000-odd texts in the British National Corpus (resulting in over 1.6 million tags), the amount of linguistic information the computer could extract for us would be enormous. The "meta-information" we add to the text need not be linguistic, however. We can mark up our documents so that editorial variants, manuscript abbreviations or lacunae, and commentary of whatever nature or complexity we find appropriate, are embedded in our documents in a form that computers can recognise for what it is and handle appropriately to our needs.
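The table-of-contents and word-count tasks mentioned above might look roughly like this in Python, using the standard library's `xml.etree.ElementTree`. The sample document reuses the element names from the structural example (`book`, `chapter`, `title`, `para`); its exact shape, and the crude definition of a "word" as a run of letters and apostrophes, are assumptions made for the sketch.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

NOVEL = """
<book>
  <title>Tedium</title>
  <chapter number="1">
    <title>A Day at the Zoo</title>
    <para>It was a day like any other. Except it was not.</para>
  </chapter>
  <chapter number="2">
    <title>What Happened Next</title>
    <para>For a long time, nobody said anything.</para>
    <para>Then the lights went back on.</para>
  </chapter>
</book>
"""

book = ET.fromstring(NOVEL)

# Table of contents: the <title> child of each <chapter>, in document order.
contents = [(ch.get("number"), ch.findtext("title")) for ch in book.findall("chapter")]

# Word counts per chapter, drawn from <para> text only,
# so that words in headings do not falsify the tally.
word_counts = {}
for ch in book.findall("chapter"):
    words = []
    for para in ch.findall("para"):
        words += re.findall(r"[a-z']+", (para.text or "").lower())
    word_counts[ch.get("number")] = Counter(words)

for number, heading in contents:
    print(f"Chapter {number}: {heading}")
print(word_counts["1"].most_common(3))
```

Because the headings sit in their own elements, "zoo" and "next" never enter the chapter tallies, which is exactly the distinction procedural tagging cannot supply.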
|
|
|
The success of the WWW showed users that the incompatibilities between computer systems which manufacturers had often claimed were inevitable, or conducive to competitive innovation, or both, were in fact needless barriers that given the will could be relatively easily overcome to the general benefit: and it showed hardware and software manufacturers that data interchangeability, far from damaging their commercial interests, actually fostered a huge increase in the use of computers and so in the sales of their products. |
|
|
|
XML versus HTML: language and meta-language

HTML is purely and simply a language for marking up documents so that they can become part of the World Wide Web: to achieve that, it offers a combination of a defined vocabulary with a specific syntax. Both vocabulary and syntax are laid down by an authoritative, though hardly authoritarian, body, the World Wide Web Consortium (W3C), in freely available documents. To use HTML for its specified purpose (and it has no other) you have to learn its vocabulary and respect its syntax rules (or else employ a "user friendly" system that encapsulates the requisite knowledge and makes sure you obey the rules, even if it frees you from the need to be fully aware of what they are). Hence the problems that arose when people wanted to enhance the appearance of their Web sites and expand their offerings by expressing things in HTML for which the existing language lacked either the vocabulary or the grammar. Because HTML has a predetermined set of element names with predefined meanings which can only be used in sequences or combinations specified in the language standard, if you want to do something which the language doesn't currently allow for you have to opt for one of three possibilities, all equally undesirable and none of them in any case feasible unless you have the resources of AOL/Netscape, Microsoft or the like behind you:
<title>King of England</title>

If you find that in an HTML document which obeys the rules of that language, you can immediately know quite a lot about what the <title> element means and what its place is in the overall pattern of the document. It designates the title of an HTML page, that is to say, the text most browsers show in the outer border of their window, or which a search engine may use to label a page that matches a query. You also know that there should be only one part of the document tagged in this way, since the HTML rules allow for only a single <title> element per document; and you would know something about the location of the line within the document, because the HTML syntax rules specify that the element named <title>, if present at all, must be contained within the <head> element of an HTML document, not within its <body>. So, if someone misunderstood the meaning of the <title> element in HTML and tried to place it in the main part of their page to indicate a title that was supposed to display prominently at the head of their body text, they would not see what they expected (unless their browser was very lax indeed in its interpretation of the HTML rules).

But if exactly the same line occurs in an XML file, none of that necessarily applies. With nothing but the line alone to go on, there is no way of saying what <title> means or what it tells us about the nature or function of the text it encloses. It could indeed be meant to play the same part, possibly even with the same context constraints, as its HTML namesake. But it could with equal likelihood indicate the title of, say, the biography of a monarch in a bibliographical listing; or it could be designating, not the title of either a page or a book, but one of a number of offices held by a given person, so that we might find it along with <title>Elector of Hanover</title> in a text dealing with King George I. That doesn't of course signify that an element named title can "mean anything at all" in XML: it merely emphasises that what it means (and where it can and cannot occur) in a particular XML document is a matter that was decided by whoever devised the specific "application" of XML which the document embodies and specified its "vocabulary".2 To understand the meaning and role of the element, we have to know more about the "application" of XML that the document exemplifies.
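A generic XML parser illustrates the point: it will accept the same line inside quite different vocabularies without complaint, and the only clue it can offer about what title means is the context in which it occurs. In the Python sketch below the two sample documents are invented for illustration; the element names `person` and `name` in particular are assumptions, not part of any published vocabulary.

```python
import xml.etree.ElementTree as ET

# Two hypothetical XML vocabularies that both use an element named "title".
BIBLIOGRAPHY = (
    "<book><title>King of England</title>"
    "<author>Geraint Marvin</author></book>"
)
OFFICES = (
    "<person><name>George I</name>"
    "<title>King of England</title>"
    "<title>Elector of Hanover</title></person>"
)


def title_contexts(xml_text):
    """Pair each <title> with the name of the element that contains it."""
    root = ET.fromstring(xml_text)
    return [
        (parent.tag, child.text)
        for parent in root.iter()
        for child in parent
        if child.tag == "title"
    ]


print(title_contexts(BIBLIOGRAPHY))
print(title_contexts(OFFICES))
```

Both documents parse equally well. Whether a second title is an error or a second office held is something only the rules of the particular application can settle.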
|
Explaining and constraining XML markup: the DTD

It is perfectly possible, with short documents using a simple markup scheme, to rely on the XML markup itself to make its own meaning and structural principles plain. Supposing the immediate context of the line we considered earlier was something like this:

<book ISBN="1-234-56789-2">
  <title>King of England</title>
  <subtitle>A Monarch's Destiny</subtitle>
  <author>Geraint Marvin</author>
  <publisher>Royalist Books</publisher>
  <price currency="USD">39.90</price>
</book>
|
But the freedom which XML gives to devise and name
elements as we wish can lead to another hazard, alongside possible
unclarity of meaning: it can allow us to be inconsistent in our own use
of our chosen vocabulary, creating problems when we want to access our
information and exchange it with others. The ideal here is to exploit
the power which XML's flexibility gives us (letting us mark up our texts
to suit our perceptions of their structure and meaning) while also keeping
ourselves in line and our data in order (by ensuring that having once
defined a set of element names and rules for their use, we subsequently employ
them consistently and correctly in terms of the principles we have
ourselves laid down). The technique (currently at any rate) for
constraining ourselves within such self-imposed bounds is the use of a DTD, or
Document Type Definition. |
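To give a feel for what such self-imposed rules do, here is a small Python sketch that checks a book record against a fixed list of permitted children. It is a hand-rolled stand-in, not DTD validation: Python's standard `xml.etree.ElementTree` parser does not validate against DTDs, and the particular rules chosen (which elements are required, their order, a mandatory ISBN attribute) are assumptions made for the example. The comment shows roughly how the same rules would be written as DTD declarations.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for a <book> record. A DTD might state them roughly as:
#   <!ELEMENT book (title, subtitle?, author, publisher, price)>
#   <!ATTLIST book ISBN CDATA #REQUIRED>
# Each entry is (element name, required?) in the order children must appear.
BOOK_MODEL = [
    ("title", True),
    ("subtitle", False),
    ("author", True),
    ("publisher", True),
    ("price", True),
]

GOOD = """<book ISBN="1-234-56789-2">
  <title>King of England</title>
  <subtitle>A Monarch's Destiny</subtitle>
  <author>Geraint Marvin</author>
  <publisher>Royalist Books</publisher>
  <price currency="USD">39.90</price>
</book>"""

# An inconsistent record: <name> used where <title> was agreed, no ISBN.
BAD = "<book><name>King of England</name><author>Geraint Marvin</author></book>"


def check_book(xml_text):
    """Return a list of rule violations; an empty list means the record conforms."""
    book = ET.fromstring(xml_text)
    if book.tag != "book":
        return [f"root element is <{book.tag}>, expected <book>"]
    problems = []
    if "ISBN" not in book.attrib:
        problems.append("missing ISBN attribute")
    children = [child.tag for child in book]
    position = 0
    for name, required in BOOK_MODEL:
        if position < len(children) and children[position] == name:
            position += 1          # expected child found in the expected place
        elif required:
            problems.append(f"missing or misplaced <{name}>")
    if position < len(children):
        problems.append(f"unexpected <{children[position]}>")
    return problems


print(check_book(GOOD))
print(check_book(BAD))
```

The value of a declared rule set is that the second record is caught mechanically, rather than being discovered later when a search for title silently misses it.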
|
|
|
A third possibility has recently been opened up by the ready availability of browsers that have reasonably full implementations of the W3C XML DOM, allowing extensive client-side manipulation of the displayed HTML in real time using standards-compliant scripting, without any browser or platform-specific programming being needed. (Though IE6 retains many proprietary features, it is not hugely difficult to write DOM manipulation routines that work with both IE6 and with more fully standards-conformant browsers.) This means that aspects of TEI-conformant markup that have up to now been impossible to render in freely-available browsers (e.g. markup of thematic annotations that cut across the basic encoding hierarchy of the document) can, via XSLT transformations and real-time DOM manipulation, be visualised and explored interactively using standard browsers and current technologies. The major obstacle to this potentially revolutionary advance in Humanities computing is the allegiance of many campus computer centres and "experts" in WWW authorship to the now seriously inadequate 4.7 series of Netscape browsers. Netscape itself, long prior to its demise, did not take this view, as witness the fact that Netscape 6 made no attempt whatever to retain backward compatibility with Netscape 4.7 for precisely those features and bugs in NS4.7 which its academic partisans claim it is essential to support. A further threat to the full exploitation of these technologies comes from the carelessness and complacency that have filled the world with Internet-connected systems vulnerable to remote attack and the mindless destructiveness of those who are bent on exploiting this vulnerability. As a (pretty ineffectual) defence against these evils, some individuals and institutions are disabling the ability of their browsers to execute programs, and so making full interactivity difficult to achieve.