|
This was originally written as a background paper to accompany an oral presentation at the Colloquium zur Zukunft der historischen Lexikographie am Beispiel des Dictionnaire étymologique de l'ancien français, Internationales Wissenschaftsforum Heidelberg, June 2001. There have been no subsequent attempts to update all the details of how the approach to AND digitisation has changed since this paper was written, but the broad outline remains accurate enough. Some of the references were, however, brought up to date in mid-2003, including those that indicated the progress of the digitisation project at that time, so that in a number of places what were originally declarations of intent have been altered into accounts of tasks completed. This document will in due course be supplemented by a longer one that provides an outline case study for the entire Dictionary digitisation project, as finally completed early in 2006.
|
|
The nature of the task |
|
Type 1, the most frequent but also the most variable in length and structure, is exemplified in its simplest form by the following: |
|
EXAMPLE 1
belloculus
s. a precious stone: Belloculus est une pere, Si trait a blanchur sa manere Lapid 215.347.
|
|
We have a headword in bold type on a line of its own. On the next line comes an abbreviated part of speech indicator (here s. for substantive or "noun") followed by the remainder of the entry data, which can vary greatly in complexity and length but always contains at least one English translation or gloss, formatted in italics and terminated by a colon, and, in nearly every case, at least one citation from a source to exemplify usage of the headword. These citations are in a continuous sequence, with items separated by semi-colons2 and a full stop to terminate the sequence. Within each citation, the short title reference is formatted in italics, small capitals, or a combination of the two.
|
Our aim is to go from the Microsoft Word version of the above, to an xml
version something like this3: |
EXAMPLE 2
<entry type="main" key="belloculus"
id="AND-201-41B2AE38-B61FB35E-4419704A-566CB427">
<form type="lemma">
<orth>belloculus</orth>
</form>
<gramGrp>
<pos>s.</pos>
</gramGrp>
<sense>
<trans>a precious stone</trans>
<eg>
<cit>
<quote>
Belloculus est une pere,
Si trait a blanchur sa manere
</quote>
<bibl siglum="Lapid" loc="215.347">
<title rend="italic">Lapid</title> 215.347
</bibl>
</cit>
</eg>
</sense>
</entry>
|
|
|
|
Why convert to xml?

Before looking at how appropriate software can take us surprisingly painlessly (at least where such a simple entry is involved) from the likes of example 1 to something approximating TEI-conformant xml as found in example 2, it is advisable to address the doubts of those who can't see why it's worth bothering to do so in the first place. Human readers can look at the printed text of the original entry above and easily see what its various parts are and what they mean. Or at least they can if they bring to the task quite a large body of previous knowledge of what bilingual dictionaries are for and how they are generally used. Lacking that knowledge, they would probably need instructions, either from more experienced users, or from the introductory apparatus of the printed work, but once that knowledge is theirs they can use it to interpret visual clues on the page which tell them which parts of the text mean what. They realise that the practice in this dictionary is to set translations in italics, and are aware of the general scholarly convention to italicise the titles of works referred to or their abbreviated forms, so they have no difficulty in picking out the translation and the cited short title of the citation source, Lapid, and distinguishing between them, even though they are both italicised. Similarly, they recognise that the switch to italics in the citation section marks the end of the citation proper and the start of its reference, even though that important transition is not marked in any other way, and they are not confused by the fact that the transition from normal to italic typeface earlier in the entry, between the part of speech marker and the translation, has a completely different significance. When the editors were creating this entry in Word, they turned italics off and on at appropriate places so these and similar markers would be visible to the end user on the printed page.
As they typed, the editors knew what these transitions meant; and once formatting commands in the Word document have reproduced the transitions on paper or on screen, knowledgeable readers understand what they mean as well. So thanks to the repertoire of cultural knowledge that the editors and their intended readers have in common, quite complex distinctions have been conveyed, even though they were not fully explicit (and in some cases not even present) in the medium (the Word file) that mediated them. We might be content enough with that, if we were prepared to treat an electronic version of the dictionary as analogous to a microfiche edition, one that maybe takes less space and is cheaper to reproduce, but is often more trouble to use than a printed work. But that entails a regrettable renunciation of all the additional things, many of them of great scholarly value, we can do with the electronic data if only we can store it in a way that allows the computer to "see" and act upon things that human readers can discern only by drawing on knowledge that computers do not, and probably never will, possess. That is principally what xml markup, in the context of lexicographical scholarship, is all about.
|
A closer look at the markup

To turn to the xml in example 2, the first thing we notice is that the entire entry sits between opening <entry> and closing </entry> tags that identify it precisely as an entry. Those opening and closing tags, together with all that they contain, constitute a single "parent" element, within which other, "child", elements are grouped. That points to an important feature of xml markup. Although it is primarily meant to make documents more readily and fully "understandable" to machines, it does so, when properly designed, in a way that human readers also find comprehensible, if somewhat verbose. The element and attribute names indicate what the components of the data are, and their grouping and containment indicates that (and often also how) they are related to one another. Not that the average human reader is meant ever to see the "raw" xml under normal circumstances (we shall see shortly how text marked up in xml can be presented on paper or on screen like any other document). But if, for example, documents marked up in xml have to be handled by someone other than their original authors, it is generally possible to work out how to do so from the markup itself, even if there is no accompanying documentation. This is the informal meaning of the technical (and in its stricter senses rather problematic) notion that xml data is "self-describing".
|
The key attribute has a number of uses, but an explanation of one of them should be enough to give an indication of what sort of thing it is for. In dictionaries that draw on manuscript sources, or which describe languages where grammar and spelling are highly unstable, there is often no final certainty about the precise form of a given word, either because the manuscript evidence is partly illegible or ambiguously abbreviated, or because the actual "headword" form that we expect to see in a dictionary (in Anglo-Norman, for example, the infinitive of a verb) has never been found in an extant manuscript. In such cases, the form that appears in the headword is the outcome of an editorial best guess or informed conjecture, and needs to be indicated as such. One frequently adopted convention for indicating a conjectural form, used also in AND2, is to enclose it in square brackets, e.g. [baissere]. Now our human reader, scanning an alphabetical list of headwords, knows full well that the first "letter" of this word is b, not [. But our computer knows no such thing. Unless we explicitly tell it otherwise, it will see every conjectural form in the dictionary as starting with the same "letter" [. Consequently, it will sort them all contiguously, and if asked to find all words whose first letter is b will not locate the entry for [baissere] at all. It would be possible to correct this by programming an application so that it treated square brackets in a special way whenever it encountered them, but there are two problems with that. First, it means the data would be fully usable only by a specialised program that "knows" about square brackets and their significance in lexicography, and it is precisely that dependence on encapsulated intelligence in specialised programs (rather than explicit information in generic data) that xml is meant to reduce.
Secondly, recognising and handling special cases takes up processing time and resources, and with the Anglo-Norman Hub envisaged as containing many millions of words, we have better things for our processors to be doing. Hence one function of the key attribute: in the majority of cases, it is identical to the main headword, but wherever that headword contains symbols or characters that would perturb sorting or retrieval, the key attribute is given a value that omits the troublesome elements (so that the entry element for [baissere] has the attribute setting key="baissere"). When looking for alphabetical sequences or retrieving entries via their headwords (though this is not the main retrieval technique actually used) the computer examines the value of the key attribute instead of looking at the headword form itself. Last among the attributes of the entry element in this particular example, we encounter probably the most important (and fearsome-looking) of them all: the id. Identifier attributes are extremely important in xml applications. They are generally given the name id, but it's important to remember that simply using that name for an attribute does not magically turn it into an identifier, a fact which even seasoned xml practitioners forget at least once a year (what it actually takes to create a true identifier attribute is beyond the scope of this article). To grasp the importance of identifiers, it may be useful to reflect on their significance in all areas of data processing, whether or not xml is involved. If your name is Aloysius Crankhandle and you ring your electricity company to query your last bill, you may be understandably annoyed if the Customer Satisfaction Consultant at the call centre insists she can't get your records on screen unless you supply your 32-digit customer number.
If you were called John Smith, however, you might be more inclined to accept that computer systems regard personal names as too unreliable a way of identifying people before taking cash from their bank accounts. Long experience has led to a standard technique by which computers recognise entities whose everyday name does not uniquely establish their identity, whether those entities be electricity consumers or the shifting forms of Anglo-Norman vocabulary. A mechanical procedure is used to generate a unique code for each item to be identified, which must then be linked to that item in some reliable way. It's the second part that causes such grief and resentment against computer systems where human identities are concerned. Cans of beans can have their identity bar-coded at the factory, but Aloysius Crankhandle must find his reading glasses and struggle to read the small print on his bill, though John Smith will doubtless have long ago realised that he needs to know his account numbers by heart before placing a call. The values of identifier attributes in xml do not have to be numbers, indeed they aren't allowed to have a decimal digit as their first character, and they can be any common-or-garden word, as long as they uniquely identify the element to which they belong. It's that requirement for uniqueness that creates a problem for an ongoing project like the AND2 where we often have to make cross references from within an entry to items which don't yet have an entry of their own, let alone an existing id attribute. 
The solution adopted when creating identifiers for AND2 entries (and for several other elements in the markup system which also have identifiers, but which can't be dealt with here) is to take the actual text of the main headword, append some further ingredients which ensure that, for example, homographs are distinguished from one another, and then feed it into a mathematical procedure that creates numbers from text in a way that fulfils two conditions: 1) it will always produce the same output number for the same input text; 2) it will, for all practical purposes, never produce the same output number for two different texts. This gives us both our main requirements: a unique identifier for each item, plus a reliable way of finding the identifier, given appropriate information about the item, hence our ability to generate with confidence identifiers for items yet to be entered. So the initially formidable AND-201-41B2AE38-B61FB35E-4419704A-566CB427 loses most of its power to terrify: it is simply the 128-bit binary number that the procedure produced when fed with the headword and some context identification, rewritten as four groups of eight-digit hexadecimal numbers, and given a prefix that, among other things, indicates the file in which it is to be found. This is an entirely automatic process, using widely available software routines that employ well-known methods. This means other sites and projects could, if they wish, create identifiers for items in the Hub and use them to gain direct access to those items over the network, a topic that will be touched on again in the concluding part of this paper.
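The role of the key attribute described above can be illustrated with a minimal Python sketch. The headword list is purely illustrative (only belloculus and [baissere] come from the text), and the helper name make_key is hypothetical; the point is simply that the key is worked out once, stored with the data, and then used by entirely generic sorting and filtering code.

```python
import re

def make_key(headword: str) -> str:
    """Derive a sort/retrieval key once, at conversion time, by dropping
    the square brackets that mark a conjectural headword."""
    return re.sub(r"[\[\]]", "", headword)

# Illustrative headwords only; each record stores its key explicitly,
# as the key attribute does on the entry element.
headwords = ["chanter", "belloculus", "[baissere]", "abbeie"]
entries = [{"headword": h, "key": make_key(h)} for h in headwords]

# Sorting on the raw headword puts every bracketed form first,
# because "[" precedes all lower-case letters in character order.
naive_order = sorted(e["headword"] for e in entries)

# Sorting on the stored key needs no knowledge of bracket conventions.
keyed_order = [e["headword"] for e in sorted(entries, key=lambda e: e["key"])]

# Retrieval by initial letter: the raw form misses [baissere], the key does not.
naive_b = sorted(e["headword"] for e in entries if e["headword"].startswith("b"))
keyed_b = sorted(e["headword"] for e in entries if e["key"].startswith("b"))

print(naive_order)
print(keyed_order)
print(naive_b, keyed_b)
```

Because the bracket-stripping happens when the key is created rather than at query time, the programs that later sort or search the data need no special lexicographical knowledge.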
|
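The identifier scheme just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual routine: the paper does not name the procedure, so MD5 is assumed here only because it is a widely available method that yields a 128-bit number; the separator, the context string "main" and the function name make_identifier are likewise hypothetical, and the output is not expected to match the real AND2 identifier.

```python
import hashlib

def make_identifier(headword: str, context: str, prefix: str = "AND-201") -> str:
    """Build a repeatable identifier from a headword plus disambiguating context.

    Assumption: MD5 stands in for the unnamed 128-bit procedure.
    """
    digest = hashlib.md5(f"{headword}|{context}".encode("utf-8")).hexdigest().upper()
    # 32 hexadecimal digits, rewritten as four groups of eight.
    groups = [digest[i:i + 8] for i in range(0, 32, 8)]
    # The prefix starts with a letter, since xml identifiers may not begin with a digit.
    return "-".join([prefix] + groups)

first = make_identifier("belloculus", "main")
again = make_identifier("belloculus", "main")
other = make_identifier("belloculus", "homograph-2")

print(first)
print(first == again)   # same input, same identifier
print(first == other)   # different context, different identifier
```

Since the identifier depends only on its input text, a cross-reference can be generated for an entry that has not yet been written, and will still match once that entry is created from the same input.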
Finally, in this short entry at least, we find an example of how this particular sense of the word is used or where it is found, contained within an eg element (more complex entries have several of these for each sense). The example here consists (as in all AND entries) of an actual citation from a documentary source. Citations have (at least) two structural components, the text actually quoted, and the source where it is to be found, and these are duly contained in the quote and bibl elements respectively. Readers who have followed the drift of this discussion may realise that the present contents of the bibl elements have a further implicit structure which our current encoding does not fully distinguish. At the moment, the AND2 markup uses the single bibl element to contain both the title of the source and the page or other reference that locates the cited item within that source, even though these can, and for the longer-term purposes of the ANH should, be distinguished (as indeed they are in the non-standard siglum and loc attributes we currently place on the bibl elements4). This part of the markup will be replaced, in due course, either by a more differentiated set of bibliographical data within the cit elements, or by a pointer to such data elsewhere in the master document. Comparing the two versions, it should be apparent how much more a computer can "know" about this entry when it has been converted into xml. Suppose we want to see all citations from this particular source (Anglo-Norman Lapidaries, ed. P. Studer and J. Evans, Paris, 1924) that have been used to exemplify vocabulary. The computer can immediately scan the entries, pick out only the bibl elements within cit elements, and show us all the cits from the source identified as Lapid. Until the access restrictions of the pre-publication phase are lifted, general users cannot execute such a query on-line for themselves using the "live" data.
But they may like to view this sample page, which shows the output of such a query performed by an authorised user. You might counter that much the same could be done on the original version using Word's search facility. But then you could only search for the string "Lapid" across the whole of each and every entry. You might be able to eliminate false positives from possible instances of this form in other contexts by telling Word to report only those instances that were formatted in italics, but even then it might pick up other italicised instances of the string which were used to give the source simply of a variant spelling, and which therefore had no associated citation. And even where Word did find Lapid in the right place, it would still lack any means of automatically isolating and presenting (or maybe further processing) the precise citation to which the reference applies, because the boundaries of that citation are not marked in any form that Word can recognise. Or again, we might want to locate and study all instances of the Anglo-Norman word-form "trait" within citations. No problem for an xml-aware system, which can see from the markup exactly where citations begin and end. [Again, general users cannot yet access this facility, but there is another sample page which shows the outcome.] But a big problem for Word, which would present a user with each and every occurrence of the string "trait", including maybe the English homograph which might occur within a translation text or an editorial contextual qualifier, and so not be what we were looking for at all. For instance, the dictionary entry under headword element contains the gloss "trait (of character)", which was (quite correctly) not retrieved by our example search, since the processor can recognise it from the markup as an English form. 
And again, with Word, even where the right form was located in the right context, the software has no means of delineating that context and presenting it without extraneous appendages to the user, who has to rely on the human ability to spot the desired boundaries on the screen. The ability to constrain searches to particular sections of an entry as identified by the markup has many other uses. For instance, it allows us to use the intrinsically mono-directional entries (Anglo-Norman to modern English) in the reverse direction, by searching for modern English words (only) in the translation elements and displaying the Anglo-Norman headwords to which they belong. The handicap, for present-day users, of the dictionary's inevitably slow progression through the alphabet can also to some degree be circumvented. If a user encounters a term or expression in an Anglo-Norman source that lies beyond the alphabetical range of the dictionary's advance to date, it is always worthwhile searching for it in the citations for the published letters, where it may well occur in an expression illustrating a different headword, in a context that allows its meaning to be inferred. |
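The three kinds of markup-constrained query discussed above (citations from one source, a word-form within citations only, and reverse lookup through the translations) can be sketched with Python's standard xml.etree.ElementTree module. The sample data are assumptions made for illustration: the dictionary wrapper element is hypothetical, the first entry is a trimmed copy of example 2, and the second is a pared-down stand-in carrying only the gloss quoted in the text. The function names are likewise hypothetical, and this is not the query engine used on the AND server.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<dictionary>
  <entry type="main" key="belloculus">
    <form type="lemma"><orth>belloculus</orth></form>
    <sense>
      <trans>a precious stone</trans>
      <eg><cit>
        <quote>Belloculus est une pere, Si trait a blanchur sa manere</quote>
        <bibl siglum="Lapid" loc="215.347"><title rend="italic">Lapid</title> 215.347</bibl>
      </cit></eg>
    </sense>
  </entry>
  <entry type="main" key="element">
    <form type="lemma"><orth>element</orth></form>
    <sense><trans>trait (of character)</trans></sense>
  </entry>
</dictionary>"""

def _words(element):
    """Lower-cased word tokens of an element's text content."""
    return re.findall(r"\w+", "".join(element.itertext()).lower())

def citations_from(root, siglum):
    """(headword key, quotation, locus) for every citation from one source."""
    hits = []
    for entry in root.iter("entry"):
        for cit in entry.iter("cit"):
            bibl = cit.find("bibl")
            if bibl is not None and bibl.get("siglum") == siglum:
                quote = " ".join(cit.findtext("quote", default="").split())
                hits.append((entry.get("key"), quote, bibl.get("loc")))
    return hits

def entries_quoting(root, word_form):
    """Keys of entries in which the word-form occurs inside a quote element."""
    return [entry.get("key") for entry in root.iter("entry")
            if any(word_form.lower() in _words(q) for q in entry.iter("quote"))]

def headwords_for_gloss(root, english_word):
    """Reverse lookup: keys of entries whose trans elements contain the word."""
    return [entry.get("key") for entry in root.iter("entry")
            if any(english_word.lower() in _words(t) for t in entry.iter("trans"))]

root = ET.fromstring(SAMPLE)
print(citations_from(root, "Lapid"))
print(entries_quoting(root, "trait"))      # Anglo-Norman form, citations only
print(headwords_for_gloss(root, "trait"))  # English word, translations only
```

The same string, trait, yields different entries depending on which element the search is confined to, which is exactly the distinction a plain string search cannot draw.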
|
Performing the conversion

That should be enough to convince most readers that converting the dictionary to xml markup of this kind was a Good Thing in principle (and those who have doubts based on the appearance of the xml version and wonder how it could possibly be used in a normal way will soon have those doubts removed). But there were inevitably questions about whether, after so much of the dictionary had already been written up in Word format (over 10,000 substantive entries in the revised A-E, producing around 1,500 sides of quite closely-packed two-columnar A4 printout), it was either feasible or desirable to undertake the conversion, which at best might distract editorial attention from the primary task of compiling new entries and enhancing old ones, and at worst might lead to the introduction of errors into entries already scrupulously checked for accuracy and so best left untouched. These objections were understandable, and call to mind one expansion of the acronym SGML, the name of xml's venerable, expensive and abstruse parent. In the later 1980s and early 1990s, before Webmania swept across the world and brought xml in its wake, academics in the Humanities were occasionally treated to eloquent presentations of the benefits of SGML markup for scholarship from roving evangelists, many wearing TEI colours. And the audience, as it applauded politely, could be heard to mutter: "Sounds Great! Maybe Later..." The good news, for AND2 at any rate, is that largely automatic translation from Microsoft Word format to xml markup of the kind illustrated above proved perfectly feasible without any recourse to expensive or recondite software, and that the process involved, far from corrupting the text, has actually helped the editors to improve the consistency and accuracy of the entries still further.
That does not mean that it would be sensible to decide to create new dictionary entries of any complexity in a word processor if they were intended to be encoded in XML. From F onwards, where the revision is being begun as part of the digitisation process, all AND2 entries will be created in XML from the start, using appropriate software, in the same way that maintenance and correction of the batch-converted A-E entries is now done using XML editors. But given the legacy of thousands of entries already completed in Word format, there was no alternative to semi-automatic conversion. What made it possible to accomplish that conversion in a relatively short time (under two years, with no-one working anything approaching full time on this particular aspect of the ANH project) were three interconnected recent developments.

1. The astonishingly rapid and widespread development, acceptance and deployment of xml and its ever-burgeoning related technologies, including XSLT processors for transforming documents from one form of xml into any other, as well as into html for WWW deployment and the source formats of typesetting machines. This has been accompanied by the ready availability of advice and support for both new and experienced users of xml technologies on the WWW from a range of sources including private enthusiasts, non-profit organisations like the Apache Foundation, the OASIS Consortium, which publishes Robin Cover's voluminous and continually updated xml pages, as well as Microsoft, IBM, Sun and Oracle. This means that xml, although much younger than SGML and theoretically less powerful and flexible, has already proved itself a much greater force for getting things done in every area of Humanities computing as well as in business and industry. SGML required tools that were sometimes extremely expensive and invariably recondite, and its creators and advocates were in general neither skilled nor particularly interested in popular exposition. Working with xml is far from easy, but the necessary tools, knowledge and advice are freely and abundantly there for the asking for anyone willing to invest the necessary time and effort to learn their use and think through their potential. This was never the case with SGML.

2. The development of the Perl language, already the uniquely powerful and flexible tool of choice for the manipulation of textual data, to incorporate full support for Unicode, and its extension by specialist modules to create and manipulate xml in a wide variety of ways, especially where the conversion of unstructured text to complex markup is required.

3. The opening up by Sun Microsystems of the source code to the Star Office suite, and the continuing development, also supported by Sun, of OpenOffice into a set of programs, with publicly available source code, that use xml as their native format for storing data, and incorporate tools for importing other data formats, including the most recent Microsoft Word formats, and converting them to xml.
|
|
OpenOffice Writer, from the Open Office Foundation, can read native Microsoft Word documents, and save them in OpenOffice's own native format, which is well-formed xml that complies with a published specification. (Indeed in the latest releases of Open Office, it is relatively straightforward to incorporate third-party filters into Open Office itself which allow documents created in Word to be saved straight from Open Office into files using any xml vocabulary which can be created algorithmically from Open Office's own xml, but this particular feature came only after the conversion of A-E was complete, and its application is consequently not described here.) If we load the Word document containing the AND2 letter B into OpenOffice Writer and immediately save it in Writer 6.0 format, the resulting file (or rather file collection: an OpenOffice native format document is a zipped set of xml files) contains the following section, which is the OpenOffice representation of the document fragment shown in the first example (the opening and closing lines of the whole document have been left intact, but some 99% of its 2 megabytes of content have been excised at the points now marked by blank lines).
|
EXAMPLE 3
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
xmlns:office="http://openoffice.org/2000/office"
xmlns:style="http://openoffice.org/2000/style"
xmlns:text="http://openoffice.org/2000/text"
xmlns:table="http://openoffice.org/2000/table"
xmlns:draw="http://openoffice.org/2000/drawing"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:number="http://openoffice.org/2000/datastyle"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:chart="http://openoffice.org/2000/chart"
xmlns:dr3d="http://openoffice.org/2000/dr3d"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:form="http://openoffice.org/2000/form"
xmlns:script="http://openoffice.org/2000/script"
office:class="text" office:version="0.9">
<office:body>
<text:h text:style-name="P3" text:level="1">belloculus</text:h>
<text:p text:style-name="P2">
<text:span text:style-name="T2">s.<text:tab-stop/></text:span>
<text:span text:style-name="T3">a</text:span>
<text:span text:style-name="T3">precious</text:span>
<text:span text:style-name="T3">stone</text:span>
<text:span text:style-name="T2">
: Belloculus est une pere, Si trait a blanchur sa manere
</text:span>
<text:span text:style-name="T3">Lapid</text:span>
<text:span text:style-name="T2"> 215.347.</text:span>
</text:p>
</office:body>
</office:document-content>
|
|
We have an element text:h which encloses precisely the material that will be our form element(s), followed by a text:p element whose contents are exactly those that will be incorporated into the remaining contents of our entry element. So if we wrap the former in a <HEAD> element and the latter in a <BODY> element, then wrap both these in an entry element, the outer structure we want will be in place. (The <HEAD> and the <BODY> elements, as their capitalised names help to indicate, are not destined to find their way into the final TEI-based markup: they are merely intermediate devices to help us get closer to that final markup, and will be discarded once they have fulfilled that role). Within the text:p element, the text is marked up in a series of text:span elements, each of which has a text:style-name attribute. If we compare these style names to the appearance of the text in our Word example, we realise that the value T3 corresponds in this instance to italic text, whereas the value T2 marks out normal text. (This specific correspondence is particular to this file, but the mechanisms behind it are fully explained in the OpenOffice documentation and it is easy to write program code that links these names to the formats they represent). So we have the basics of what we need to pick out both the translation text and the reference title from their surroundings and mark up their structural role. If we look at the first text:span element whose content is simply the "s." of the part-of-speech indicator, we notice something else extremely useful. The tab-stop that follows the part-of-speech marker has been marked up in its due place as a text:tab-stop element. Since we know that the part-of-speech marker is bounded on the one side by the end of the headword line (here marked out by the closing tag of the text:h element) and on the other by this tab stop, we have all we need to pick out the part-of-speech indicator and in due course mark it up as a pos element within our gramGrp.
There is still more structural information waiting to be gleaned and made explicit. We know that the translation text starts after the tab stop. And it ends where the italics end, which OpenOffice has already shown in the markup it has provided. (The reason why the single italicised translation phrase is marked up by the OpenOffice xml creation routine as three separate italicised-text elements has to do with complexities of the original Word format which cannot be explored here.) So we (or more to the point, conversion programs we can write using freely-available tools) have enough information to create our sense element and within it the trans element. Because we know where our translation text ends and the citation source begins, we can also isolate for markup the text of the citation proper. And what comes between the end of the italicised source title and the closing tag of the text:p element is none other than the citation page reference. In other words, in this simple instance the OpenOffice xml version of our original Word file expresses and articulates all the information we need to create the markup we want to apply as in example 2.
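The chain of inferences set out above (part of speech up to the tab stop, translation in the first italic run, quotation in the following normal run, source title in the next italic run, locus in what remains) can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration only, not the project's conversion code, which is not shown in this paper. It assumes that T3 means italic, which holds for this file alone, that the line breaks between spans in example 3 stand for inter-word spaces, and that the entry has the simplest Type 1 shape; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

T = "{http://openoffice.org/2000/text}"
ITALIC_STYLES = {"T3"}  # assumption: true of this particular file only

FRAGMENT = """<office:body xmlns:office="http://openoffice.org/2000/office"
                           xmlns:text="http://openoffice.org/2000/text">
<text:h text:style-name="P3" text:level="1">belloculus</text:h>
<text:p text:style-name="P2">
<text:span text:style-name="T2">s.<text:tab-stop/></text:span>
<text:span text:style-name="T3">a</text:span>
<text:span text:style-name="T3">precious</text:span>
<text:span text:style-name="T3">stone</text:span>
<text:span text:style-name="T2">
: Belloculus est une pere, Si trait a blanchur sa manere
</text:span>
<text:span text:style-name="T3">Lapid</text:span>
<text:span text:style-name="T2"> 215.347.</text:span>
</text:p>
</office:body>"""

def merge_runs(spans):
    """Collapse adjacent spans with the same italic/normal status into one run."""
    runs = []
    for span in spans:
        italic = span.get(T + "style-name") in ITALIC_STYLES
        text = " ".join("".join(span.itertext()).split())
        if runs and runs[-1][0] == italic:
            runs[-1] = (italic, runs[-1][1] + " " + text)
        else:
            runs.append((italic, text))
    return runs

def parse_simple_entry(body):
    """Recover the structural parts of a simplest-case Type 1 entry."""
    spans = body.find(T + "p").findall(T + "span")
    first = spans[0]
    if first.find(T + "tab-stop") is None:
        raise ValueError("no tab stop after the part-of-speech marker")
    runs = merge_runs(spans[1:])
    # Expected pattern: italic, normal, italic, normal.
    if [italic for italic, _ in runs] != [True, False, True, False]:
        raise ValueError("not a simplest-case entry; needs fuller handling")
    return {
        "headword": body.find(T + "h").text.strip(),
        "pos": (first.text or "").strip(),
        "trans": runs[0][1],
        "quote": runs[1][1].lstrip(": ").strip(),
        "siglum": runs[2][1],
        "loc": runs[3][1].rstrip(".").strip(),
    }

fields = parse_simple_entry(ET.fromstring(FRAGMENT))
print(fields)
```

Anything that departs from the expected pattern is rejected rather than guessed at, which reflects the fact that longer entries need considerably more elaborate rules than this simplest case.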
|
XSLT can be a difficult language to learn and use, especially for people who know only the "procedural" style of programming, where you basically tell the computer what to do and in what sequence to do it. XSLT is, by contrast, a declarative language, where you provide the computer with a set of scenarios to be applied as and when it encounters particular conditions, without the programmer knowing or caring which of those conditions will in fact be met and in what order. But it is learnable by anyone prepared to take a flexible attitude and get to grips with features that, to a procedural way of thinking, sometimes seem designed to make easy things as difficult as possible (the opposite of the philosophy behind the Perl language). And no great XSLT expertise is required to write a program (generally called a "stylesheet" in XSLT circles) that will take example 3 as input and provide us as output with example 4. (As already indicated, the latest versions of Open Office allow us to incorporate such XSLT into Open Office itself and write out our modified XML like that in example 4 directly from that program, with no need ever to see the internal Open Office format displayed in example 3.) To repeat the point about the use of upper-case element names here: this is merely a convention to enable elements belonging only to this intermediate xml format to be distinguished at a glance from their OpenOffice generated predecessors and their eventual TEI-inspired successors.
|
EXAMPLE 4
<ENTRY>
<FORM>belloculus</FORM>
<BODY>
s.<TAB/><ITALICS>a precious stone</ITALICS>
: Belloculus est une pere, Si trait a blanchur sa manere
<ITALICS>Lapid</ITALICS> 215.347.
</BODY>
</ENTRY>
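The project performs this simplification with an XSLT stylesheet, which is not reproduced in this paper. Purely as an illustration of what such a step has to do, the following Python sketch (a procedural stand-in, not the stylesheet itself) turns the OpenOffice fragment of example 3 into the intermediate form of example 4: the heading becomes FORM, the tab stop becomes TAB, and adjacent italic spans are merged into a single ITALICS element. It again assumes that T3 means italic in this file; the function name simplify is hypothetical and the whitespace handling is deliberately crude.

```python
import xml.etree.ElementTree as ET

T = "{http://openoffice.org/2000/text}"
ITALIC_STYLES = {"T3"}  # assumption: true of this particular file only

FRAGMENT = """<office:body xmlns:office="http://openoffice.org/2000/office"
                           xmlns:text="http://openoffice.org/2000/text">
<text:h text:style-name="P3" text:level="1">belloculus</text:h>
<text:p text:style-name="P2">
<text:span text:style-name="T2">s.<text:tab-stop/></text:span>
<text:span text:style-name="T3">a</text:span>
<text:span text:style-name="T3">precious</text:span>
<text:span text:style-name="T3">stone</text:span>
<text:span text:style-name="T2">: Belloculus est une pere, Si trait a blanchur sa manere</text:span>
<text:span text:style-name="T3">Lapid</text:span>
<text:span text:style-name="T2"> 215.347.</text:span>
</text:p>
</office:body>"""

def simplify(body):
    """Rewrite one heading-plus-paragraph pair as an intermediate ENTRY element."""
    entry = ET.Element("ENTRY")
    ET.SubElement(entry, "FORM").text = body.find(T + "h").text
    out = ET.SubElement(entry, "BODY")
    last = None  # most recently added child of BODY, if any
    for span in body.find(T + "p").findall(T + "span"):
        text = " ".join((span.text or "").split())
        if span.get(T + "style-name") in ITALIC_STYLES:
            if last is not None and last.tag == "ITALICS" and not last.tail:
                last.text += " " + text          # merge adjacent italic spans
            else:
                last = ET.SubElement(out, "ITALICS")
                last.text = text
        else:
            if last is None:
                out.text = (out.text or "") + text
            else:
                last.tail = (last.tail or "") + " " + text + " "
            if span.find(T + "tab-stop") is not None:
                last = ET.SubElement(out, "TAB")  # keep the tab as a marker
    return entry

entry = simplify(ET.fromstring(FRAGMENT))
print(ET.tostring(entry, encoding="unicode"))
```

Writing it this way also shows the contrast drawn above: the Python version walks the spans in a fixed sequence and keeps track of state by hand, whereas a stylesheet would declare a rule for each kind of element and leave the order of application to the processor.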
|
|
|
Displaying the entries

The history of the TEI project and its reception contains frequent instances where academics (some of them on the TEI's own expert panels) have complained that SGML markup gets in the way of true scholarship because it hides the "real" text under a welter of tags. Such complaints would apply a fortiori to xml versions of TEI-compliant markup, since xml is in some places more verbose (because more explicit) than SGML. Anyone comparing example 1 to example 2 may feel those objections have some force. But that misses the point. Markup in SGML or xml is not meant to be used "raw" by anyone apart from those actually performing or checking the markup operation itself (and even in that respect, readily-affordable programs are now emerging which allow markup to be applied and edited while hiding its literal embodiment behind more "user-friendly" displays). A fair comparison at this level would require us to substitute for example 1 the actual internal storage format that Word uses to represent the formatted text in memory or on disk. That would not be a pretty (or a comprehensible) sight to the reader unaccustomed to binary storage techniques, but then no such reader would ever want to see the internal representation anyway. Example 2 is there simply to illustrate what the conversion techniques do, and to show the kind of markup which allows the searching, cross-referencing and retrieval facilities central to AND2 and ANH to operate. There is no intention of inflicting a compulsory backstage tour upon everyday users of the Dictionary or the Hub. Instead, what a user sees who requests the entry for belloculus via a browser is this:
|
EXAMPLE 5
belloculus
s. a precious stone: Belloculus est une pere, Si trait a blanchur sa manere Lapid 215.347.
|
|
The XSLT language already used to simplify example 3 into example 4 is put to intensive use on the AND server to transform the xml data in which the dictionary and all the other ANH documents are stored into html as and when users request information. In all likelihood, within a year or two, web browsers will be able to understand xml and display it in a user-friendly form. Internet Explorer, Mozilla and, until its recent abrupt demise, Netscape have all advanced towards this goal on different paths and to different degrees, but none of them has got there yet. So for the time being at least, documents marked up in xml have to be re-written into the html that current browsers understand before they are sent out. In the case of this entry, this involves rewriting the xml of example 2 to produce html like this:
|
EXAMPLE 6
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Anglo-Norman Dictionary</title>
<link rel="stylesheet" type="text/css" href="/css/and1.css">
</head>
<body>
<a name="AND-201-41B2AE38-B61FB35E-4419704A-566CB427"></a>
<div class="headword-section">
<span class="headword">belloculus</span>
</div>
<div class="body-section">
<span class="pos">s. </span>
<span class="trans">a precious stone</span>:
<span class="quote">
Belloculus est une pere,
Si trait a blanchur sa manere
</span>
<span class="source-whole"><i>Lapid</i> 215.347</span>.
</div>
</body>
</html>
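The rewriting step that produces html like Example 6 can be sketched as follows. This is an illustration in Python, not the XSLT that runs on the AND server, and the stored form of the entry shown here is an assumption: the element names (`entry`, `orth`, `pos`, `trans`, `cit`, `quote`, `title`, `loc`) and the function names are invented for the purpose and are not the project's markup. Only the class names in the output are taken from Example 6.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stored form of the entry; element names are assumed
# for illustration and are not the project's actual markup.
ENTRY_XML = """
<entry id="belloculus">
  <orth>belloculus</orth>
  <pos>s.</pos>
  <trans>a precious stone</trans>
  <cit>
    <quote>Belloculus est une pere, Si trait a blanchur sa manere</quote>
    <title>Lapid</title>
    <loc>215.347</loc>
  </cit>
</entry>
"""


def span(css_class, inner_html):
    """Wrap already-escaped html in a span carrying a class name."""
    return f'<span class="{css_class}">{inner_html}</span>'


def entry_to_html(entry):
    """Rebuild the display form, restoring the punctuation that the
    xml does not store (colon, semi-colons, final full stop)."""
    head = span("headword", escape(entry.findtext("orth")))
    body = span("pos", escape(entry.findtext("pos")) + " ")
    body += span("trans", escape(entry.findtext("trans"))) + ": "
    citations = []
    for cit in entry.findall("cit"):
        source = (f'<i>{escape(cit.findtext("title"))}</i> '
                  f'{escape(cit.findtext("loc"))}')
        citations.append(span("quote", escape(cit.findtext("quote")))
                         + " " + span("source-whole", source))
    body += "; ".join(citations) + "."
    return (f'<div class="headword-section">{head}</div>\n'
            f'<div class="body-section">{body}</div>')


html_out = entry_to_html(ET.fromstring(ENTRY_XML))
print(html_out)
```

The appearance of each class (bold headword, italic translation and so on) is left to the CSS file that the page links to, so the generated html says what each part is and the stylesheet says how it looks.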
|
|
|
|
Although some AND2 entries are a few thousand words long, a considerable proportion contain fewer than a hundred words. Obviously, the editors do not want to maintain each entry as a separate file: apart from anything else, this would make them extremely difficult to keep track of on their computers. So all the entries (and indeed many other documents and document components belonging to the AND project as well) are stored as a single virtual document on the project server. ("Virtual" here means that the software that accesses and manages this collection of data "sees" a single entity: at a lower level, this entity is stored in a number of separate physical units, but that is a matter only for the low-level system maintainers, and is hidden both from the editors and from the servers that index, search and deliver the data to end users.) The virtual document containing the xml for letters A-E is some 27 Megabytes in size and contains over a million lines of text. This is of course far too large to be loaded to a user's browser in one piece, even over a fast network link, and if it were being transferred across a modem connection, most systems would eventually give up, though probably not before the user had tired of waiting and cancelled the request. In any case, users who wanted simply to know what belloculus means would probably object to having their curiosity rewarded by a mammoth download of every other lexical item in the AND2's current repertoire. So ideally we would like the user to be able to request and obtain just the specified entry. Unfortunately, standard WWW documents written in html, and the http protocol used to transmit them, have no means of allowing this. An html document must be transferred from the server to the client in its entirety, or not at all. This highly inconvenient limitation is not removed by the provision of so-called fragment identifiers in html, because these do not work in the way most users (and indeed some Web authors) assume. 
Imagine I have a very large Web page, called, say, bigpage.htm. I can edit it in such a way that each section within it is given a distinctive name in the html markup. Let's suppose those names are smallpart01 to smallpart20. If I now want, from another document, maybe even on another server, to refer a client to one of those smaller sections, I can do so by putting into my referring document a link like http://www.myserver.com/bigpage.htm#smallpart18, where the part after the # symbol is called a "fragment identifier". When a user selects such a link, what will eventually appear on screen is indeed the portion of bigpage.htm that was labelled smallpart18. But that conceals the fact that the whole of bigpage.htm has in fact been transferred. When you click on a link like http://www.myserver.com/bigpage.htm#smallpart18, your browser doesn't in fact tell the server that you've requested a fragment. It simply chops off the url at the # symbol, and asks the server for the whole of bigpage.htm. When bigpage.htm comes in from the server, your browser searches it for the name smallpart18 and moves your display to the appropriate part instead of leaving you at the top of the document. So using standard WWW techniques, the only way to avoid transferring large documents where only a part is really wanted is for the document creators to divide their original material up into a (maybe very large) number of separate documents, and have the clients ask for them one by one. (Some sites adopt a more sophisticated variant of this by storing their sub-documents in a database and generating html from the retrieved database contents on demand; but, as with the pre-creation of smaller html fragments, there still has to be a batch process to populate the database and so end users are not being served directly with the xml that the editors have created). 
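The chopping-off of the url at the # symbol can be seen directly with the url-handling routines in Python's standard library. The short sketch below uses the example link from the paragraph above; the variable names are chosen only for this illustration.

```python
from urllib.parse import urldefrag

link = "http://www.myserver.com/bigpage.htm#smallpart18"

# Split the link the way a browser does before it contacts the server.
requested_url, fragment = urldefrag(link)

print(requested_url)  # the only part the server is ever asked for
print(fragment)       # kept by the browser, used to position the display
```

The server receives a request for `requested_url` alone, so it has no way of knowing that only one section was wanted.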
The World Wide Web Consortium became aware that this was a problem, so as part of the xml proposals, they have specified a way in which, in a future pure xml system, a browser will be able to pass a fragment identifier on to the server, which will then send back just the requested section of a larger document (this is part of what are known as the XLink and XPointer proposals). The AND server has already implemented a way of doing precisely that, but one that works today, with existing browsers. The AND2 files are all large xml documents. When you ask for an entry, our server finds the part of the xml document you want, extracts it from the larger file, converts it into html, and sends it back to you. This means that the editors can maintain files of the size that best suits their scholarly purposes, while you can receive files of a size that is convenient for transmission even over a busy or slow network. It also means that any changes or additions the editors may make to their master files are immediately reflected in the web pages sent out to clients (since those pages are drawn in real time from the master xml files): there is no need to wait for periodical batch updates of a massive html fileset, or to maintain and manage a separate database that mirrors the content of the master xml.5 |
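The "find the part you want and extract it" step can be sketched as follows. This is a minimal illustration in Python, not the AND server's own code: the `dictionary` and `entry` element names, the `id` attribute, the `extract_entry` function and the placeholder neighbours `aaa` and `zzz` are all assumptions made for the example, and a real server would index a 27 Megabyte document rather than scan it on every request.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for the large master document. The markup is
# hypothetical; "aaa" and "zzz" are placeholder entries, not real words.
MASTER_XML = """
<dictionary>
  <entry id="aaa"><orth>aaa</orth></entry>
  <entry id="belloculus"><orth>belloculus</orth><trans>a precious stone</trans></entry>
  <entry id="zzz"><orth>zzz</orth></entry>
</dictionary>
"""


def extract_entry(master_root, entry_id):
    """Return the one entry element whose id matches, or None."""
    for entry in master_root.iter("entry"):
        if entry.get("id") == entry_id:
            return entry
    return None


root = ET.fromstring(MASTER_XML)
found = extract_entry(root, "belloculus")

# Serialise only the requested entry; its neighbours stay on the server.
fragment_xml = ET.tostring(found, encoding="unicode").strip()
print(fragment_xml)
```

Because the fragment is cut from the master document at the moment of the request, an editorial change to that document shows up in the very next reply, with no batch step in between.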
|
Structure and presentation separation on the ANH site |
|
The typeface changes and punctuation items which have been left out of the xml, and then seemingly magically restored for the purposes of display, have one decisive thing in common: they are simply devices to help the reader distinguish between parts of the entry, not part of the data itself. (The Normans may have been noted for their forceful ways, but they didn't speak in bold type; nor do the editors of the AND, even if their manner is occasionally oblique, habitually speak English in italics). Our conversion to xml has encoded the identity and role of the various parts of the entry in different, more explicit ways, so the typographical and punctuation-based clues are redundant and need not be stored. However, when we come to display the entry, since the structure and role of every part of it is explicitly encoded in the xml, we can easily tell the computer to recode this information via typeface changes and additional punctuation as it creates the display. As it happens, we have chosen to use for our on-screen display more or less the same markers as in the print edition, but we need not have done so, and if we wish, we can alter the way the computer formats these elements at the moment our output is written, without any changes whatsoever to our core xml data files. One good reason for doing this would be if our server detects that the client is using a browser which cannot cope properly with some aspect of our standard display. For example, users who, for one reason or another, decide on a text-based browser such as Lynx cannot normally see italics on their screens. The default behaviour of Lynx is to display italics as bold, but we might decide instead that it would be less confusing for such users if our normally italicised text appeared in a different colour. 
In that case, when we see a client using Lynx (we can also generally "see" whether our client is a PC, a Mac or a Unix workstation) we could replace the html italics on and off codes in the output with instructions to alter the text colour. 6 But separation of content from presentation allows us to be even more flexible than this. The XSLT language, which as already indicated is what is doing the translation from xml back into an html webpage for us, is capable of doing much more than just alter text formatting and insert punctuation: it can, if we wish, include or exclude any parts of the xml we choose from the output page, as well as rearranging the sequence of those parts in whatever way we like. Supposing we discover from user responses that sometimes the wealth of citations offered by AND2 actually hinders the work of certain users, who would prefer just to see headwords, parts of speech, grammatical information and a translation. We can easily accommodate those users, again without any change whatsoever to our underlying data. We simply allow them to check a "no citations" box on the request page, and when our server sees that it will, for those users alone, strip out the citations before sending the entries back. Which brings us to perhaps the most exciting possibility which our separation of presentation and content allows. If we wish, once more without any interference whatsoever with our data in itself, we can make any of our entries resemble in typography, layout, content and colour scheme a typical entry from any other dictionary for which our server has a layout specification. Obviously, where another dictionary has information of a type we do not record (syllable division, etymology etc.), we can't create it for our entries ex nihilo. But any type of data the AND2 does have in common with another online dictionary could be displayed by our server in that dictionary's characteristic format and layout instead of the AND's native style. 
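Two of the possibilities just described, swapping italics for colour when the client is Lynx and stripping citations on request, can be sketched together. The code below is an illustration in Python rather than the XSLT the server uses; the dictionary layout of `ENTRY`, the function names, the `include_citations` switch and the choice of green are all assumptions made for the example, and whether a given text browser actually honours the colour instruction is not something this sketch establishes.

```python
from html import escape

# Hypothetical in-memory form of the entry; the underlying data is the
# same whichever way it is later displayed.
ENTRY = {
    "headword": "belloculus",
    "pos": "s.",
    "trans": "a precious stone",
    "citations": [
        {"quote": "Belloculus est une pere, Si trait a blanchur sa manere",
         "title": "Lapid", "loc": "215.347"},
    ],
}


def emphasise(text, italics_ok):
    """Italics where the client can show them, otherwise a colour."""
    if italics_ok:
        return f"<i>{escape(text)}</i>"
    return f'<span style="color: green">{escape(text)}</span>'


def render(entry, user_agent="", include_citations=True):
    """Format one entry according to the client and the user's wishes."""
    italics_ok = not user_agent.startswith("Lynx")
    out = f'<b>{escape(entry["headword"])}</b> {escape(entry["pos"])} '
    out += emphasise(entry["trans"], italics_ok)
    if include_citations and entry["citations"]:
        cits = [
            f'{escape(c["quote"])} {emphasise(c["title"], italics_ok)} '
            f'{escape(c["loc"])}'
            for c in entry["citations"]
        ]
        out += ": " + "; ".join(cits)
    return out + "."


default_view = render(ENTRY)
lynx_view = render(ENTRY, user_agent="Lynx/2.8.9rel.1")
brief_view = render(ENTRY, include_citations=False)
print(default_view, lynx_view, brief_view, sep="\n")
```

All three views are produced from the one unchanged `ENTRY`; only the instructions for writing the output differ.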
This would be a useful thing to do, not because the AND2 editors have any desire to pass off the work of colleagues elsewhere as their own, but because of the way scholars might want to work with a whole family of dictionaries relating to medieval languages. Of course, it will be wonderful to be able to open up the AND2 on line, then go maybe to the DEAF site and open that in another window, perhaps adding, as the occasion arises, a window on to the digital FEW on one's desktop; but anticipated delight at such a prospect should not trigger the vice of premature self-congratulation, of which I shall say more in my last section. For would it not be better still if, instead of having to juggle between three different sites and three different sets of commands and query languages and display layouts, we could start from any site of our choice and run queries against the other dictionaries from within that site, seeing the result presented, as far as possible, in a uniform format via a single interface? The Head of User Support at a major University once took to calling me "Professor Cloud-Cuckoo-Land" (I had made the outrageous suggestion that at least one of the PCs on campus should be equipped with a mouse: this was in 1986). I shudder to think what he would call me if he heard my latest vision of computing bliss. And yet, what I have just outlined is perfectly possible, and the ability to do it (if other sites are willing to do their part) is being built into AND2 from the very start. |
|
Anyone who looks at the xml sites of the big corporations listed above (or observes the dominant themes in Robin Cover's daily reports of xml developments) will notice the predominance of B2B (or Business to Business) among the topics. This should not send tender-souled Humanities scholars scurrying away in horror. B2B technology is about enabling enterprises, who all have their own, often proprietary and mutually incompatible systems of data storage, to trade with one another electronically. And xml is widely accepted as the key to doing that (which is why people with xml expertise and experience are remorselessly drawn away from university campuses and into the private sector where money is thrown at them). But substitute "share and exchange knowledge for the benefit of scholarship" for "trade with one another" and you have in essence the rationale of the founders of the TEI, conceived a good decade before big business smelled the coffee. But now that business has indeed woken up to the value of seamless data interchange, Humanities scholars seem to have dozed off again. There are, of course, major and by now long-established TEI projects worldwide, but they tend to be more or less self-contained in practice, if not in principle. There are also many projects that claim conformity to the TEI Guidelines, but which in fact have merely borrowed tagsets from the TEI proposals, then used them in a way that makes true interchangeability extremely difficult. This is just one manifestation of a common and extremely unfortunate vice among Humanities computing people to which I alluded a little while ago: premature self-congratulation. When my son was small, I once took him to a fish-farm where, in exchange for the entrance fee, he was given a bag of fishfood pellets. Before I could say anything, he'd tipped his entire bag into one of the large tanks outside the main entrance, and was chuckling with glee at the flurry of gobbling mouths that immediately surfaced. 
But then, once inside, he watched in increasing dismay as other kids, who had marshalled their resources more prudently, were able to summon the denizens of all the many other different tanks to the surface. "But I thought that was it, Daddy!" he explained, on the verge of tears. Too many Humanities scholars type an article into Word Perfect 5 for DOS or edit their home page in Dreamweaver and think that is it. And too many Humanities Computing Advisers sustain them in that delusion. As a result, huge opportunities are being missed, and substantial investments, often of taxpayers' money, in Humanities computing projects yield much less benefit to scholarship than they ought to. Hence my plea to all those on the threshold of digitising their dictionaries: don't just aim to get the material on to CDROM and/or the WWW somehow or other, leaving the decision on that somehow to "experts" whose word you take as Gospel (or, worse still, to commercial vendors who want to lock your data into their systems, even if they offer you a bargain price for doing so) and then think that is it. Keep your data free from dependence on any proprietary storage, retrieval or delivery systems, and above all, build the possibility of active data interchange with other related projects into your planning from the start, and make sure your technical advisers don't get away with telling you such ambitions are too difficult, too expensive, or "technically premature", because in the age of xml that simply isn't true. Having made that plea, it is not my place to lay down the law about how
any lexicographical digitisation project should proceed in any detail. What
I can do, with the backing of the AND2 editors, is give a set of firm
undertakings about how the ANH and AND2 projects will proceed, in the hope
that at least some of these commitments will be considered, and maybe
emulated, by other projects. |
|
1. All the materials in the ANH project, dictionary, bibliographies and corpora, will be marked up in xml in compliance with TEI guidelines.
2. The DTDs by which this is achieved, once finalised, will be made publicly and freely available.
3. Every key part of the ANH/AND2 digitisation will be performed and delivered to users exclusively via Open Source software. More specifically:
4. The ANH/AND2 will provide a publicly documented interface to its on-line resources, which will enable other on-line projects to link into those resources at whatever level they find appropriate, both for editorial purposes and to allow end-users to move seamlessly into and out of the AND materials from within the on-line offerings of other related projects. |
|
|
|
1 The first edition of the Anglo-Norman Dictionary (AND1) was
published in 7 fascicles between 1977 and 1992 by The Modern Humanities
Research Association (MHRA). Since 1990 the Editors have been revising
and greatly expanding letters A-E, as the first stage in a complete new
edition (AND2). Much new material has become available and been
incorporated, major layout changes have been introduced, and a corpus of
some 3.5 million words constructed which is now the principal basis of
ongoing lexicographical research. Concordancing software allows the editors
to comb the material for new lexical items and, especially, collocations and
phrases, in which the revised A-E is now uniquely rich among cognate
dictionaries. This and its coverage of non-literary registers make AND a
crucial tool for scholarly understanding of medieval French, the evolution
of English, and the history, society and culture of medieval England. For an account of AND2 from an editorial and scholarly perspective, see D Trotter, "L'Avenir de la lexicographie Anglo-Normande: vers une refonte de l'Anglo-Norman-Dictionary?", Revue de Linguistique Romane, 64 (2000), 391-405. |