This is the second part of our interview with the philologist Daniele Fusi on the Transkribus platform and the frontiers of electronic text analysis. In the first part, Dr. Fusi told us about his experience as a digital philologist and some of the programs used in his research field.
In the first part of this interview, Dr. Fusi, you mentioned a distinction between digitized text and digital text. Can you explain it in more detail?
Starting from the distinction between digitized and digital text: digitizing basically means transferring data to a different medium. Even a photographic image of a manuscript page is a digital text; but its level of structuring is minimal, since from the machine's point of view it is nothing more than a mosaic of dots. The path towards a digital text (or any other digital resource) leads to an ever-increasing level of structuring of the data, which goes hand in hand with an ever-greater potential for interconnection.
When an original medium is available, the first step is to transcribe the text as such, i.e. as a sequence of numerical codes representing no longer the dots of an image but a sequence of characters. The distance between these two digital representations of the text is great; and it is precisely here that frameworks such as Transkribus come into play, making it possible to reduce costs drastically. A complementary technology in this area is IIIF, whose scope extends beyond text proper, since it is aimed at the annotation of any type of image; but, given its growing popularity, it has in fact also established itself in the philological field as a very effective tool for displaying synchronously both the transcribed text and the image of its original support, along with all the paratextual information and graphic apparatus considered relevant.
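To give a concrete, if very simplified, idea of this kind of linkage: the sketch below (in Python, with purely hypothetical URLs, coordinates and text) shows how a transcribed line might be anchored to a rectangular region of a page image, in the spirit of the W3C Web Annotation model on which IIIF viewers build to display text and image together.

```python
import json

# Purely illustrative: anchor one transcribed line to a rectangular region of a
# manuscript image, in the style of a Web Annotation as used by IIIF viewers.
# The canvas URL, coordinates and transcription are all hypothetical.

canvas = "https://example.org/iiif/ms-42/canvas/p1"  # hypothetical IIIF canvas
region = "120,80,600,40"                             # x,y,width,height in pixels

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    # in IIIF Presentation 3.0, "supplementing" is the motivation typically
    # used for transcriptions of what the image shows
    "motivation": "supplementing",
    "body": {
        "type": "TextualBody",
        "value": "lorem ipsum dolor sit amet",       # the transcribed line
        "format": "text/plain",
    },
    # the target ties the text to a precise area of the page image
    "target": f"{canvas}#xywh={region}",
}

print(json.dumps(annotation, indent=2))
```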
The more the level of data structuring increases, the greater the need to define models capable of giving an adequate, and necessarily reduced, representation of its complexity: we are not dealing with reality, but with its digital representation, necessarily simplified, a bit like the Platonic demiurge with respect to the hyperuranion. The act of transcription alone implies an operation of interpretation that is often far from obvious (think of reading problems such as the expansion of ambiguous or unclear abbreviations, gaps or other deterioration of the written text, etc.) and of normalization (the orthographic one being the most evident); then think of the problems related to the constitution of a critical text, where the very notion of text is an abstraction.
The path towards a digital “edition” of the text has in fact unfolded over recent decades through a series of technologies and visions that also reflect the very evolution of information technology. When I need to illustrate this path to an audience of non-specialists, I often resort to the example of a tool such as a modern dictionary: a text that is very familiar in the collective experience, and much less problematic because it is free from the problems of historical reconstruction. Very schematically, if we consider how this tool has evolved from the paper volume to its countless digital incarnations today, we can identify a series of levels of digital structuring, exemplified by the different technologies that have marked its recent history.
Level zero here is the paper volume. Its most obvious first digitization is represented by the photographic images of its pages. This, which we can consider level 1, already makes it possible to expand its potential audience by making these images available on the Internet, and to ensure the longevity of the dictionary even beyond the duration of its physical support. But its usefulness in digital terms is limited; even just finding a headword requires a procedure of access parallel to the paper one: there we open the volume at a random point and start leafing through the pages, relying on the running heads, until we first find the page and then the headword. In the digital realm this procedure is even more laborious, both because of the latency involved in retrieving the different images over the network, and because of the greater difficulty of displaying an entire page on a screen of limited size.
Going up another step of this ideal scale, we can make a digital transcription of the text contained in these images. This is now routine for any OCR program, or for much more complex HTR systems like Transkribus. In the case of a modern printed dictionary, OCR typically produces a text document with formatting (rich text). If we imagine this document in RTF format, we have here a typographically marked text; that is, a text where some characters are reserved to represent metatextual data, in this case related to formatting.
At this level 2, the greater structuring of the textual data makes at least full-text searches possible; we can therefore search for a lemma, finding all the occurrences of the typed text without distinction. Perhaps the lemma appears typographically highlighted, for example in bold; but since bold is merely typographic information, applied to many other semantic roles as well (for example to equivalents), we will continue to find false positives.
In such a text, the content of the document, in our case lexical information, is indissolubly merged with the typographic presentation chosen by the publisher; whereas an essential requirement of the digital medium is the separation of a single content from its innumerable presentations. The corresponding typographic markup is at once too little, because it is ambiguous (e.g. bold for both lemma and equivalent), and too much, because it is redundant (e.g. margins or line spacing, which normally have no relevance in determining the semantic role of a portion of text). Any change to the layout of the document also implies a change to its content, so that it is difficult or impossible to adapt the same dictionary to different media or audiences.
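A deliberately naive sketch may make the problem tangible. Below, an invented bilingual entry is represented as a flat sequence of formatted runs, roughly the way a rich-text transcription might store it; both a plain full-text search and a search restricted to bold runs return hits whose role (headword or equivalent) cannot be told apart.

```python
# Purely illustrative "level 2" sketch: an invented bilingual entry stored as a
# flat sequence of (text, style) runs, where bold is the only clue to a word's role.

runs = [
    ("cane",   {"bold": True}),    # headword of an Italian-English entry
    (" s.m. ", {"bold": False}),
    ("dog",    {"bold": True}),    # equivalent, also printed in bold
    ("; cane da guardia ", {"bold": False}),
    ("watchdog", {"bold": True}),  # another bold equivalent
]

# Full-text search: every occurrence of the typed string is returned,
# whatever its role in the entry.
hits = [text for text, style in runs if "cane" in text]
print(hits)        # ['cane', '; cane da guardia ']

# Restricting the search to bold runs does not help much: bold marks
# headwords *and* equivalents, so the same typography yields false positives.
bold_hits = [text for text, style in runs if style["bold"]]
print(bold_hits)   # ['cane', 'dog', 'watchdog']
```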
It is then possible to go up another step and adopt a much less verbose and more abstract type of markup, which deals with the structure of a document rather than its appearance. For example, instead of specifying typeface, color and font style for the headword, we simply define it as emphasis; it will then be up to the individual presentations to propose a diversified look (perhaps red for the web, and bold for black-and-white printing). This level 3 can be represented by HTML, a structural markup designed to define the structure of a hypertext, with markers such as “title”, “paragraph”, “emphasis”, etc.; every detail related to graphic design belongs to another, complementary technology (CSS).
Of course this is a great advance; the markup is much more streamlined, and the graphic design adaptable. Yet here, too, our search for a headword may encounter the same difficulties. The results still include false positives, because if headwords are in emphasis, any text with the same emphasis is included. What is needed is a specifically lexicographic markup, i.e. one that explicitly says things like “lemma” or “equivalent” rather than things like “title” or “emphasis”.
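The same limitation can be shown with a small, invented HTML fragment: collecting everything marked as emphasis still lumps together headwords, equivalents and any other emphasized text, because the markup records structure, not lexicographic roles.

```python
from html.parser import HTMLParser

# Purely illustrative "level 3" sketch: the same invented entry as structural
# markup. The tags say "emphasis", not "headword", so roles remain implicit.
page = """
<p><em>cane</em> s.m. <em>dog</em>; cane da guardia <em>watchdog</em>;
see also the <em>Note on spelling</em> at the end of the volume.</p>
"""

class EmphasisCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_em = False
        self.found = []
    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self.in_em = True
    def handle_endtag(self, tag):
        if tag == "em":
            self.in_em = False
    def handle_data(self, data):
        if self.in_em:
            self.found.append(data)

collector = EmphasisCollector()
collector.feed(page)
# headwords, equivalents and unrelated emphasized text are indistinguishable:
print(collector.found)   # ['cane', 'dog', 'watchdog', 'Note on spelling']
```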
The solution is found by climbing another step of this ideal scale, to XML's semantic markup. XML is a markup “language” that includes countless “dialects”, each with its own lexicon and syntax, specialized for a given task. In the case of lexicography, we could then have specific markers for each semantic role. This level 4 is that of the well-known TEI, the de facto standard in the humanities for representing any document of historical interest, according to a very rich semantic markup divided into various functional areas. TEI is an extremely effective standard because it is very broad and capable of containing within it all the information related to a text, in a digital format that is basically nothing but text, so as to guarantee its longevity and interoperability.
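As a purely illustrative sketch, here is a much simplified TEI-like entry (the entry itself is invented), queried so that only the elements explicitly playing the role of headword, or of translation equivalent, are returned; at this level the false positives of the previous steps disappear.

```python
import xml.etree.ElementTree as ET

# Purely illustrative "level 4" sketch: a much simplified TEI-like dictionary
# entry, where every piece of information carries its semantic role explicitly.
tei = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry xml:id="cane">
      <form type="lemma"><orth>cane</orth></form>
      <gramGrp><pos>noun</pos><gen>m</gen></gramGrp>
      <sense n="1">
        <cit type="translation" xml:lang="en"><quote>dog</quote></cit>
      </sense>
    </entry>
  </body></text>
</TEI>
"""

ns = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei)

# A "search for a headword" can now target exactly the elements playing that
# role, with no false positives from equivalents, notes or other emphasis.
lemmas = [o.text for o in root.findall(".//tei:form[@type='lemma']/tei:orth", ns)]
equivalents = [q.text for q in root.findall(".//tei:cit[@type='translation']/tei:quote", ns)]
print(lemmas)        # ['cane']
print(equivalents)   # ['dog']
```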
Naturally, as in any research field, the availability of new tools makes it possible to move forward by building on the foundations of the progress already made. To stay within our example, the most recent evolution in the lexicographic field has by now embraced the mutually advantageous integration of corpora and dictionaries, and the concept of the lexicographic database, where each word enters a complex network of relationships with all the others, and it is precisely this network that defines its nature. Rather than a long alphabetical list of entries discussing each word, one must imagine a graph made up of countless interconnected nodes forming a huge web of linguistic relations, such as to determine the semantic field of each word. This web of linguistic and conceptual relations finds perfect expression in the ontologies of the semantic web, with vocabularies such as SKOS or Lemon, allowing different projects to weave their networks into an increasingly wide common fabric from which new information can be drawn.
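A tiny, purely illustrative example of such a graph might look like the sketch below, which assumes the rdflib Python library and uses hypothetical URIs: a lexical entry in the spirit of OntoLex-Lemon, whose sense points to a concept that a few SKOS relations begin to weave into a wider semantic field.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Purely illustrative sketch of a lexical graph in the spirit of OntoLex-Lemon
# and SKOS; it assumes the rdflib library, and the example.org URIs are hypothetical.
ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("https://example.org/lexicon/")

g = Graph()
g.bind("ontolex", ONTOLEX)
g.bind("skos", SKOS)
g.bind("ex", EX)

# A lexical entry "cane" whose canonical form and sense are explicit nodes,
# and whose sense points to a shared concept of "dog"...
g.add((EX.cane, RDF.type, ONTOLEX.LexicalEntry))
g.add((EX.cane, ONTOLEX.canonicalForm, EX.cane_form))
g.add((EX.cane_form, RDF.type, ONTOLEX.Form))
g.add((EX.cane_form, ONTOLEX.writtenRep, Literal("cane", lang="it")))
g.add((EX.cane, ONTOLEX.sense, EX.cane_sense))
g.add((EX.cane_sense, RDF.type, ONTOLEX.LexicalSense))
g.add((EX.cane_sense, ONTOLEX.reference, EX.Dog))

# ...and a handful of SKOS relations that start weaving that concept into a
# wider semantic field, potentially shared with other projects.
g.add((EX.Dog, RDF.type, SKOS.Concept))
g.add((EX.Dog, SKOS.prefLabel, Literal("dog", lang="en")))
g.add((EX.Dog, SKOS.broader, EX.DomesticAnimal))
g.add((EX.Dog, SKOS.related, EX.Wolf))

print(g.serialize(format="turtle"))
```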
This ideal level 5 on our scale of examples takes us a long way, to where the very boundary between traditionally distinct works (monolingual and bilingual dictionaries, dictionaries of synonyms and antonyms, etymological dictionaries, specialist dictionaries, encyclopedic works, text corpora, etc.) tends to fade into a huge canvas of potentially infinitely expandable relationships; think, for example, of lexicographic databases such as WordNet, BabelNet, etc. One can therefore draw on a true representation of the data, no longer necessarily forced into the cage of their original presentation.
Returning to texts of historical interest, a great possibility of expansion lies precisely in their integration with conceptual resources of every kind and specialization. There are already very effective tools for starting to introduce semantic markup at this level of abstraction into texts that present certain concepts. From simple microformats in traditional HTML pages to semantic identifiers embedded in XML markup, it becomes possible to connect the text to a selection of concepts that is part of this global shared heritage in continuous expansion, thanks to that “network effect” which has already sanctioned the success of the traditional web. This is what the Gramsci digital project already seems to me to be pursuing: automatic and manual semantic annotations are widely used to build a dense network of philosophical, historical, literary, political, etc. concepts, in their connection to that specific text, but also in relation to all the other concepts represented in the large Linked Open Data cloud. The text here becomes part of a much larger federated system of relationships, and consequently also the object of searches that are no longer only formal, but also conceptual, or a combination of the two.
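In its simplest form, such an annotation can be thought of as standoff markup: a span of text linked by character offsets to a concept identified by a URI. The sketch below is purely illustrative, with invented text, offsets and URIs; in a real project the URIs would typically point into the Linked Open Data cloud (DBpedia, Wikidata, and so on).

```python
from dataclasses import dataclass

# Purely illustrative standoff semantic annotation: a span of a text is linked,
# by character offsets, to a concept identified by a URI. Text, offsets and URI
# are invented; real annotations would point to shared Linked Open Data identifiers.

@dataclass
class SemanticAnnotation:
    start: int        # index of the first character of the annotated span
    end: int          # index just past the last character
    concept_uri: str  # identifier of the concept in some shared ontology

text = "The notion of hegemony runs through the whole of the notebooks."

annotations = [
    SemanticAnnotation(14, 22, "https://example.org/concepts/hegemony"),
]

for a in annotations:
    print(f"'{text[a.start:a.end]}' -> {a.concept_uri}")
```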
In such a perspective, highly structured digital texts tend to enter a much broader system of relationships, combining corpora and specific conceptual domains. The models of the latter, freed from overly tight mental and technological constraints, can expand and take shape "iuxta propria principia", and in turn relate to an even broader network of globally shared ontologies. This perspective is in my opinion one of the most fertile frontiers of digital editions, which, in an apparently paradoxical way, exalt the text itself even more by going beyond its boundaries. This is useful especially when the contents are intrinsically complex, or when systematic analysis of them (e.g. morphological or syntactic) enriches them to such an extent that it becomes impractical to contain them in all their complexity within the sole structure constituted by markup, which here may become a limitation.
We can therefore imagine a textual edition no longer in terms of the simple presentation of its text, but as an articulated container, where textual, metatextual and non-textual data all enjoy the same citizenship and enter fully into a potentially infinite network of relationships. It is a perspective that, by shifting the focus from presentation to data representation, which is one of the roots of the semantic web, can also lead to overcoming many of the practical and theoretical difficulties often encountered today in dealing with complex TEI-based markup, and which it would take too long to discuss here.
Moreover, once we embrace and integrate the different technologies and visions I have mentioned in this short and very partial example, it also becomes easier to integrate into the text models and information that simply could not find a place in purely marked-up text, and also to try to reduce the distance that today separates approaches such as close and distant reading. Just as in an image annotated with IIIF one can zoom from the overview down to the most minute detail within the same data network, it may become possible for a text to pass from overviews produced by aggregations and statistical elaborations on huge corpora down to single passages in texts of our interest, carrying out searches that freely combine patterns of various kinds (graphic, phonetic, syntactic, semantic…) with the more traditional formal parameters of search within a specific text. While it is far from easy to create tools capable of both broad overviews and highly specialized analysis of individual passages, combining pick and chisel, it may be possible to start connecting data in a much wider network than the one to which simple markup may so far have relegated them.
To be continued…