It is not easy to comment on a domain as wide, and as far from exhaustively explored, as Digital Humanities: which aspect to start with? From which perspective, and to what extent? To make the jump, I refer here to the text of a recent research work on multilingual digital corpora presented at the TEI Conference and Members' Meeting in Rome in October 2013. This is one of the topics I try to explore as part of my interest in standardization processes in the digital world. It intersects with other research concerns such as multilingual information systems and data processing, e-Learning standards and digital content development.
Digital humanities are very common today in France, although they emerged primarily (some decades earlier) in the US, led by the Alliance of Digital Humanities Organizations (ADHO) in collaboration with other institutions such as the European Association for Digital Humanities (EADH) and the Association for Computers and the Humanities (ACH).
In France, this research domain is very recent. The first THATCamp was organized in 2010 and issued a manifesto, after which a large-scale activity was launched to draw in many academic and research institutions. Following this phenomenon, we at Bordeaux Montaigne University tried to create a Digital Humanities institute to keep up with innovation in this research field. This project did not come to fruition, but a local task force on DH was created and is now working on bringing that innovative research field into educational curricula. A master's program is to be confirmed soon, and a research track has already been validated by the doctoral school of the university.
My own concern is to work on a particular aspect of this subject: creating multilingual digital corpora in the humanities using the Text Encoding Initiative standard. This is connected with previous research interests such as digital multilingualism, multilingual metadata systems and ICT standards. From that perspective we, as a group of international researchers, were able to raise funds for a one-year research project on modeling a TEI tagging system for multilingual textual corpora in French, Arabic and Berber. That was a valuable experience that allowed us to discover a still “fashionable” research topic in France. Here is the abstract of the project as we presented it at the TEI Consortium conference held in Rome in October 2013.
Promoting the linguistic diversity of TEI in the Maghreb and the Arab region
Henri Hudrisier, Rachid Zghibi, Sihem Zghidi, Mokhtar Ben Henda
For many centuries, the Maghreb region has experienced significant linguistic hybridization that slowly affects its cultural heritage. Besides Libyan, Latin and Ottoman contributions, significant quantities of other resources in various cultures and languages have accumulated in the Maghreb region, derived either from classical Arabic (regional dialects) or from various dialects of Berber (e.g. Kabyle). Several resources, such as newspapers, “city printing”, advertising media, popular literature, tales and language-learning manuals, are even composed simultaneously in several widespread or restricted languages (literary Arabic, colloquial Arabic, French, English, Berber). These resources are often written in a hybrid script mixing classical and vernacular Arabic, or combining transliteration forms between Latin, Arabic and Tifinagh (the traditional Berber script). Unlike traditional textual resources (conventional printed documents and medieval manuscripts), vast corpora of texts in vernacular idioms and scripts do not exist today. Our hypothesis, however, is that growing awareness of the diversity of these textual resources will rapidly result in an exponential increase in the number of researchers interested in collecting and studying classical old texts and oral resources. The standard TEI encoding format provides in this respect a unique opportunity to make the most of these resources by ensuring their integration into the international cultural heritage and their use with maximum technical flexibility. The "HumanitéDigitMaghreb" project, which is the subject of this presentation, intends to address several aspects of these research objectives and to initiate their appropriation.
The project targets both the oral corpora and the rich written text resources of the Maghreb region. It focuses particularly on the continuity, for more than 12 centuries, of a classical Arabic language that is still alive, and on the extreme hybridization of vernacular languages sustained by rich Libyan, Roman, Hebrew and Ottoman influences and by more recent French, Spanish and Italian linguistic interference. In short, the Maghreb is a place of extremely abundant, but largely unexploited, material for textual studies.
Our project offers comparative perspectives for understanding how to adapt TEI, originally designed for classical and modern European languages (Latin, medieval languages, etc.), in order to work on corpora in literary Arabic and in mixed languages and scripts. For example, how can researchers from the Maghreb who work on French metrics and fully understand TEI markup grasp the subtlety of Arabic meter markup? How can they develop and give examples, where possible, of terminological equivalents for metrical description in English, French and Arabic? How can they determine whether there are genuinely specific “Arabic” structural concepts and then provide the appropriate tags for them? These questions may concern “manuscripts”, “critical apparatus”, “performance texts”, etc. For “TEI speech”, however, we assume that a specific method is unlikely to be needed, although much work remains to be done. In doing this, we are aware that research on similar adaptations is being undertaken for other languages and cultures: Korean, Chinese, Japanese... These adaptations and appropriations of TEI are of great interest to us.
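To make the question about Arabic meter markup more concrete, here is a minimal sketch in Python of what one candidate encoding might look like. It is an illustration only, not the encoding adopted by the project. It relies on the `met` attribute of the TEI `<l>` (verse line) element and on `<seg>` with `type` and `subtype`; the values `tawil`, `sadr` and `ajuz` are hypothetical labels chosen here for the meter and the two hemistichs of a classical Arabic line, the function name `verse_line` is invented, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def verse_line(first_hemistich, second_hemistich, meter):
    """Build a TEI-style <l> element holding the two hemistichs of one line.

    The attribute values used here are illustrative assumptions, not values
    prescribed by the TEI Guidelines or by the project.
    """
    line = ET.Element("l", {"met": meter})
    parts = (("sadr", first_hemistich), ("ajuz", second_hemistich))
    for subtype, text in parts:
        seg = ET.SubElement(line, "seg", {"type": "hemistich", "subtype": subtype})
        seg.text = text
    return line


if __name__ == "__main__":
    # Placeholder strings stand in for the actual verse text.
    line = verse_line("[first hemistich]", "[second hemistich]", "tawil")
    print(ET.tostring(line, encoding="unicode"))
```

Whether hemistichs are better expressed as `<seg>` elements, as nested lines, or through another mechanism is exactly the kind of choice the questions above leave open.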
As a starting point, we consider that the use of TEI in the Maghreb and the Middle East is still sporadic and uncoordinated. Existing work is mainly concentrated on the study of manuscripts and rare books. This focus can be explained primarily by the existence of large collections of Oriental manuscripts in Western digital collections that have long been TEI encoded. It can also be explained by the urgency felt within Arab cultural institutions to accelerate the preservation of cultural heritage from deterioration. Thus, we assume that TEI has benefited to some extent from all the experiences and projects devoted to encoding Arabic manuscripts. However, this effort seemingly still needs more feedback of another nature, generated from other types of resources with other forms of complexity (mainly linguistic and structural). The question that drives us here is how the complexity of that cultural heritage (that of the Maghreb, as far as we are concerned) could contribute to TEI. How can its cultural and technological distinctiveness be defined compared to the current TEI P5, and what are the solutions?
In the "HumanitéDigitMaghreb" project, we focus particularly on methods of implementing TEI to address the specific complex structures of multilingual corpora. We have achieved some results, but in the long term we concentrate especially on the practical and prospective issues of very large, standardized and linguistically structured corpora that will allow all linguistic communities (and we concentrate here on the Maghreb world) to build appropriate references in order to interact correctly with translation technologies and e-semantics in the future. On this last point, it is essential that the community of Arab and Berber researchers mobilize without delay to give these languages (both written and oral) their digital modernity. Three steps are to be taken in this respect:
1. The first step, which is beyond the limits of our "HumanitéDigitMaghreb" project, inevitably involves a linguistic and sociocultural analysis of the Arabic context in order to clarify three points: first, how TEI, in its current and future versions, would encode the Arab cultural heritage; second, how the Arabic context goes beyond the limits of a single level of standard cataloguing (MARC, ISBD, AACR2, Dublin Core); and third, how it succeeds in standardizing the different approaches to the scholarly reading of its heritage.
Given its constant evolution and the need to strengthen its internationalization, the TEI community would undoubtedly benefit from these cultural and linguistic characteristics. This would also require that the community be well organized to provide adequate standardized encoding formats for a wide range of linguistically heterogeneous textual data. We can imagine here the encoding needs of electronic texts in Arabic dialects heavily interspersed with transliterated insertions or written in different character sets. These texts are potentially very complex. Besides the need to connect these materials to each other, as in parallel (often bilingual) data, there are further levels of complexity inherent in the use of multiple character sets and non-standard transcription systems (different from the International Phonetic Alphabet), and related to the need to transcribe speech in an overwhelmingly oral society, which poses interesting encoding problems.
2. The second step, which falls within the scope of our proposal, is to produce TEI standard references in local languages and to introduce them to academic and professional communities. These standards help address issues of specific linguistic complexity such as the hybridization of digital resources (local dialects) and the preservation of a thousand-year-old oral and artistic heritage. The issue of character sets is therefore not without consequence for representing local dialects, in large part because many of their cultural aspects were not taken into account in the development of existing standards (transcription of numbers and symbols, some forms of ligatures, diplomatic and older alphabets). There are, for example, many properties of the Arabic and Berber languages, such as tonal properties, regional synonymy and classical vocalization (notarial writing), that require special treatment. Current standards, in particular Unicode and, even more so, ISO 8859, do not take many of these aspects into account.
3. The third step, in which we are also engaged, is the creation of a community of practice specialized in the treatment of specific resources. We note here that most of these resources are potentially complex and that certain features probably require specific markup arrangements. This means that a dynamic environment is required to specify the encoding of these documents: an environment in which it is easy to encode simple structures, but in which more complex structures can also be encoded. It is therefore important to have specifications that can be easily extended when new and interesting features are identified.
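The encoding needs described in these steps, namely texts that switch between Arabic, Latin and Tifinagh characters, can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the project's tool. The script of each letter is guessed from the first word of its Unicode character name, which is only a heuristic (the Python standard library does not expose the Unicode Script property). Runs in a script other than the base one are wrapped in a TEI `<foreign>` element carrying `xml:lang`. Because a script alone does not identify a language, the tags use the BCP 47 subtag `und` (undetermined) followed by a script subtag; a real project would assign language tags after linguistic analysis. The function names, the `SCRIPT_SUBTAGS` mapping and the sample string are invented for the example, and the TEI namespace is omitted for brevity.

```python
import unicodedata
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping from a character-name prefix to an ISO 15924 script subtag.
SCRIPT_SUBTAGS = {"ARABIC": "Arab", "LATIN": "Latn", "TIFINAGH": "Tfng"}


def script_of(ch):
    """Return a rough script label for a letter, or None for anything else."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "").split(" ", 1)[0] or None


def script_runs(text):
    """Split text into (script, substring) runs.

    Spaces, digits, punctuation and combining marks (such as Arabic vowel
    signs) stay attached to the run they follow.
    """
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or runs[-1][0] in (None, script)):
            runs[-1][0] = runs[-1][0] or script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [tuple(run) for run in runs]


def encode_paragraph(text, base_script="ARABIC"):
    """Build a <p> in which runs in another known script become <foreign>."""
    p = ET.Element("p", {XML_LANG: "und-" + SCRIPT_SUBTAGS[base_script]})
    last = None  # most recent <foreign> child, if any
    for script, chunk in script_runs(text):
        if script in (None, base_script) or script not in SCRIPT_SUBTAGS:
            # Base-script (or unrecognized) text goes into the paragraph itself.
            if last is None:
                p.text = (p.text or "") + chunk
            else:
                last.tail = (last.tail or "") + chunk
        else:
            last = ET.SubElement(
                p, "foreign", {XML_LANG: "und-" + SCRIPT_SUBTAGS[script]}
            )
            last.text = chunk
    return p


if __name__ == "__main__":
    # Invented sample: an Arabic word, a French word, a word in Tifinagh.
    sample = "\u0634\u0643\u0631\u0627 merci \u2D30\u2D63\u2D53\u2D4D"
    print(ET.tostring(encode_paragraph(sample), encoding="unicode"))
```

The sketch only separates scripts. It cannot tell literary Arabic from a regional dialect, or French from a Latin-script transliteration of Arabic, which is precisely where human annotation and agreed encoding conventions remain necessary.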
We are interested in TEI not only for its collegial dynamics, open to non-European linguistic diversity (Japan, China, Korea…), but also for the eclectic range of its research disciplines (literature, manuscripts, oral corpora, research in the arts, linguistics...) and for its rigor in maintaining, enriching and documenting open guidelines on diversity while ensuring the interoperability of all resources produced.
The results of our work are reflected in a website that lists a collection of TEI-encoded samples of resources in areas such as music, Arabic poetry, Kabyle storytelling and oral corpora. To achieve this, we went through a fairly rapid first phase of appropriation of the TEI guidelines. The second phase would be a wider dissemination of the TEI guidelines among a larger community of users, including graduate students and, above all, scholars in the Maghreb region who are not yet convinced of the added value of TEI. These could be specialists in Arabic poetry, specialists in the Berber language, musicologists, storytelling specialists... The translation of TEI P5 into French and Arabic, as well as the development of a sample corpus and the construction of a multilingual TEI terminology or glossary in English, French and Arabic, seems very necessary.
We also intend to propose research activities within other communities acting at national and regional levels in order to be in full synergy with the international dynamics of TEI. We have already been involved in an international project, the “Bibliothèque Numérique Franco-Berbère”, aimed at producing Franco-Berber digital resources with funding from the French-speaking international organization. In short, by engaging with the school of thought of Digital Humanities and TEI, we explicitly intend not only to give our work a tangible digital reality, but also to make it easily cumulative, upgradable and exchangeable worldwide. More specifically, we expect our work to be easily exchangeable among ourselves across our three Maghreb partner languages (Arabic, French, Berber), in addition to English.
Apart from the emerging issue of managing and establishing a standardized and interoperable digital heritage, it is obvious that specialists in this literary heritage should explore the methods of study and cataloguing extensively. Therefore, this article is limited to discussing how scholars and professionals (libraries and research centers) appropriate digital humanities tools and services in the Oriental context. We will focus, among other issues, on comparative cultural problems, setting the study of ancient European manuscripts against the Arabic cultural context.