Corpora in scientific research

In recent years, the rapid acceleration of digitization has led to considerable growth in research activity around the construction and analysis of corpora. This phenomenon points to a conceptual and epistemological renewal in scientific research. However, while corpora now appear to be constitutive of almost all research activity in the Humanities and Social Sciences, the ways in which they are designed, processed, analyzed and put to use vary according to the disciplines and the contexts in which they are created and exploited. Researchers and communities of practice each tend to appropriate them using a wide variety of methods and tools.

In fact, the question of corpora raises a series of further questions about their typologies, and also about how methodological paradigms shape their construction and the ways in which they are analyzed. Ideally, a corpus is a set of product samples (body of studies) designed to be representative of a domain (existing body of work) or a subdomain (reference body) through rigorous and methodical selection. This measured construction of representative samples constitutes a "discursive space" made up of a limited set of elements (statements) that "constitute sources of evidence that are quite valid" (Stark, 2014).

The question of corpora also raises an epistemological debate about the researcher's positioning: their involvement in or distance from their body of study, their methodological choices, and the way they relate the results of their analysis to reality. As a process of access to knowledge, the corpus thus finds itself at the center of a long-standing scientific debate between several currents of thought: positivism, with its ontological, deterministic and dualistic (cause and effect) hypotheses of an objective reality; the sociolinguistic tendency (first "variationist", then interactive and now anti-objectifying); and modern constructivism, which "introduces a new, more tangible, viable relationship between knowledge and reality" (Von Glasersfeld, 1994).

From a constructivist standpoint, which considers the corpus as a construction dependent on the researcher but also on the subject and the research environment, the research methodology "depends on the ability of the researcher to adapt his analysis to the results, and to become aware of this dependence between method and results" (Mucchielli, 2006). This raises the question of the "scientificity" of the actor-researcher's subjectivity and experience. How can a researcher in the humanities appropriate the corpus, from "collection" to analysis, by means of a methodology? The corpus would thus belong more to the field of hermeneutics, as a methodological parameter defined by an objective of reading and analysis that can feed an interdisciplinary reflection that is now unavoidable. Indeed, the constructivist tendency rests on the principle that all disciplines inspire one another by requiring researchers to reconsider their prejudices, their methods and their points of view, and even to create a new approach, without however stepping outside the scope of their source discipline.

In the research fields of the humanities, now marked by a growing interdisciplinarity, there are nevertheless points of convergence and forms of complementarity that make "shared problems emerge then on the practices of encoding information, on the structuring, dissemination and archiving of corpora" (Marin Dacos and Pierre Mounier, 2014). The varied scientific density of corpora is now part of a constructivist methodology which assumes that all disciplines, by inspiring each other, force researchers to reconsider their prejudices, their knowledge and their methodological know-how, and even to develop new skills in order to take part in interdisciplinary work.

Encoding and corpus markup standards are emerging, along with new technological tools that specialize in managing collections of textual and multimodal resources, both in their collection, markup and annotation phases and in their referencing, indexing and search. Going beyond the traditional processes of electronic document management (EDM), which are based on the principles of document referencing, the constitution and organization of digital corpora, understood as constructed objects, address the deeper substance of the document as an object.

The corpus thus becomes the privileged means of constructing meaning by integrating, among other things, the technologies of linguistic engineering, semantic analysis and pattern recognition. Enriched by technological advances related to Web 3.0 and semantic networks, including domain ontologies and knowledge organization systems expressed in SKOS (Simple Knowledge Organization System), corpora have quickly become the subject of dedicated standards proposed by research communities that now identify with the Digital Humanities. The TEI consortium, for example, proposes the TEI standard (Text Encoding Initiative), which gave rise to the MEI (Music Encoding Initiative) and the CES (Corpus Encoding Standard). Standardization efforts around corpora reached their highest level with the International Organization for Standardization (ISO), which in 2016 published the ISO 24624:2016 standard for the transcription of annotated corpora of audio and video recordings of spoken interactions.