COVID/pandemic-related contributions of the COST Action CA 18209 – European network for Web-centred linguistic data science (“NexusLinguarum”)
The COST Action “NexusLinguarum” (https://nexuslinguarum.eu/) is dedicated to linguistic data science. The contributions of Action partners reported here are therefore all related to the creation of language resources that support the analysis of pandemic-related documents by applications based on language technologies (LTs). This short report lists and briefly describes some of the contributions made by members of the COST Action.
Image by courtesy of https://pixabay.com/vectors/coronavirus-globe-flags-world-5018466/
Lexicography
An important area of linguistic data science is the building of comprehensive multilingual (electronic) lexicographic datasets that can support applications making use of language technologies (LTs). A burning topic in the field of (electronic) lexicography is the ability to discover and comprehensively describe new words and expressions in actual texts of all types, from news and scientific reports to social media and the like. In the case of the Coronavirus/Covid-19 situation, this task of automatic discovery and lexicographic interpretation of neologisms is particularly challenging, as it is not confined to a specific domain (health) but also concerns diverse societal, political, educational, and legal aspects (and more).
- A member of NexusLinguarum, who is also a member of various associations in the field of lexicography, has co-organized a workshop on neologisms dedicated to the current pandemic: the Globalex Workshop on Lexicography and Neology: Focus on Coronavirus-related Neologisms (GWLN 2021, @AUSTRALEX 2021). More information about this event is available at https://globalex2021.globalex.link/.
- A working group of NexusLinguarum is dealing with a use case on public health, based on parliamentary data about Covid-19, as well as with an analysis of metaphors related to the epidemic.
- Two other members of NexusLinguarum are developing a COVID-19 collaborative glossary: https://clunl.fcsh.unl.pt/en/investigacao/projetos-curso/glossario-colaborativo-covid-19/
- Another member of NexusLinguarum was involved in developing a bilingual (French-English) thesaurus by applying a term extraction process to papers about SARS-CoV-2 and other coronaviruses, such as SARS-CoV and MERS-CoV. The dataset is available at Loterre (Thésaurus Loterre – COVID-19). This also gives us the opportunity to stress the importance of providing domain-specific terminologies in addition to lexicographic resources.
Corpora
Another type of language resource that is being developed and/or updated is the linguistically and semantically annotated collection of relevant texts on a topic, for example the Coronavirus/Covid-19 situation. Annotation is initially performed by humans who are experts either in linguistics or in the domain under consideration. Based on a relevant set of manually annotated documents, the annotation task can then be automated, with the experts controlling and curating the results. Such corpora are used for training models that can be applied to different types of language processing tasks, such as Machine Translation (MT), Information Extraction (IE), Information Retrieval (IR), or Question-Answering (Q&A). For some tasks (e.g. IE or IR), the linguistic annotation of large sets of documents (the corpora) is not enough: the documents used for training the models underlying the systems also require a domain-specific annotation (or marking). And since such applications need to be tested against manually annotated test datasets, corpus annotation is not only a central task but also one that requires a lot of careful work.
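To make the idea of testing a system against a manually annotated test dataset more concrete, the following minimal Python sketch compares automatically predicted annotations with gold (human) annotations and computes precision, recall and F1. It is only an illustration under stated assumptions: the `Span` type, the labels and the sample data are hypothetical, and exact-match scoring is just one possible choice, not necessarily the scoring used by any initiative mentioned in this report.

```python
from typing import NamedTuple, Set, Tuple


class Span(NamedTuple):
    """A hypothetical annotation: a labelled character span in a document."""
    doc_id: str
    start: int
    end: int
    label: str


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Score predicted spans against gold spans using exact matching."""
    true_positives = len(gold & predicted)  # spans identical in both sets
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example data: three gold annotations, two system predictions.
gold = {
    Span("doc1", 0, 10, "DISEASE"),
    Span("doc1", 25, 33, "SYMPTOM"),
    Span("doc2", 4, 12, "DRUG"),
}
predicted = {
    Span("doc1", 0, 10, "DISEASE"),  # matches a gold span exactly
    Span("doc2", 4, 12, "SYMPTOM"),  # right span, wrong label: counted as an error
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is strict: a prediction with the right span but the wrong label counts as an error. Evaluation campaigns often also report more lenient variants, such as partial overlap.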
A member of NexusLinguarum is centrally involved in a European initiative, the MLIA Challenge (http://eval.covid19-mlia.eu/). In this initiative, many documents related to various aspects of the pandemic have been annotated with information that supports three multilingual tasks: MT, IE and IR. On the basis of this annotation work, MLIA invited research institutions and commercial entities to train models for one or more of these LT tasks and to apply them to test datasets, so that the performance of the different systems can be compared. A second round of the challenge is under way. All results are published and can be accessed on the web page mentioned above.
Another member of NexusLinguarum, together with associated national partners, has developed several corpora supporting the IE task, which we list here:
- Corpus PubMed – COVID-19
- Corpus PubMed – Coronaviruses broadly (historical and current literature)
- https://dataverse.cirad.fr/dataset.xhtml?persistentId=doi:10.18167/DVN1/MSLEFC
- https://covid-nma.com/dataviz/
All these datasets have been collected in order to support IE and other LT tasks.
Other Activities
In the context of a NexusLinguarum working group dealing with social sciences and how linguistic data science could contribute to this field, work has been pursued in analysing, discussing, or contributing to related initiatives, like:
- GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany. GESIS Panel Team (GESIS Leibniz-Institut für Sozialwissenschaften), see https://www.gesis.org/gesis-panel/coronavirus-outbreak
- Three rounds of the European Parliament COVID-19 Survey (https://datacatalogue.cessda.eu/detail?q=%22GESIS__oai:dbk.gesis.org:DBK/ZA7736%22), (https://datacatalogue.cessda.eu/detail?q=%22GESIS__oai:dbk.gesis.org:DBK/ZA7737%22), (https://datacatalogue.cessda.eu/detail?q=%22GESIS__oai:dbk.gesis.org:DBK/ZA7738%22)