NexusLinguarum Working Group 3: An Overview
Structured language data are important in a substantial number of research fields but are generally costly to curate manually, update dynamically, and represent in an interoperable manner. Working Group 3 (WG3) investigates different types of support for linguistic data science, ranging from large-scale data analytics techniques and neural methods to linked-data-aware NLP techniques in combination with Linguistic Linked Open Data (LLOD). This work is organized into the following specific tasks:
- Task 3.1: Big Data and linguistic information
- Task 3.2: Deep learning and neural approaches for linguistic data
- Task 3.3: Linking structured multilingual language data across linguistic description levels
- Task 3.4: Multidimensional linguistic data
- Task 3.5: Education in linguistic data science
Before detailing the activities, initiatives, and outcomes of the individual tasks, we would like to highlight that none of these would have been possible without the very active and strong support of the individual task leaders. They are internationally renowned and have supported each task with their expertise, and their readiness to constantly support this Action (CA18209) and all its initiatives should be not only acknowledged but gratefully celebrated. We therefore deeply thank Dimitar Trajanov, Radovan Garabik, Thierry Declerck, Ineke Schuurman, Renato Rocha Souza, and Rute Costa for their substantial contributions to and strong support of this working group and its topics.
The intersection of Big Data and structured knowledge has so far focused on Knowledge Graphs rather than linguistic data. To address the specific benefits that can be gained from applying Big Data techniques and platforms to LLOD processing, from storage to generation, we have prepared and submitted a position paper. This submission includes use cases such as identifying links in very large linguistic graph data more efficiently, accessing corpora and diachronic information in a performant manner, extracting information from streamed, dynamically updated linguistic data, and interacting efficiently with linguistic knowledge graphs in combination with neural methods. Since this is a rather novel field, the use cases provided are meant to exemplify a mutually beneficial union rather than to be exhaustive.
Our task on neural methods for linguistic data has been highly active, since it represents one of the most prominent research topics at the moment. Apart from several workshops on this topic, including the workshop on “Deep Learning, Relation Extraction and Linguistic Data with a Case Study on BATS” at LDK 2023 in Vienna this September, it has led to several publications and initiatives on neural relation acquisition from large language models. These include a very large activity to translate The Bigger Analogy Test Set (BATS) into approximately 16 European and non-European very low-resource languages, together with corresponding relation acquisition experiments. In addition, it is the main topic of the upcoming NexusLinguarum summer school, the “5th Summer Datathon on Linguistic Linked Open Data” (SD-LLD 2023), in Croatia this June. This training event focuses strongly on seminars and hands-on sessions with international experts on LLOD and large language models, LLOD and Graph Neural Networks, triple verbalization, and Knowledge Graph embeddings with a strong linguistic focus, among many other topics.
Linguistic data science, especially within the context of LLOD, has strongly focused on interoperability. However, one aspect that has not yet been explicitly evaluated is interoperability across different linguistic description levels, from lexicography and terminology to phonology, diachronic data, and pragmatic aspects. To evaluate the current state of this topic, we, together with a very large number of expert authors, submitted a substantial and theoretically founded literature survey. It covers existing approaches to individual linguistic description levels as well as their interoperability and intersection, in order to identify current gaps in the field. Our findings clearly show that specific levels, especially phonology and phonetics as well as pragmatics, have been strongly underrepresented. They also show that further intersectional approaches across linguistic description levels would be highly beneficial, using the high interoperability of LLOD platforms as a true opportunity for fundamental linguistics research.
The initial intuition on multidimensional linguistic data was to include a large number of dimensions, from time and space to language variations, diachronic aspects, and, more generally, an interoperable sociolinguistic approach. This ambitious intuition was challenging to operationalize, which is why the task was finally split into two: one on time and space and one on multimodal LLOD representations. The latter is particularly interesting, since the representation of sign language data was quickly identified as an ideal use case. It has since culminated in a number of proposals for interoperability between different culture-specific sign languages and LLOD representations, as well as a general proposal for how to represent sign language data in combination with existing LLOD approaches and link it to existing linguistic data from written language. This work is potentially interesting for other use cases that wish to jointly represent videos, different lexical representations, and existing lexicons.
Within NexusLinguarum in general, the topic of education in linguistic data science has taken on a variety of highly successful forms and initiatives. An initial compilation of existing and possibly related courses allowed us to establish the uniqueness of our approach, which then led to two separate initiatives that are strongly related in terms of the people involved and the topics covered. The first is the preparation of a Massive Open Online Course (MOOC) on linguistic data science to provide basic knowledge to anyone interested in the topic. The second is the proposal of an international master’s program within the ERASMUS+ funding scheme. This has already begun with the successful award of an Erasmus Mundus Design Measures grant (ERASMUS-EDU-2022-EMJM-DESIGN), which financially supports the preparation of an ERASMUS+ proposal. Thus, we expect to provide substantial educational initiatives for linguistic data science as an outcome of this Action.