The Action

NexusLinguarum is a COST Action that currently has participants from 42 countries (39 COST Countries, 1 Near Neighbour Country, and 2 International Partner Countries). So far, 239 members have joined the different working groups (WGs), a number that has been growing steadily since the network’s inception.

What the Action does

The main aim of the Action is to promote synergies across Europe between linguists, computer scientists, terminologists, and other stakeholders in industry and society, in order to investigate and extend the area of linguistic data science. We understand linguistic data science as a subfield of the emerging field of “data science”, which focuses on the systematic analysis and study of the structure and properties of data at large scale, along with methods and techniques to extract new knowledge and insights from it. Linguistic data science is a specific case, concerned with providing a formal basis for the analysis, representation, integration and exploitation of language data (syntax, morphology, lexicon, etc.). The specific characteristics of linguistic data have so far remained largely unexplored in a big data context.

Supporting the study of linguistic data science in the most efficient and productive way requires the construction, at Web scale, of a mature holistic ecosystem of multilingual and semantically interoperable linguistic data. Such an ecosystem, unavailable today, is needed to foster the systematic cross-lingual discovery, exploration, exploitation, extension, curation and quality control of linguistic data. We argue that linked data (LD) technologies, in combination with natural language processing (NLP) techniques and multilingual language resources (LRs) (bilingual dictionaries, multilingual corpora, terminologies, etc.), have the potential to enable such an ecosystem. By addressing the semantic interoperability problem, it would allow for transparent information flow across linguistic data sources in multiple languages.

More info about the Action

In recent years, the LR community has progressed in making LRs available on the Web, defining joint strategies among European institutions and industry, and creating new spaces for cooperation that have incentivised the creation and sharing of LRs. Past projects have created an ecosystem of complementary yet isolated closed and open access LRs in heterogeneous formats, which require many different APIs and services to query. Their discovery is hindered by the fact that they are described in various overlapping catalogues and reside in various repositories with different metadata schemas. Further, the reuse and combination of these LRs by third-party applications is often costly because of the variety of formats and semantic interoperability issues. To reduce such issues, linked data techniques have been applied to LRs, which has led to the emergence of the so-called Linguistic Linked Open Data (LLOD).

LLOD builds on linked data to share and interlink linguistically relevant data sources. Linked data technology is now mature and is being increasingly adopted by industry and public institutions worldwide (e.g., national libraries, museums, media companies, and public administrations, among others). At the same time, the benefits of sharing linguistic data on the Web in a semantically interoperable manner have been recognised by the LR community, which has shown increasing interest in publishing linguistic data and metadata as linked data on the Web. As a result of interlinking a number of open monolingual and multilingual LRs, the so-called LLOD cloud (http://linguistic-lod.org/llod-cloud) has emerged.
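To make the idea of interlinking concrete, the following minimal Python sketch represents two small dictionaries as linked-data-style triples (subject, predicate, object) and finds translations through a concept that both resources point to. It is only an illustration under stated assumptions: every URI under example.org, the predicate names `writtenForm` and `denotes`, and the function `translations` are hypothetical and are not taken from the Action or from any actual LLOD resource.

```python
# All URIs below are hypothetical placeholders, not real LLOD resources.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple, as in linked data.
english_dictionary = [
    (EX + "en/house", EX + "writtenForm", "house"),
    (EX + "en/house", EX + "denotes", EX + "concept/House"),
]
spanish_dictionary = [
    (EX + "es/casa", EX + "writtenForm", "casa"),
    (EX + "es/casa", EX + "denotes", EX + "concept/House"),
]


def translations(entry, source, target):
    """Return written forms in `target` whose entries share a concept with `entry`."""
    # Concepts that the source entry denotes.
    concepts = {o for s, p, o in source if s == entry and p == EX + "denotes"}
    # Target entries that denote at least one of those concepts.
    entries = {s for s, p, o in target if p == EX + "denotes" and o in concepts}
    # Written forms of the matching target entries.
    return sorted(
        o for s, p, o in target if s in entries and p == EX + "writtenForm"
    )


print(translations(EX + "en/house", english_dictionary, spanish_dictionary))
```

Because both resources refer to the same concept URI, neither dictionary needs to know about the other's format. The shared identifier is what carries the link between them.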

Although the term “data science” has been around for decades, it was popularised only recently. Data science takes advantage of mature fields such as statistics, data mining, and machine learning. In particular, the recent success of deep learning is promising for both the generation and exploitation of linguistic linked data. Through linguistic data science, we can understand the nature of language more deeply by putting in place formal methods for the representation, integration and comparison of language data. Further, because language is the way in which human knowledge is usually transmitted, linguistic data science has the potential to profoundly change studies in fields that make wide use of natural language for sharing knowledge, as is the case in the humanities (e.g. for understanding literature in new ways or for predicting and analysing social trends), legal studies, journalism, and the social sciences, among many others.