Workshop on Discourse studies and linguistic data science (DisLiDas)

Meeting Dates: 24th May 2022

Venue: Jerusalem, Israel (& online)

Organizer Institution(s): Mozaika, Ltd., Sofia, Bulgaria

Local Organizer(s): Chaya Liebeskind, Jerusalem College of Technology, Jerusalem

Organizing committee: Purificação Silvano, Christian Chiarcos, Mariana Damova, Giedre Valunaite Oleskevicienė, Dimitar Trajanov, Ciprian-Octavian Truica, Elena-Simona Apostol, Anna Bączkowska

Website: https://dislidas.mozajka.co/

DESCRIPTION

The COST Action CA18209 NexusLinguarum (https://nexuslinguarum.eu) is glad to announce the workshop "Discourse studies and linguistic data science: Addressing challenges in interoperability, multilinguality and linguistic data processing" (DisLiDas). Due to COVID-19 restrictions, the workshop will be held in hybrid mode, so speakers and attendees can choose to participate onsite or online. See https://dislidas.mozajka.co/ for the call for papers, the submission guidelines and the committees.

Programme: available at https://dislidas.mozajka.co/

Conference aims and topics

The purpose of the workshop is to gather current research advances in discourse analysis and representation, in the context of multilinguality, from a linguistic and computational perspective. We invite submissions addressing challenges such as interoperability, linguistic linked open data (LLOD), and language processing and analysis.

Workshop topics include, but are not limited to:

Discourse and dialog annotation: Parsing and representation across languages and frameworks
Discourse markers and discourse relations (RST, PDTB, SDRT): Identification, prediction and extraction
Attitude discovery and interpretation in discourse: Appraisal and sentiment
Effects of multimodality on discourse interpretation: Intonation, gesture and text
Interoperability for multilingual language data: Challenges of rich and distributed data
Discourse data and machine learning: Methods and tools

Discourse comprises a wide variety of linguistic phenomena, such as discourse markers, discourse relations, and speaker attitude, that have been studied extensively by different communities of practice in linguistics and computation. This work has produced several theoretical frameworks (for instance, RST, SDRT, and PDTB for discourse relations, and appraisal theory for sentiment analysis) and technological approaches, such as transformer models and embeddings. Nonetheless, open issues remain with regard to interoperability, multilinguality, and language processing: in particular, the coexistence of different annotation schemas, disambiguation, the lack of training data for machine learning, the scarcity of effective methods for detecting and interpreting language phenomena, diverse vocabularies, insufficient multilingual parallel corpora of non-dialog and dialog texts, and the early stage of research on multimodality.

Discourse is also one of the central research areas of natural language processing (NLP). NLP research focuses on the formalization, identification and discovery of semantic phenomena, dialogue exchange structure, and text coherence. Its technological approaches include transformer models, word embeddings, linguistic linked open data, the construction of aligned multilingual corpora, vocabularies of language phenomena, and the like. Computational discourse research builds on the evidence that language consists not only in placing words in the right order, but also in detecting and interpreting meaning and deeper textual relations, and in organizing ideas into a logical textual flow. Linguistic approaches study phenomena related to the coherence and cohesion of discourse, examine the lexical, phrasal, syntactic, semantic and pragmatic means of expressing discourse relations, represent their roles, and build language resources for them.

Despite these advances, many problems related to interoperability, multilinguality, and language processing remain unresolved. With the growth of the Semantic Web and Linguistic Linked Data, interoperability is key to reading, interpreting and adopting language resources. The existence of different annotation schemas for encoding discourse relations hinders data exchange and reuse on the one hand, and theoretical consistency in the production of annotated corpora on the other. Ideally, an annotation model is designed to handle all the specificities of a particular dataset, yet is broad enough to be applied to other datasets. Many proposals try to achieve this balance, one of them being ISO 24617. The treatment of multilinguality is also complicated by the shortage of multilingual parallel corpora of non-dialog and dialog texts that would allow systematic contrastive studies. As for language processing, the lack of training data for machine learning, the scarcity of effective methods for detecting and interpreting language phenomena, the coexistence of diverse vocabularies, and the minimal attention paid to the contribution of tone of voice, intonation and gestures to the meaning and informative value of discourse elements still make discourse processing very challenging.

The workshop is intended as a forum for discussion among researchers interested in addressing these challenges and in advancing the state of the art in discourse studies and linguistic data science.