On the STSM “Towards a multilingual lexicon of discourse markers”, completed by Purificação Silvano at Mozaika, Sofia, Bulgaria
The two main objectives of the STSM were to: (i) work on the construction of a multilingual semantic vocabulary of discourse markers as LLOD, and (ii) prepare a roadmap for future research.
In line with the work we have been pursuing within working group 4.2.2, we developed a vocabulary of discourse markers for English, Portuguese and Bulgarian. To this end, we used the multilingual parallel corpus that our working group had previously built from publicly available TED Talk transcripts to study discourse markers. The corpus covers nine languages: Bulgarian, Lithuanian, German, European Portuguese, Hebrew, Romanian, Polish and Macedonian, with English as the pivot language. To represent the meaning of the discourse markers, we applied an annotation scheme that comprises discourse relations from ISO 24617-8 with a plug-in to ISO 24617-2 for communicative functions. This step will allow us to formalize the ISO-based annotation scheme in a Web Ontology Language (OWL) ontology for publishing and integrating data, to convert the annotations to RDF, to link them with the ontology, and to perform conjoint queries.
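As a rough illustration of the annotation-to-RDF conversion mentioned above, the following sketch serializes a single discourse-marker annotation as N-Triples. The namespace `http://example.org/dm/`, the property names `markerForm`, `discourseRelation` and `communicativeFunction`, and the sample annotation are hypothetical placeholders chosen for illustration; they are not the vocabulary, ontology IRIs or data produced during the STSM.

```python
# Hypothetical base namespace; the actual ontology IRIs are not given in this report.
BASE = "http://example.org/dm/"


def iri(local):
    """Wrap a local name in the (assumed) base namespace as an N-Triples IRI."""
    return f"<{BASE}{local}>"


def literal(text, lang=None):
    """Build an N-Triples literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def annotation_to_ntriples(ann):
    """Turn one annotation record (a dict) into a list of N-Triples lines."""
    subject = iri(f"annotation/{ann['id']}")
    triples = [
        # Surface form of the discourse marker, tagged with its language.
        (subject, iri("markerForm"), literal(ann["marker"], ann["lang"])),
        # Discourse relation label (illustrative; property name is assumed).
        (subject, iri("discourseRelation"), iri(f"relation/{ann['relation']}")),
    ]
    # The communicative function is optional in this sketch.
    if ann.get("function"):
        triples.append(
            (subject, iri("communicativeFunction"), iri(f"function/{ann['function']}"))
        )
    return [f"{s} {p} {o} ." for s, p, o in triples]


# Illustrative record, not taken from the corpus.
example = {"id": "en-001", "marker": "but", "lang": "en",
           "relation": "Contrast", "function": None}
lines = annotation_to_ntriples(example)
print("\n".join(lines))
```

Once annotations for several languages are expressed against the same relation IRIs, a single query over the shared property can retrieve, for instance, all markers annotated with a given relation across English, Portuguese and Bulgarian.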
The STSM achieved its planned goals and expected outcomes. Thus, we were able to:
- test the reliability of a comprehensive, interoperable taxonomy of discourse markers, able to represent not only their semantic meaning but also their pragmatic meaning, on a sample of the multilingual parallel corpus created by working group 4.2.2;
- create a parallel vocabulary of discourse markers in three languages, English, Portuguese and Bulgarian, two of which are low-resourced languages;
- develop a taxonomy prototype to be applied to other datasets in different languages (Lithuanian, German, Hebrew, Romanian, Polish, Macedonian, Italian), some of which are also under-resourced;
- perform a quantitative and qualitative study of the semantic and pragmatic role of discourse markers across three languages;
- and devise a plan for a proposal of future research that can follow up on the investigation we have been conducting in working group 4.2.2.
The overall assessment of this STSM in the vibrant city of Sofia is very positive: it advanced the work of the use case for linguistic data science while contributing to the scientific objectives of the NexusLinguarum Action.
Hyperlink to the company “Mozaika”, the host of the STSM: https://mozajka.co/