Researching discourse markers expressing opinion with machine learning techniques in a multilingual corpus

Dates: August 16, 2021 to August 30, 2021

Duration: 14 days

Applicant: Giedre Valunaite Oleskeviciene

Venue: Sofia, Bulgaria

Host Institution: Mozaika, Ltd.

Host: Dr Mariana Damova

Involved WGs: WG4

DESCRIPTION

Sofia STSM by Giedre Valunaite Oleskeviciene

The STSM “Researching discourse markers expressing opinion with machine learning techniques in a multilingual corpus”, held at the Mozaika Research Institute in Sofia from 16/08/2021 to 30/08/2021, marked a breakthrough in use case task 4.2.2 (Social sciences).

The purpose of the STSM was to provide linguistic processing for multilingual corpus data in English, Bulgarian and Lithuanian, preparing the data for research on the automatic detection of discourse markers expressing opinion by means of machine learning.

First, we enriched the social media texts based on the TED-EHL parallel corpus with 4 languages, so that the multilingual corpus now contains alignments of Lithuanian, Bulgarian, Hebrew, Portuguese, Macedonian, and German with English as the pivot language and has a size of 1.3 million sentences. Then a part of the enriched multilingual corpus, comprising 2428 aligned English-Bulgarian-Lithuanian sentences that contain multiword expressions (MWEs) used either as discourse markers or as content expressions, was manually annotated (1 or 0) in preparation for the machine learning experiments. The manual annotation refined the data for the subsequent work aimed at the automatic detection of discourse markers expressing opinion. Thus, during the STSM we produced the gold standard that was later used for machine learning.

On the social side, I learnt a lot about the rich culture of the country, and as a result I now associate Sofia (Bulgaria) with the light scent of roses, soft mountain lines, and the warm character and friendliness of its people.