Blog Post on the STSM “Building language corpora for educational and research purposes”, by Enriketa Sogutlu at Friedrich-Alexander-Universität, Erlangen-Nürnberg, Germany
The STSM “Building language corpora for educational and research purposes” took place from October 2 to October 6, 2023 at Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany, under the supervision of Dr. Besim Kabashi at the Computational Corpus Linguistics Group. The main objective of this STSM was to contribute to ongoing work on building corpora for a low-resource language such as Albanian.
The aim was to investigate and analyze syntactic annotations from an existing Albanian framework and to consider ways to enrich and/or enhance them for both research and educational purposes. Our work focused on the Albanian treebank in Universal Dependencies (UD).
The main results of the STSM can be summarized as follows:
- First, an introductory workshop on building linguistic corpora and the steps involved was organized.
- Then the Universal Dependencies treebank (https://universaldependencies.org/) was explored in terms of its component parts: tokenization and word segmentation, POS features, morphology, and syntax. As syntax was the focus of the visit, its description in the Albanian treebank (https://universaldependencies.org/sq/index.html) and the six components included in it were explored and analyzed.
- Out of the 60 sentences in the Albanian treebank (https://github.com/UniversalDependencies/UD_Albanian-TSA/blob/master/sq_tsa-ud-test.conllu), 10 were randomly selected. Sentences judged to need improvement were double-checked and discussed: first the issues identified in the existing annotation, then ways to improve them.
- A final discussion of the results showed that the most common issues were the need to enrich the syntactic content and categories, together with their explanations in terms of sentence and clause types, and the need to reconsider the classification of the verb “to be”. We also discussed the need for, and the possibility of, creating a new Albanian dataset from scratch.
- Finally, examples from the Albanian treebank, including the challenges identified and ways to address them, will be used for educational purposes in teaching syntactic dependencies, mostly through comparative analysis of the English and Albanian datasets included in UD.
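For readers who would like to try a similar exercise, the random selection of sentences and a first look at how “to be” is annotated could be sketched roughly as follows. This is only an illustration, not the procedure we actually used during the visit. It assumes the standard ten-column, tab-separated CoNLL-U layout used by UD treebanks. The two toy sentences are invented English examples, not taken from the Albanian treebank, and the helper names (`parse_conllu`, `pick_sentences`, `copulas`) are hypothetical.

```python
import random

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def row(*cols):
    """Build one tab-separated CoNLL-U token line."""
    return "\t".join(cols)


# Invented toy data (NOT from the Albanian treebank), for illustration only.
TOY = "\n".join([
    "# text = She is happy.",
    row("1", "She", "she", "PRON", "_", "_", "3", "nsubj", "_", "_"),
    row("2", "is", "be", "AUX", "_", "_", "3", "cop", "_", "_"),
    row("3", "happy", "happy", "ADJ", "_", "_", "0", "root", "_", "_"),
    row("4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"),
    "",
    "# text = Dogs bark.",
    row("1", "Dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"),
    row("2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"),
    row("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    "",
])


def parse_conllu(text):
    """Split CoNLL-U text into sentences (comment lines plus token dicts)."""
    sentences, comments, tokens = [], [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:  # the file may not end with a blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


def pick_sentences(sentences, k, seed=0):
    """Randomly pick up to k sentences; the seed makes the choice repeatable."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


def copulas(sentence):
    """Return the tokens attached with the 'cop' (copula) relation."""
    return [t for t in sentence["tokens"] if t["deprel"] == "cop"]


if __name__ == "__main__":
    for sent in pick_sentences(parse_conllu(TOY), k=10):
        print(sent["comments"][0], "| copulas:",
              [t["form"] for t in copulas(sent)])
```

To work with the real data, the text of a downloaded `.conllu` file could be passed to `parse_conllu` in place of `TOY`. Fixing the seed is a deliberate choice here, so that the same sentences can be revisited in a later discussion.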
My visit to Friedrich-Alexander-Universität in Erlangen was extremely enriching. I had the opportunity to become more familiar with Dr. Besim Kabashi’s work on Albanian and his extensive resources.