Blog Post on the STSM “Building Language Corpora for Albanian”, by Dr. Manjola Zaçellari at Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
In today’s world, where language plays a vital role in communication and understanding, the need for comprehensive language resources is more important than ever. However, for low-resource languages like Albanian, building language corpora can be a challenging task. In this blog post, I will share my experience during my short-term scientific mission (STSM) at Friedrich-Alexander-Universität Erlangen-Nürnberg in Germany (2-6 October 2023), where I worked with my host, Dr. Besim Kabashi, on building language corpora for Albanian for research and educational purposes.
The aim was to investigate and analyze the syntactic annotations in an existing Albanian framework and to consider ways to enrich and enhance them for both research and educational purposes. Our work focused on the Albanian treebanks in Universal Dependencies (UD).
During my stay at the Friedrich-Alexander-Universität Erlangen-Nürnberg, my primary focus was to refine the existing UD treebanks for Albanian (https://universaldependencies.org/treebanks/sq_tsa/index.html; https://universaldependencies.org/treebanks/aln_gps/index.html).
These treebanks serve as valuable resources for linguistic analysis and natural language processing tasks. By adding new features, specifically syntactic relations, we aimed to enhance the quality and accuracy of the treebanks.
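To give a concrete picture of what syntactic relations look like in a UD treebank, here is a minimal Python sketch. UD treebanks are distributed in the CoNLL-U format, where each token is a line of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sample sentence below is a hand-written illustration and is not taken from the Albanian treebanks, and the function names `read_tokens` and `count_relations` are hypothetical helpers chosen for this example, not part of any tool we used.

```python
from collections import Counter

# Hand-written illustrative sentence in CoNLL-U format.
# Assumption: this annotation is for demonstration only and is NOT
# copied from the UD Albanian treebanks.
SAMPLE = (
    "# text = Unë lexoj libra.\n"
    "1\tUnë\tunë\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlexoj\tlexoj\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tlibra\tlibër\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_tokens(conllu_text):
    """Yield (form, upos, head, deprel) for each regular token line."""
    for line in conllu_text.splitlines():
        # Skip sentence separators (blank lines) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only plain integer IDs; this skips multiword token
        # ranges such as "1-2" and empty nodes such as "1.1".
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield cols[1], cols[3], int(cols[6]), cols[7]


def count_relations(conllu_text):
    """Count how often each dependency relation (DEPREL) occurs."""
    return Counter(deprel for _, _, _, deprel in read_tokens(conllu_text))


if __name__ == "__main__":
    for relation, count in sorted(count_relations(SAMPLE).items()):
        print(relation, count)
```

Run over a whole treebank file, a count like this gives a quick overview of which relations are already annotated and how frequent each one is, which is a useful first check before adding or revising relations.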
In addition to refining the existing treebanks, we recognized the need to expand the corpus size for Albanian. A larger corpus allows for more comprehensive analysis and provides researchers and educators with a broader range of data to work with.
Throughout my short-term scientific mission, I had the privilege of collaborating with Dr. Besim Kabashi, an expert in linguistics and natural language processing. This collaborative effort not only enriched my understanding of the subject but also paved the way for future research and development in the field of Albanian language resources. Moving forward, we aim to continue refining the existing treebanks while also exploring new avenues for data collection and annotation. Moreover, this STSM has served as a starting point for my ongoing collaboration with Dr. Kabashi on a new, larger and more precise UD treebank for the Albanian language.