Four papers accepted at CLiC-it 2024

Four of our papers have been accepted at the Tenth Italian Conference on Computational Linguistics, which will take place in Pisa on 4-6 December 2024. A short description of each follows:

Benchmarking the Semantics of Taste: Towards the Automatic Extraction of Gustatory Language

Teresa Paccosi and Sara Tonelli

In this paper, we present a benchmark containing texts manually annotated with gustatory semantic information. We employ a FrameNet-like approach previously tested to address olfactory language, which we adapt to capture gustatory events. We then explore the data in the benchmark to show the insights this type of approach can bring, focusing on emotional valence across text genres. Finally, we present a supervised system trained on the taste benchmark for the extraction of gustatory information from historical and contemporary texts.

Nesciun Lengaz Lascià Endò: Machine Translation for Fassa Ladin

Giovanni Valer, Nicolò Penzo, Jacopo Staiano

Despite the remarkable success recently obtained by Large Language Models, a significant gap in performance still exists when dealing with low-resource languages, which are often poorly supported by off-the-shelf models. In this work we focus on Fassa Ladin, a Rhaeto-Romance linguistic variety spoken by fewer than ten thousand people in the Dolomitic regions, and set out to build the first bidirectional Machine Translation system supporting Italian, English, and Fassa Ladin. To this end, we collected a small but representative corpus comprising 1135 parallel sentences in these three languages and spanning five domains. We evaluated several models, including the open (Meta AI’s No Language Left Behind, NLLB-200) and commercial (OpenAI’s gpt-4o) state-of-the-art, and found that both obtain unsatisfactory performance. We therefore proceeded to fine-tune the NLLB-200 model on the collected data, using different approaches. We report a comparative analysis of the results obtained, showing that 1) jointly training for multilingual translation (Ladin-Italian and Ladin-English) significantly improves performance, and 2) knowledge transfer is highly effective (e.g., leveraging similarities between Ladin and Friulian), highlighting the importance of targeted data collection and model adaptation in the context of low-resource/endangered languages for which little textual data is available.

KEVLAR: the Complete Resource for EuroVoc Classification of Legal Documents

Lorenzo Bocchi, Camilla Casula and Alessio Palmero Aprosio

The use of Machine Learning and Artificial Intelligence in the Public Administration (PA) has increased in recent years. In particular, recent guidelines proposed by various governments for the classification of documents released by the PA suggest using the EuroVoc thesaurus. In this paper, we present KEVLAR, an all-in-one solution for performing this task on acts belonging to the Public Administration. First, we create a collection of 8 million documents in 24 languages, tagged with EuroVoc labels, taken from EUR-Lex, the web portal of the European Union legislation. Then, we train different pre-trained BERT-based models, comparing the performance of base models with domain-specific and multilingual ones. We release the corpus, the best-performing models, and a Docker image containing the source code of the trainer, the REST API, and the web interface. This image can be employed out-of-the-box for document classification.

(Not) All You Need for EuroVoc Multi-Label Classification of European Laws

Lorenzo Bocchi and Alessio Palmero Aprosio

Machine Learning and Artificial Intelligence approaches within Public Administration (PA) have grown significantly in recent years. Specifically, new guidelines from various governments recommend employing the EuroVoc thesaurus for the classification of documents issued by the PA. In this paper, we explore some methods to perform document classification in the legal domain, in order to mitigate the length limitation for input texts in BERT models. We first collect data from the European Union, already tagged with the aforementioned taxonomy. Then we reorder the sentences included in the text, with the aim of moving the most informative part of the document to the beginning of the text. Results show that the title and the context are both important, although the order of the text may not be. Finally, we release on GitHub both the dataset and the source code used for the experiments.
