Building a Multilingual Taxonomy of Olfactory Terms with Timestamps
Stefano Menini, Teresa Paccosi, Serra Sinem Tekiroglu, Sara Tonelli
Abstract. Olfactory references play a crucial role in our memory and, more generally, in our experiences, since researchers have shown that smell is the sense most directly connected with emotions. Nevertheless, only a few works in NLP have tried to capture this sensory dimension from a computational perspective. One of the main challenges is the lack of a systematic and consistent taxonomy of olfactory information in which concepts are also organised from a multilingual perspective. WordNet represents a valuable starting point in this direction, and it can be semi-automatically extended by taking advantage of Google n-grams and of existing language models.
In this work we describe the process that has led to the semi-automatic development of a taxonomy for olfactory information in four languages (English, French, German and Italian), detailing the different steps and the intermediate evaluations. Besides being multilingual, the taxonomy also includes temporal marks for olfactory terms, making it a valuable resource for historical content analysis. The resource has been released and is freely available.
Resource available on GitHub.
Work Hard, Play Hard: Collecting Acceptability Annotations through a 3D Game
Federico Bonetti, Elisa Leonardelli, Daniela Trotta, Raffaele Guarasci, Sara Tonelli
Abstract. Corpus-based studies on acceptability judgements have always stimulated the interest of researchers, in both theoretical and computational fields. Some approaches have focused on spontaneous judgements collected through different types of tasks, others on data annotated through crowd-sourcing platforms, and still others have relied on expert-annotated data available from the literature. The release of the CoLA corpus, a large-scale corpus of sentences extracted from linguistic handbooks as examples of acceptable and unacceptable phenomena in English, has revived interest in the reliability of the judgements of linguistic experts vs. non-experts. Several issues are still open. In this work, we contribute to this debate by presenting a 3D video game that was used to collect acceptability judgements on Italian sentences. We analyse the resulting annotations in terms of agreement among players and by comparing them with experts’ acceptability judgements. We also discuss different game settings to assess their impact on participants’ motivation and engagement. The final dataset, containing 1,062 sentences that were selected based on majority voting, is released for future research and comparisons.
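The majority-voting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the label strings, the sentence identifiers and the strict-majority rule (more than half of the votes) are all assumptions made for illustration.

```python
from collections import Counter


def majority_label(judgements):
    """Return the label chosen by more than half of the players, else None.

    Assumption: a strict majority is required; ties are discarded.
    """
    if not judgements:
        return None
    label, count = Counter(judgements).most_common(1)[0]
    return label if count > len(judgements) / 2 else None


def aggregate(annotations):
    """Keep only the sentences for which a strict-majority label exists.

    `annotations` maps a (hypothetical) sentence id to the list of labels
    given by individual players.
    """
    result = {}
    for sentence_id, judgements in annotations.items():
        label = majority_label(judgements)
        if label is not None:
            result[sentence_id] = label
    return result


if __name__ == "__main__":
    # Hypothetical player judgements, not data from the released dataset.
    votes = {
        "s1": ["acceptable", "acceptable", "unacceptable"],
        "s2": ["acceptable", "unacceptable"],
    }
    print(aggregate(votes))
```

Under this rule a sentence whose votes are evenly split, such as the hypothetical `s2`, is left out of the aggregated result rather than being assigned an arbitrary label.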
KIND: an Italian Multi-Domain Dataset for Named Entity Recognition
Teresa Paccosi, Alessio Palmero Aprosio
Abstract. In this paper we present KIND, an Italian dataset for Named Entity Recognition. It contains more than one million tokens, with annotations covering three classes: persons, locations, and organizations. Most of the dataset (around 600K tokens) contains manual gold annotations in three different domains: news, literature, and political discourse. Texts and annotations are freely downloadable from the GitHub repository.