We have three papers accepted at the Seventh Italian Conference on Computational Linguistics, which will (hopefully) take place in Bologna in March 2021. Below is a short description of each:
Hate Speech Detection with Machine-Translated Data: The Role of Annotation Scheme, Class Imbalance and Undersampling
by Camilla Casula and Sara Tonelli
While using machine-translated data for supervised training can alleviate data sparseness problems when dealing with less-resourced languages, it is important that the source data are not only correctly translated, but also follow the same annotation scheme and possibly the same class balance as the smaller dataset in the target language. We therefore present an evaluation of hate speech detection in Italian using machine-translated data from English, comparing three settings in order to understand the impact of training size, class distribution and annotation scheme.
A Multimodal Dataset of Images and Text to Study Abusive Language
by Stefano Menini, Alessio Palmero Aprosio and Sara Tonelli
In this paper, we present a novel dataset composed of images and comments in Italian, created with teenagers in classes using a simulated scenario to raise awareness of cyberbullying phenomena. Potentially offensive comments have been collected for more than 1,000 images and manually assigned to a semantic category. Our analysis shows that the presence of human subjects, as well as the gender of the people in the pictures, triggers different types of comments, and it provides novel insight into the connection between images posted on social media and offensive messages.
The CREENDER Tool for Creating Multimodal Datasets of Images and Comments
by Alessio Palmero Aprosio, Stefano Menini and Sara Tonelli
While text-only datasets are widely produced and used for research purposes, limitations set by image-based social media platforms like Instagram make it difficult for researchers to experiment with multimodal data. We therefore developed CREENDER, an annotation tool to create multimodal datasets with images associated with semantic tags and comments, which we make freely available under the Apache 2.0 license. The software has been extensively tested with school classes, allowing us to improve the tool and add useful features not planned in the first development phase.