Two long papers have been accepted at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023).
- “Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It?” by Camilla Casula and Sara Tonelli
- “Why Don’t You Do It Right? Analysing Annotators’ Disagreement in Subjective Tasks” by Marta Sandri, Elisa Leonardelli, Sara Tonelli and Elisabetta Jezek (University of Pavia).
Generation-based data augmentation (DA) has been presented in several works as a way to improve offensive language detection. However, the effectiveness of generative DA has been shown only in limited scenarios, and the potential injection of biases when using generated data to classify offensive language has not been investigated. Our aim is that of analyzing the feasibility of generative data augmentation more in-depth with two main focuses. First, we investigate the robustness of models trained on generated data in a variety of data augmentation setups, both novel and already presented in previous work, and compare their performance on four widely-used English offensive language datasets that present inherent differences in terms of content and complexity. In addition to this, we analyze models using the HateCheck suite, a series of functional tests created to challenge hate speech detection systems. Second, we investigate potential lexical bias issues through a qualitative analysis on the generated data. We find that the potential positive impact of generative data augmentation on model performance is unreliable, and generative DA can also have unpredictable effects on lexical bias.
Annotators’ disagreement in linguistic data has been recently the focus of multiple initiatives aimed at raising awareness on issues related to ‘majority voting’ when aggregating diverging annotations. Disagreement can indeed reflect different aspects of linguistic annotation, from annotators’ subjectivity to sloppiness or lack of enough context to interpret a text. In this work we first propose a taxonomy of possible reasons leading to annotators’ disagreement in subjective tasks. Then, we manually label part of a Twitter dataset for offensive language detection in English following this taxonomy, identifying how the different categories are distributed. Finally we run a set of experiments aimed at assessing the impact of the different types of disagreement on classification performance. In particular, we investigate how accurately tweets belonging to different categories of disagreement can be classified as offensive or not, and how injecting data with different types of disagreement in the training set affects performance. We also perform offensive language detection as a multi-task framework, using disagreement classification as an auxiliary task.