The paper “There’s Something New about the Italian Parliament: the IPSO Corpus” authored by Valentino Frasnelli and Alessio Palmero Aprosio has been accepted at the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

Abstract

Parliamentary debates constitute a substantial and somewhat underutilized reservoir of publicly available written content.
Despite their potential, Italian parliamentary documents remain largely unexplored and, most importantly, inaccessible in their original paper-based form.
In this paper, we attempt to transform these valuable historical documents into IPSO, a digitally readable structured corpus containing speeches, reports of the Standing Committees, and law proposals spanning 175 years of Italian history, from the issuing of the Statuto Albertino in 1848 up to the present day.
First, the PDF documents, available on the official websites of Senato della Repubblica and Camera dei Deputati, the two chambers that form the Italian Parliament, are digitized using OCR techniques.
Then, the speeches are tagged with the corresponding speakers. The final dataset is released in both textual and structured formats.
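
To give a rough idea of what the speaker-tagging step could look like, here is a minimal Python sketch. It is not the authors' method: it assumes, purely for illustration, that each speaker turn in the OCR output starts with the speaker's name in capital letters followed by a period. The names `SPEAKER_RE`, `Speech`, and `tag_speeches` are hypothetical and do not come from the paper.

```python
import re
from dataclasses import dataclass

# Assumed convention (not taken from the paper): a turn opens with an
# upper-case name, a period, and then the text of the speech.
SPEAKER_RE = re.compile(r"^([A-ZÀÈÉÌÒÙ][A-ZÀÈÉÌÒÙ' ]+)\.\s+(.*)$")


@dataclass
class Speech:
    speaker: str
    text: str


def tag_speeches(lines):
    """Group OCR text lines into speeches, each tagged with a speaker."""
    speeches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_RE.match(line)
        if match:
            # A new speaker marker opens a new speech.
            speeches.append(Speech(match.group(1).strip(), match.group(2)))
        elif speeches:
            # A continuation line belongs to the most recent speaker.
            speeches[-1].text += " " + line
        # Lines that precede any speaker marker are ignored in this sketch.
    return speeches
```

Calling `tag_speeches` on a list of text lines returns `Speech` objects that can then be serialized to a structured format. Real OCR output would likely need more robust handling, for example for recognition errors inside speaker names.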