Linguistic resource

The corpus of Alcide De Gasperi's public documents is a comprehensive collection of documents issued between 1901 and 1954, which had been previously published in four volumes by Il Mulino but were not machine-readable. Our repository contains all documents in three formats: txt, XML and tab-separated. Raw txt files contain only the body of the documents, and may be straightforwardly used to extract embeddings or topics.

We developed a WhatsApp dataset to study cyberbullying among Italian students aged 12-13 in the context of the CREEP EIT project.
The corpus of WhatsApp chats consists of 14,600 tokens divided into 10 chats. All the chats were annotated by two annotators using the CAT web-based tool and following the same guidelines.

We manually annotated a corpus of 100,000 tokens taken from a collection of English travel writings (both travel reports and guidebooks) about Italy published in the second half of the XIX century and the ’30s of the XX century. The corpus is annotated in BIO format using the tag LOCATION to mark all named entities (including nicknames) referring to: (i) geographical locations; (ii) political locations; (iii) functional locations.
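As an illustration of how BIO-tagged LOCATION annotations can be read, here is a minimal sketch. It assumes one `token<TAB>tag` pair per row and the tag names `B-LOCATION`, `I-LOCATION` and `O`; the actual file layout of the corpus may differ, and the function name `extract_locations` and the sample sentence are hypothetical.

```python
from typing import Iterable, List


def extract_locations(rows: Iterable[str]) -> List[str]:
    """Collect LOCATION spans from 'token<TAB>tag' rows in BIO format.

    Assumed (not confirmed by the corpus description): tags are spelled
    B-LOCATION / I-LOCATION / O, and blank rows separate sentences.
    """
    spans: List[str] = []
    current: List[str] = []
    for row in rows:
        row = row.strip()
        if not row:
            # Blank row: treat as a sentence boundary and close any open span.
            if current:
                spans.append(" ".join(current))
                current = []
            continue
        token, tag = row.split("\t")
        if tag == "B-LOCATION":
            # A new entity begins; close the previous one first.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-LOCATION" and current:
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # Any other tag (or a stray I- tag) ends the open span.
            if current:
                spans.append(" ".join(current))
                current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Invented sample sentence, including a nickname used as a location name.
sample = [
    "We\tO",
    "reached\tO",
    "the\tO",
    "Eternal\tB-LOCATION",
    "City\tI-LOCATION",
    "from\tO",
    "Florence\tB-LOCATION",
    ".\tO",
]
print(extract_locations(sample))  # ['Eternal City', 'Florence']
```

Grouping tokens into spans this way keeps multi-word names and nicknames together as single entities.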

WHAT: a collection of travel writings - non-fictional narratives (reports, diaries, letters) and guidebooks - about Italy written by native English authors and published between the country's unification and the beginning of the 1930s;

WHY: travel writings can support historical, social, ethnographic, and architectural research but they are also a source of curious information about life in Italy in the past;

We have created a GitHub repository that contains:

  • annotation guidelines designed to detect and classify event mentions in texts;
  • a corpus of historical texts annotated with events (span + class) following the previously mentioned guidelines.

Due to space limitations, the following resources are in an external Google Drive folder (

This resource contains two datasets. Each dataset consists of pairs of arguments from Nixon's and Kennedy's speeches related to a topic and annotated with a relation of "attack", "support" or "no_relation".
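One way to represent such annotated pairs in code is sketched below. Only the three relation labels come from the description above; the class name `ArgumentPair`, its field names, the helper `relation_counts` and the placeholder texts are hypothetical and do not reflect the actual file layout of the release.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# The three relation labels named in the dataset description.
RELATIONS = {"attack", "support", "no_relation"}


@dataclass(frozen=True)
class ArgumentPair:
    """A pair of arguments with its annotated relation (hypothetical layout)."""

    first: str
    second: str
    relation: str

    def __post_init__(self) -> None:
        # Reject labels outside the annotation scheme.
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation!r}")


def relation_counts(pairs: Iterable[ArgumentPair]) -> Counter:
    """Count how many pairs carry each relation label."""
    return Counter(pair.relation for pair in pairs)


# Placeholder texts, not taken from the dataset.
pairs = [
    ArgumentPair("placeholder argument A", "placeholder argument B", "support"),
    ArgumentPair("placeholder argument C", "placeholder argument D", "attack"),
    ArgumentPair("placeholder argument E", "placeholder argument F", "support"),
    ArgumentPair("placeholder argument G", "placeholder argument H", "no_relation"),
]
print(relation_counts(pairs))
```

Checking the label distribution first is useful because relation datasets are often imbalanced, which affects how a classifier trained on them should be evaluated.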

The release includes the following versions of the dataset:

Full_dataset: A collection of 1907 pairs of arguments by Nixon and Kennedy from the 1960 presidential campaign. Each pair has been manually annotated with a relation of "attack", "support" or "no_relation".

The present resource is about the automatic identification of English-Italian code-mixing in English historical travel writings about Italy. We release:

The Content Types Dataset is a new resource aiming at promoting the analysis of texts as a composition of units with specific semantic and functional roles. By developing this dataset we also introduce a new NLP task for the automatic classification of Content Types. The identification of Content Types may improve the performance of more complex NLP tasks by targeting the portions of the documents that are more relevant.

The current release (February 2017) includes:

SIMPITIKI is a simplification corpus for Italian. It consists of two sets of simplified pairs: the first is harvested from the Italian Wikipedia in a semi-automatic way; the second is manually annotated sentence by sentence from documents in the administrative domain.

This resource includes three datasets. Each dataset consists of pairs of snippets related to a topic and annotated as in agreement or disagreement.

The three datasets are:

1960 Elections Dataset: A collection of 350 pairs of snippets (5 blocks of 3 sentences each) by Nixon and Kennedy from the 1960 presidential campaign. Each pair is manually annotated with an agreement/disagreement relation, sentiment, and the similarity of the proposed solution with respect to the debated topic.