In this work we build upon Mirko Tavoni’s linguistic annotation of Dante’s corpus to develop a Part-of-Speech (PoS) tagger for XIII-century Italian.

The objective of the work is twofold:

  1. to provide the NLP community with a tool for the automatic processing of ancient texts, and
  2. to provide the literature community with more powerful tools for simplifying the annotation process and performing more advanced data analysis.

In D(h)ante we provide the following tools (the code is open source):

  • XSLT to convert TEI 2 XML format into CoNLL format;
  • TreeTagger and Stanford PoS taggers trained on Dante’s corpus.
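
To give a rough idea of what the TEI-to-CoNLL conversion step produces, here is a minimal sketch in Python rather than XSLT. It is not the project's stylesheet: the element names (`s`, `w`), the attribute names (`lemma`, `pos`), the tag values and the four-column layout are all assumptions made for illustration, and the real TEI 2 markup and tagset of the annotated corpus may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment. Element names, attribute names and
# PoS tags are illustrative assumptions, not the project's actual markup.
SAMPLE = """<text>
  <s><w lemma="il" pos="DET">Lo</w><w lemma="giorno" pos="NOUN">giorno</w></s>
  <s><w lemma="andare" pos="VERB">andava</w></s>
</text>"""


def tei_to_conll(xml_string):
    """Turn annotated <w> tokens into tab-separated, one-token-per-line text."""
    root = ET.fromstring(xml_string)
    lines = []
    for sentence in root.iter("s"):
        # Token ids restart at 1 in every sentence.
        for index, word in enumerate(sentence.iter("w"), start=1):
            form = (word.text or "").strip()
            lemma = word.get("lemma", "_")  # "_" marks a missing value
            pos = word.get("pos", "_")
            lines.append("\t".join([str(index), form, lemma, pos]))
        lines.append("")  # blank line closes each sentence
    return "\n".join(lines)


if __name__ == "__main__":
    print(tei_to_conll(SAMPLE))
```

A one-token-per-line layout of this kind is the general shape that PoS tagger training tools consume, although each tagger expects its own column order and separators.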



Angelo Basile: angelo.basile[AT]