We manually annotated a corpus of 100,000 tokens drawn from a collection of English travel writings about Italy (both travel reports and guidebooks) published in the second half of the 19th century and in the 1930s. The corpus is annotated in BIO format, using the tag LOCATION to mark all named entities (including nicknames) that refer to: (i) geographical locations; (ii) political locations; (iii) functional locations.
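To illustrate how BIO annotation with a single LOCATION tag can be read back into place-name spans, here is a minimal Python sketch. It assumes the labels are spelled `B-LOCATION`, `I-LOCATION` and `O`; the exact label strings and file layout of the released corpus may differ, so check the repository before relying on them. The function name `extract_locations` and the example sentence are hypothetical and were written for illustration only; they are not taken from the corpus.

```python
from typing import List, Tuple


def extract_locations(tagged: List[Tuple[str, str]]) -> List[str]:
    """Collect LOCATION spans from (token, tag) pairs in BIO format.

    Assumed labels (not confirmed by the corpus documentation):
    B-LOCATION opens a span, I-LOCATION continues it, O is outside.
    """
    spans: List[str] = []
    current: List[str] = []
    for token, tag in tagged:
        if tag == "B-LOCATION":
            # A new span starts; close any span that is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-LOCATION" and current:
            # Continue the span that is currently open.
            current.append(token)
        else:
            # "O" tag, or a stray I- tag with no open span: close and reset.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Invented example sentence (not from the corpus), including a nickname.
sentence = [
    ("We", "O"), ("left", "O"), ("the", "O"),
    ("Eternal", "B-LOCATION"), ("City", "I-LOCATION"),
    ("for", "O"), ("Arezzo", "B-LOCATION"), (".", "O"),
]
print(extract_locations(sentence))  # ['Eternal City', 'Arezzo']
```

In this sketch a multi-token nickname is kept as one span because its second token carries the continuation label, while two adjacent place names stay separate because each begins with its own `B-` label.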

The corpus has been used to retrain the Stanford NER module and to train new models based on the neural architecture proposed by Reimers and Gurevych, which was tested with several pre-trained word embeddings.

The resource is available on GitHub, together with our best model and additional information: https://github.com/dhfbk/Detection-of-place-names-in-historical-travel-writings

Please cite the following paper:

  • Rachele Sprugnoli. 2018. Arretium or Arezzo? A Neural Approach to the Identification of Place Names in Historical Texts. Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). Torino, Italy, December 10-12, 2018.