This resource supports the automatic identification of English-Italian code-mixing in historical English travel writing about Italy. We release:
- the domain corpus, made up of travel narratives and guidebooks published between the end of the 19th century and the beginning of the 20th century. The texts are taken from the Project Gutenberg catalog;
- the data used to retrain and test two state-of-the-art tools for automatic code-mixing identification: one by King and Abney (2013) [1] and one by Schulz and Keller (2016) [2]. The code of the first tool can be downloaded from http://www-personal.umich.edu/~benking/resources/langid_release.tar.gz and the code of the second from https://github.com/sarschu/CodeSwitching;
- the Italian words/expressions extracted from the domain corpus.
This resource is available on our GitHub page: https://github.com/dhfbk/code-mixing
When using this resource, please cite:
“A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing”, by Rachele Sprugnoli, Sara Tonelli, Giovanni Moretti and Stefano Menini. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017). Rome.
[1] King, Ben, and Steven P. Abney. “Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods.” In HLT-NAACL, pp. 1110-1119. 2013.
[2] Schulz, Sarah, and Mareike Keller. “Code-switching ubique est - language identification and part-of-speech tagging for historical mixed text.” Proc. of LaTeCH (2016).
Contacts:
sprugnoli[at]fbk.eu