This resource supports the automatic identification of English-Italian code-mixing in historical English travel writing about Italy. We release:
- the domain corpus, made up of travel narratives and guidebooks published between the end of the 19th century and the beginning of the 20th century. The texts are extracted from the Project Gutenberg catalog;
- the data used to retrain and test two state-of-the-art tools for automatic code-mixing identification: one by King and Abney (2013) and one by Schulz and Keller (2016). To download the code of the first tool: http://www-personal.umich.edu/~benking/resources/langid_release.tar.gz; to download the second: https://github.com/sarschu/CodeSwitching;
- the Italian words/expressions extracted from the domain corpus.
This resource is available on our GitHub page: https://github.com/dhfbk/code-mixing
A paper describing this work is under review.
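To make the task concrete, here is a minimal sketch of word-level code-mixing identification: each token of an English sentence is labelled as English or Italian, and consecutive Italian tokens are merged into expressions. It is only an illustration of the input and output of the task, not the method of either tool listed above, which rely on trained models. The sketch substitutes a plain lexicon lookup, and the lexicon, the example sentence, and the function names `label_tokens` and `italian_spans` are all hypothetical and not taken from the released data.

```python
import re

# Hypothetical toy lexicon (assumption): a few Italian words of the kind
# found in English travel writing. It is NOT drawn from the released data.
ITALIAN_LEXICON = {"piazza", "palazzo", "duomo", "campanile", "osteria", "buon", "giorno"}


def label_tokens(sentence):
    """Label each alphabetic token 'it' if it is in the toy lexicon, else 'en'."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", sentence)
    return [(tok, "it" if tok.lower() in ITALIAN_LEXICON else "en") for tok in tokens]


def italian_spans(labels):
    """Merge consecutive 'it' tokens into multi-word expressions."""
    spans, current = [], []
    for tok, lang in labels:
        if lang == "it":
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a span that ends the sentence
        spans.append(" ".join(current))
    return spans


if __name__ == "__main__":
    sentence = "We crossed the piazza and were greeted with a cheerful buon giorno."
    print(italian_spans(label_tokens(sentence)))
```

A lexicon lookup of this kind cannot handle words shared by both languages or historical spelling variants, which is one reason to use retrained statistical tools instead.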
King, Ben, and Steven P. Abney. "Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods." In HLT-NAACL, pp. 1110-1119. 2013.
Schulz, Sarah, and Mareike Keller. "Code-switching ubique est - language identification and part-of-speech tagging for historical mixed text." In Proc. of LaTeCH. 2016.