Thursday, 25 July 2019 | 13:00 to 14:00
It is standard practice in speech & language technology to rank systems according to performance on a test set held out for evaluation. However, few researchers apply statistical tests to determine whether differences in performance are likely to arise by chance, and few examine the stability of system rankings across multiple training-testing splits. We conduct replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018, each of which reports state-of-the-art performance on a widely used “standard split”. We fail to reliably reproduce some rankings using randomly generated splits. We use this result to argue for a novel evaluation technique we call Bonferroni-corrected random split hypothesis testing. (This work was performed in collaboration with Steven Bedrick.)
Kyle Gorman is an assistant professor of linguistics at the Graduate Center, City University of New York, and director of the master's program in computational linguistics. He also works as a software engineer at Google Research. Before that, he was an assistant professor at the Center for Spoken Language Understanding, Oregon Health & Science University in Portland. He holds a PhD in linguistics from the University of Pennsylvania, where he was advised by Charles Yang. His research interests include phonology & morphology and speech & language technology, particularly finite-state methods. He is a maintainer of OpenFst and OpenGrm and the creator of Pynini.


Fondazione Bruno Kessler, Room 211, Povo