MIT Department: Electrical Engineering and Computer Science
Faculty Mentor: Prof. Jacob Andreas
Undergraduate Institution: University of California, Berkeley
I am currently a rising senior at UC Berkeley studying computer science and linguistics. I am also a first-generation immigrant whose parents fled Sri Lanka during its civil war and settled in the Bay Area. I love everything to do with language – how it works, how humans acquire it, and how it’s structured. I hope to pursue a Ph.D. focused on making language technology more accessible to those who speak understudied languages and uncommon dialects, so that as natural language processing becomes more ubiquitous in everyday life, disadvantaged populations are not left behind. When I’m not studying or working in the lab, I spend my time teaching a lower-division data structures class, baking, and playing Animal Crossing.
One-Shot Lexicon Learning for Low-Resource Machine Translation
Anjali Kantharuban¹ and Jacob Andreas²
¹Department of Electrical Engineering and Computer Science,
University of California, Berkeley
²Department of Electrical Engineering and Computer Science,
Massachusetts Institute of Technology
Machine translation is the task of translating natural language text automatically, without human involvement. Advanced translation models allow people to communicate across language barriers. Current methods struggle to translate phrases containing words that appear infrequently in the training data. Low-resource languages are particularly affected because they have so little data that a rare word may be seen only once. This problem has been partially addressed by the copy mechanism, which runs input tokens through a lexicon, or word-level translation table, and outputs them based on contextual information. These lexicons are built either manually, which requires human intervention, or with statistical methods, which require many training samples. We propose a model that acts as an additional layer in a translation network and generates word alignments using the syntactic structure of training pairs. This model can both generate more accurate lexicons and better adapt to out-of-vocabulary tokens in new data. Most importantly, this model can allow translation networks to generate lexicon entries for words seen only once during training, improving one-shot translation. This can allow translation systems to better serve speakers of low-resource languages and offer insight into how other tasks can be adapted to function with less data.
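To make the notion of a statistically built lexicon concrete, the following is a minimal sketch, not the model proposed here. It assumes a Dice co-occurrence score as a stand-in for the "statistical methods" mentioned in the abstract, and it uses an invented toy English-French corpus. The function names `dice_lexicon` and `best_candidates` are hypothetical, chosen for illustration only.

```python
from collections import Counter
from itertools import product


def dice_lexicon(pairs):
    """Score every (source word, target word) pair by its Dice coefficient.

    Assumption: whitespace tokenization and sentence-level co-occurrence,
    which is far simpler than any real alignment method.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_tokens, t_tokens = set(src.split()), set(tgt.split())
        src_count.update(s_tokens)                    # sentences containing each source word
        tgt_count.update(t_tokens)                    # sentences containing each target word
        joint.update(product(s_tokens, t_tokens))     # sentence pairs containing both

    scores = {}
    for (s, t), c in joint.items():
        scores.setdefault(s, {})[t] = 2 * c / (src_count[s] + tgt_count[t])
    return scores


def best_candidates(scores, word):
    """Return every target word tied for the top score, in sorted order."""
    entries = scores.get(word, {})
    if not entries:
        return []
    top = max(entries.values())
    return sorted(t for t, v in entries.items() if v == top)


# Invented toy parallel corpus (illustrative only).
pairs = [
    ("the dog sleeps", "le chien dort"),
    ("the cat sleeps", "le chat dort"),
    ("the dog eats", "le chien mange"),
    ("a bird sings", "un oiseau chante"),
]
scores = dice_lexicon(pairs)

print(best_candidates(scores, "dog"))   # seen twice: a single best entry
print(best_candidates(scores, "bird"))  # seen once: every co-occurring word ties
```

In this toy corpus, "dog" appears in two sentence pairs and resolves to a single lexicon entry, while "bird" appears in one pair whose words are all unseen elsewhere, so co-occurrence statistics alone cannot separate its three candidates. This is the one-shot setting in which additional signal, such as the syntactic structure of the training pair, would be needed.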