We focus on minimally supervised ("low-resource") and massively multilingual techniques in machine learning (ML) and natural language processing (NLP). We apply these methods to machine translation, speech recognition, lexicon induction, and historical linguistics. We are also the core of the Universal Morphology (UniMorph) project and the c(ur|re)ators of the Johns Hopkins University Bible Corpus.
We are led by David Yarowsky, ACL Fellow and Treasurer and Professor of Computer Science at Johns Hopkins University (JHU). He is a member of the multi-departmental Center for Language and Speech Processing and is also affiliated with the Human Language Technology Center of Excellence.
On campus? Visit us in Hackerman 226.
- Papers by three members and alumni won awards at ACL 2023: John Hewitt, Aaron Mueller, and Arya McCarthy.
- Accepted to EACL 2023: "Meeting the needs of low-resource languages: Automatic alignments via pretrained models" by Ebrahimi, McCarthy, Oncevay, Ortega, Chiruzzo, Coto-Solano, Gimenez-Lugo, and Kann.
- Accepted to COLING 2022: "Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages" by Botev, McCarthy, Wu, and Yarowsky.
- Two papers accepted to Findings of ACL 2022.
- Accepted as a spotlight to ICLR 2022: "On the Uncomputability of Partition Functions in Energy-Based Sequence Models" by Lin and McCarthy.
- Congratulations to Dr. Winston Wu, our latest graduate! He successfully defended his dissertation, Computational Word Formation and Etymology, on 7 January 2022, and has begun a new position at the University of Michigan.
- David Yarowsky has been recognized by the ACL with the "Test of Time" award for contributions to NLP with long-lasting impact on the community.
Core Learning Techniques
- Cross-language information projection
- Cross-domain knowledge transfer
- Active learning and human computation
- Creative bootstrapping from multiple knowledge sources
- Translation discovery without aligned bilingual text (unsupervised machine translation)
- Exploiting language universals and language family relationships (linguistic typology)
Natural Language Processing
- Inflectional and derivational morphology
- Word sense disambiguation
- Broad-coverage core NLP tools for 800+ world languages (massively multilingual NLP)
- Biographic fact extraction
- Characterizing communicants
We're still adding earlier papers! For now, be sure to check Google Scholar.
- Georgie Botev
- Emre Ozgu
- Jamie Scharf
- Kevin Kim
(Student co-authors, including undergraduates. Names are bolded if David advised their dissertation or supervised their postdoc.)
- Winston Wu at University of Michigan
- Milind Agarwal, now a PhD student at George Mason University with Antonis Anastasopoulos
- Rachel Wicks
- Amrit Nidhi
- Sabrina Mielke
- Garrett Nicolai at University of British Columbia
- Trevor Lee at DoorDash
- Oliver Adams at Atos zData
- Chris Kirov at Google
- Ryan Cotterell at ETH Zurich
- Dylan Lewis at Peacock TV
- Steven Shearing
- Ryan Newell at Amazon
- Lawrence Wolf-Sonkin at Google
- Patrick Xia
- John Hewitt, now a PhD student at Stanford with Chris Manning and Percy Liang
- John Sylak-Glassman at Meta
- Nidhi Vyas at Apple
- Sarah Mihuc
- Roger Que at Google
- Jin Yong Shin
- Ann Irvine, Head of Data Science at Arceo
- Svitlana Volkova, Senior Research Scientist at Pacific Northwest National Laboratory
- Mozhi Zhang
- Delip Rao at University of Pennsylvania
- Elliot F. Drábek at Atreca
- Nikesh Garera at Treebo
- Shane Bergsma
- Charles Schafer at Google
- Gideon Mann, Head of Data Science at Bloomberg
- Silviu Cucerzan, Principal Research Manager at Microsoft Bing
- Richard Wicentowski, Chair of Computer Science at Swarthmore College
- Radu Florian at IBM Research
- Grace Ngai at Hong Kong Polytechnic University
- John Henderson at MITRE