Yarowsky Lab
Low-Resource Languages Lab at JHU
We focus on minimally supervised ("low-resource") and massively multilingual techniques in machine learning (ML) and natural language processing (NLP). We apply these methods to machine translation, natural language understanding, speech recognition, and lexicon induction. We are also the core of the Universal Morphology (UniMorph) project and the c(ur|re)ators of the Johns Hopkins University Bible Corpus.
We are led by David Yarowsky, ACL Fellow and Treasurer, Professor of Computer Science, and member of the multi-departmental Center for Language and Speech Processing at Johns Hopkins University (JHU), who is also affiliated with the Human Language Technology Center of Excellence.
On campus? Visit us in Hackerman 226.
We are seeking talented undergraduate, PhD, and master's students interested in linguistics, machine learning, and NLP for several research projects in multilingual and dialectal NLP in speech and text modalities. If interested, please email David and cc Niyati: {yarowsky, nbafna1} at jhu dot edu.
Lab News
- COLING 2024 Best Student Paper Award: When Your Cousin Has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages, by Niyati Bafna, Cristina Espana-Bonet, Josef van Genabith, Benoit Sagot, and Rachel Bawden.
- Congratulations to Dr. Arya McCarthy on successfully defending his dissertation Structured Analysis and Translation of Thousands of Languages! Arya is now a research scientist at Scaled Cognition.
- Papers by three lab members and graduates won awards at ACL 2023: John Hewitt, Aaron Mueller, and Arya McCarthy.
- Accepted to EACL 2023: "Meeting the needs of low-resource languages: Automatic alignments via pretrained models" by Ebrahimi, McCarthy, Oncevay, Ortega, Chiruzzo, Coto-Solano, Gimenez-Lugo, and Kann.
- Accepted to COLING 2022: "Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages" by Botev, McCarthy, Wu, and Yarowsky.
- Two papers accepted to Findings of ACL 2022.
- Accepted as a spotlight at ICLR 2022: "On the Uncomputability of Partition Functions in Energy-Based Sequence Models" by Lin and McCarthy.
- Congratulations to Dr. Winston Wu, our latest graduate! His dissertation Computational Word Formation and Etymology was successfully defended on 7 January 2022. We congratulate him on his new position at the University of Michigan!
- David Yarowsky has been recognized by the ACL with the "Test of Time" award for contributions to NLP with long-lasting impact on the community.
Research Interests
Multilingual NLP
- Unsupervised or minimally supervised machine translation: using only small amounts of bitext, bilingual lexicons, or other available resources in data-scarce language pairs
- Cross-lingual transfer: leveraging tools and resources in high-resource languages to benefit their lower-resourced family members
- Code-switching
- Evaluation: Building or bootstrapping evaluation resources for low-resource languages
Speech
- Low-resource speech recognition and speech translation (unsupervised from audio, minimally supervised with small amounts of transcribed speech)
- Dialectal robustness: Making speech systems better at handling dialectal and accent variation
- Domain adaptation of ASR from unlabelled audio (+ raw in-domain text)
- ASR/ST for the clinical domain
Computational Linguistics
- Inflectional and derivational morphology
- Word sense disambiguation
- Broad-coverage core NLP tools for 800+ world languages (massively multilingual NLP)
Information Extraction
- Biographic fact extraction
- Characterizing communicants
Publications
We're still adding earlier papers! For now, be sure to check Google Scholar.
Current members
PhD students
Master's students
- Georgie Botev
- Emre Ozgu
- Jamie Scharf
Undergraduates
- Kevin Kim
Alumni
(Student co-authors, including undergraduates. Bolded if David advised their dissertation or supervised their postdoc)
- Arya McCarthy at Scaled Cognition
- Aaron Mueller at Northeastern University
- Winston Wu at University of Hawaiʻi at Hilo
- Milind Agarwal, now PhD student at George Mason University with Antonis Anastasopoulos
- Rachel Wicks
- Amrit Nidhi
- Sabrina Mielke
- Garrett Nicolai at University of British Columbia
- Trevor Lee at DoorDash
- Oliver Adams at Atos zData
- Chris Kirov at Google
- Ryan Cotterell at ETH Zurich
- Dylan Lewis at Peacock TV
- Steven Shearing
- Ryan Newell at Amazon
- Lawrence Wolf-Sonkin at Google
- Patrick Xia
- John Hewitt, now PhD student at Stanford with Chris Manning and Percy Liang
- John Sylak-Glassman at Meta
- Nidhi Vyas at Apple
- Sarah Mihuc
- Roger Que at Google
- Jin Yong Shin
- Ann Irvine, Head of Data Science at Arceo
- Svitlana Volkova, Senior Research Scientist at Pacific Northwest National Labs
- Mozhi Zhang
- Delip Rao at University of Pennsylvania
- Chris Callison-Burch at University of Pennsylvania
- Elliot F. Drábek at Atreca
- Nikesh Garera at Treebo
- Shane Bergsma
- Charles Schafer at Google
- Gideon Mann, Head of Data Science at Bloomberg
- Silviu Cucerzan, Principal Research Manager at Microsoft Bing
- Richard Wicentowski, Chair of Computer Science at Swarthmore College
- Radu Florian at IBM Research
- Grace Ngai at Hong Kong Polytechnic University
- John Henderson at Mitre