Yarowsky Lab

Low-Resource Languages Lab at JHU

We focus on minimally supervised ("low-resource") and massively multilingual techniques in machine learning (ML) and natural language processing (NLP). We apply these methods to machine translation, natural language understanding, speech recognition, and lexicon induction. We are also the core of the Universal Morphology (UniMorph) project and the c(ur|re)ators of the Johns Hopkins University Bible Corpus.

We are led by David Yarowsky, ACL Fellow and Treasurer, Professor of Computer Science, and member of the multi-departmental Center for Language and Speech Processing at Johns Hopkins University (JHU), who is also affiliated with the Human Language Technology Center of Excellence.

On campus? Visit us in Hackerman 226.

We are seeking talented undergraduate, PhD, and master's students interested in linguistics, machine learning, and NLP for several research projects in multilingual and dialectal NLP in speech and text modalities. If interested, please email David and cc Niyati: {yarowsky, nbafna1} at jhu dot edu.

Lab News

COLING 2024 Best Student Paper Award: When Your Cousin Has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages, by Niyati Bafna, Cristina Espana-Bonet, Josef van Genabith, Benoit Sagot, and Rachel Bawden.
Congratulations to Dr. Arya McCarthy on successfully defending his dissertation Structured Analysis and Translation of Thousands of Languages! Arya is now a research scientist at Scaled Cognition.
Papers by three lab members and graduates won awards at ACL 2023: John Hewitt, Aaron Mueller, and Arya McCarthy.
Accepted to EACL 2023: "Meeting the needs of low-resource languages: Automatic alignments via pretrained models" by Ebrahimi, McCarthy, Oncevay, Ortega, Chiruzzo, Coto-Solano, Gimenez-Lugo, and Kann.
Accepted to COLING 2022: "Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages" by Botev, McCarthy, Wu, and Yarowsky.
Two papers accepted to Findings of ACL 2022.
Accepted as a spotlight at ICLR 2022: "On the Uncomputability of Partition Functions in Energy-Based Sequence Models" by Lin and McCarthy.
Congratulations to Dr. Winston Wu, our latest graduate! His dissertation Computational Word Formation and Etymology was successfully defended on 7 January 2022. We congratulate him on his new position at the University of Michigan!
David Yarowsky has been recognized by the ACL with the "Test of Time" award for contributions to NLP with long-lasting impact on the community.

Research Interests

Multilingual NLP

Unsupervised or minimally supervised machine translation: using only small amounts of bitext, bilingual lexicons, or other available resources in data-scarce language pairs
Cross-lingual transfer: leveraging tools and resources in high-resource languages to benefit their lower-resourced family members
Code-switching
Evaluation: Building or bootstrapping evaluation resources for low-resource languages

Speech

Low-resource speech recognition and speech translation (unsupervised from audio, minimally supervised with small amounts of transcribed speech)
Dialectal robustness: Making speech systems better at handling dialectal and accent variation
Domain adaptation of ASR from unlabelled audio (+ raw in-domain text)
ASR/ST for the clinical domain

Computational Linguistics

Inflectional and derivational morphology
Word sense disambiguation
Broad-coverage core NLP tools for 800+ world languages (massively multilingual NLP)

Information Extraction

Biographic fact extraction
Characterizing communicants

Publications

We're still adding earlier papers! For now, be sure to check Google Scholar.

Current members

PhD students

Niyati Bafna

Master's students

Georgie Botev
Emre Ozgu
Jamie Scharf

Undergraduates

Kevin Kim

Alumni

(Student co-authors, including undergraduates. Bolded if David advised their dissertation or supervised their postdoc)

Arya McCarthy at Scaled Cognition
Aaron Mueller at Northeastern University
Winston Wu at University of Hawaiʻi at Hilo
Milind Agarwal, now PhD student at George Mason University with Antonis Anastasopoulos
Rachel Wicks
Amrit Nidhi
Sabrina Mielke
Garrett Nicolai at University of British Columbia
Trevor Lee at DoorDash
Oliver Adams at Atos zData
Chris Kirov at Google
Ryan Cotterell at ETH Zurich
Dylan Lewis at Peacock TV
Steven Shearing
Ryan Newell at Amazon
Lawrence Wolf-Sonkin at Google
Patrick Xia
John Hewitt, now PhD student at Stanford with Chris Manning and Percy Liang
John Sylak-Glassman at Meta
Nidhi Vyas at Apple
Sarah Mihuc
Roger Que at Google
Jin Yong Shin
Ann Irvine, Head of Data Science at Arceo
Svitlana Volkova, Senior Research Scientist at Pacific Northwest National Labs
Mozhi Zhang
Delip Rao at University of Pennsylvania
Chris Callison-Burch at University of Pennsylvania
Elliot F. Drábek at Atreca
Nikesh Garera at Treebo
Shane Bergsma
Charles Schafer at Google
Gideon Mann, Head of Data Science at Bloomberg
Silviu Cucerzan, Principal Research Manager at Microsoft Bing
Richard Wicentowski, Chair of Computer Science at Swarthmore College
Radu Florian at IBM Research
Grace Ngai at Hong Kong Polytechnic University
John Henderson at Mitre