
Nandan Sarkar

MIT Department: Electrical Engineering and Computer Science
Faculty Mentor: Prof. Yoon Kim

Research Supervisor: Lucas Torroba Hennigen
Undergraduate Institution: Yale University
Hometown: Nashville, Tennessee
Website: LinkedIn

Biography

Nandan Sarkar is a junior from Nashville, Tennessee, studying computer science and applied math at Yale. He is very interested in machine learning and quantum computing and is committed to pursuing a Ph.D. in these fields. At Yale, he is an undergraduate researcher in the YaleNLP lab under Professor Arman Cohan. This summer at MIT, he is a member of the Computation and Language lab under Professor Yoon Kim. Previously, Nandan conducted research at Vanderbilt on human-computer interaction and augmented reality. Since freshman year, Nandan has been deeply involved in Code Haven, a student-run organization dedicated to expanding computer science education for middle school students in New Haven. Outside of school, Nandan enjoys playing squash, watching detective shows, and playing poker with friends. He is also a die-hard soccer and football fan (of Barcelona and the Tennessee Titans, respectively) and loves country music and hot chicken.

Abstract

Enhancing Data Efficiency for Transformer Models Using
Probabilistic Context-Free Grammars

Nandan Sarkar1, Lucas Torroba Hennigen2 and Yoon Kim2
1Department of Computer Science, Yale University
2Department of Electrical Engineering and Computer Science,
Massachusetts Institute of Technology

Neural language models, particularly those based on the Transformer architecture, achieve state-of-the-art results on a wide range of natural language processing (NLP) tasks. However, optimizing these models at human-sized data scales remains a challenge. Our research investigates improving the data efficiency of language models using probabilistic context-free grammars (PCFGs). Although natural language is linear in sequence, it is inherently hierarchical, and PCFGs are well suited to modeling this structure. We therefore train a PCFG on a small corpus of approximately ten million tokens and use it to generate synthetic data for pretraining language models of varying parameter counts, which are then fine-tuned on the real data to evaluate effectiveness. Leveraging the hierarchical structure captured by the PCFG allows us to simulate more complex and varied linguistic structures and patterns, thereby enriching the pretraining data. Our methodology focuses on making the pretraining process more efficient and effective on smaller datasets, aiming both to optimize small-scale pretraining and to improve the models' ability to learn hierarchical language structure. Preliminary results indicate that this approach can significantly enhance the performance of language models trained on limited data, thus improving data efficiency. This research has the potential to provide a scalable method for efficient language model training.
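
The pipeline described above can be illustrated with a minimal sketch, shown below, in which a hand-written toy PCFG stands in for the grammar trained on the roughly ten-million-token corpus. Sentences are sampled top-down from the grammar to produce a synthetic pretraining file; a Transformer language model would then be pretrained on that file and fine-tuned on the real data. The grammar, probabilities, sample count, and file name are illustrative assumptions, not the actual experimental setup.

# Minimal sketch of the PCFG-based synthetic-data step, assuming a
# hand-written toy grammar in place of one trained on the real corpus.
import random
from nltk.grammar import PCFG, Nonterminal

# Toy grammar; recursion probabilities are below 1, so sampling terminates.
toy_grammar = PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.7] | NP PP [0.3]
    VP -> V NP [0.6] | VP PP [0.4]
    PP -> P NP [1.0]
    Det -> 'the' [0.6] | 'a' [0.4]
    N -> 'model' [0.5] | 'corpus' [0.5]
    V -> 'trains' [0.5] | 'parses' [0.5]
    P -> 'on' [1.0]
""")

def sample(grammar, symbol=None):
    """Sample one sentence top-down, expanding each nonterminal by drawing
    a production according to its probability."""
    if symbol is None:
        symbol = grammar.start()
    if not isinstance(symbol, Nonterminal):
        return [symbol]  # terminal symbol: emit the word
    productions = grammar.productions(lhs=symbol)
    weights = [p.prob() for p in productions]
    chosen = random.choices(productions, weights=weights, k=1)[0]
    words = []
    for sym in chosen.rhs():
        words.extend(sample(grammar, sym))
    return words

if __name__ == "__main__":
    random.seed(0)
    # Write a small synthetic corpus; in the actual study the PCFG is
    # learned from real text and the generated corpus is large enough
    # to pretrain a Transformer language model.
    with open("synthetic_pretraining_corpus.txt", "w") as f:
        for _ in range(1000):
            f.write(" ".join(sample(toy_grammar)) + "\n")
    # A Transformer LM would then be pretrained on this file with a
    # standard language-modeling loop and fine-tuned on the real data.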
