3 Questions: Jacob Andreas on large language models

Jacob Andreas is broadly interested in using language as a communicative and computational tool.

The CSAIL scientist pushes forward natural language processing research by creating state-of-the-art machine learning models and investigating how language can enhance other types of artificial intelligence.

Rachel Gordon | MIT CSAIL

May 11, 2023

Words, data, and algorithms combine,
An article about LLMs, so divine.
A glimpse into a linguistic world,
Where language machines are unfurled.

It was a natural inclination to task a large language model (LLM) like ChatGPT with creating a poem that delves into the topic of large language models, and then to use that poem as an introductory piece for this article.

So how exactly did said poem get all stitched together in a neat package, with rhyming words and little morsels of clever phrases?

We went straight to the source: MIT assistant professor and CSAIL principal investigator Jacob Andreas, whose research focuses on advancing the field of natural language processing, in both developing cutting-edge machine learning models and exploring the potential of language as a means of enhancing other forms of artificial intelligence. This includes pioneering work in areas such as using natural language to teach robots, and leveraging language to enable computer vision systems to articulate the rationale behind their decision-making processes. We probed Andreas regarding the mechanics, implications, and future prospects of the technology at hand.

Q: Language is a rich ecosystem ripe with subtle nuances that humans use to communicate with one another: sarcasm, irony, and other forms of figurative language. There are numerous ways to convey meaning beyond the literal. Is it possible for large language models to comprehend the intricacies of context? What does it mean for a model to achieve “in-context learning”? Moreover, how do multilingual transformers process variations and dialects of different languages beyond English?

A: When we think about linguistic contexts, these models are capable of reasoning about much, much longer documents and chunks of text more broadly than really anything that we’ve known how to build before. But that’s only one kind of context. With humans, language production and comprehension take place in a grounded context. For example, I know that I’m sitting at this table. There are objects that I can refer to, and the language models we have right now typically can’t see any of that when interacting with a human user.

There’s a broader social context that informs a lot of our language use, and these models are not, at least immediately, sensitive to or aware of it. It’s not clear how to give them information about the social context in which their language generation and language modeling take place. Another important thing is temporal context. We’re shooting this video at a particular moment in time when particular facts are true. The models that we have right now were trained on, again, a snapshot of the internet that stopped at a particular time (for most models that we have now, probably a couple of years ago), and they don’t know about anything that’s happened since then. They don’t even know at what moment in time they’re doing text generation. Figuring out how to provide all of those different kinds of contexts is also an interesting question.

Maybe one of the most surprising components here is this phenomenon called in-context learning. If I take a small ML [machine learning] dataset and feed it to the model, like movie reviews and the star ratings assigned to the movies by the critic, and give just a couple of examples of these things, language models gain the ability both to generate plausible-sounding movie reviews and to predict the star ratings. More generally, if I have a machine learning problem, I have my inputs and my outputs. If you give the model a few of those input-output pairs, then give it one more input and ask it to predict the output, the models can often do this really well.

This is a super interesting, fundamentally different way of doing machine learning, where I have this one big general-purpose model into which I can insert lots of little machine learning datasets, without having to train a new model at all, whether a classifier or a generator or whatever, specialized to my particular task. This is actually something we’ve been thinking a lot about in my group, and in some collaborations with colleagues at Google, trying to understand exactly how this in-context learning phenomenon actually comes about.
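To make the movie-review example concrete, here is a minimal sketch of how such a prompt might be assembled. It is an illustration only, not code from Andreas's group: the function name `build_few_shot_prompt`, the "Review:/Stars:" layout, and the sample reviews are all assumptions made up for this example, and no model is actually called.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, int]], query: str) -> str:
    """Format labeled (review, stars) pairs plus one unlabeled review as a prompt.

    Hypothetical helper: the "Review:/Stars:" layout is an assumed convention,
    not a format required by any particular model.
    """
    blocks = [f"Review: {review}\nStars: {stars}" for review, stars in examples]
    # The last block leaves the rating blank; whatever the model writes next
    # would be read as its predicted star rating.
    blocks.append(f"Review: {query}\nStars:")
    return "\n\n".join(blocks)


# Made-up demonstration data: the "little machine learning dataset" in the prompt.
examples = [
    ("A moving story with superb acting.", 5),
    ("Dull plot and wooden dialogue.", 1),
    ("Decent effects, forgettable script.", 3),
]

prompt = build_few_shot_prompt(examples, "Funny, sharp, and beautifully shot.")
print(prompt)
```

The resulting text would be sent to a language model as-is. The point of the sketch is that the "training set" lives entirely in the prompt: no model weights change, and swapping in different example pairs turns the same model toward a different task.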

Q: We like to believe humans are (at least somewhat) in pursuit of what is objectively and morally known to be true. Large language models, perhaps with under-defined or yet-to-be-understood “moral compasses,” aren’t beholden to the truth. Why do large language models tend to hallucinate facts, or confidently assert inaccuracies? Does that limit the usefulness for applications where factual accuracy is critical? Is there a leading theory on how we will solve this?

A: It’s well-documented that these models hallucinate facts, that they’re not always reliable. Recently, I asked ChatGPT to describe some of our group’s research. It named five papers, four of which are not papers that actually exist, and one of which is a real paper that was written by a colleague of mine who lives in the United Kingdom, with whom I’ve never co-authored. Factuality is still a big problem. Even beyond that, things involving reasoning in a really general sense, things involving complicated computations, complicated inferences, still seem to be really difficult for these models. There might even be fundamental limitations of this transformer architecture, and I believe a lot more modeling work is needed to make things better.

Why it happens is still partly an open question, but possibly, just architecturally, there are reasons that it’s hard for these models to build coherent models of the world. They can do that a little bit. You can query them with factual questions, trivia questions, and they get them right most of the time, maybe even more often than your average human user off the street. But unlike your average human user, it’s really unclear whether there’s anything that lives inside this language model that corresponds to a belief about the state of the world. I think this is both for architectural reasons, that transformers don’t, obviously, have anywhere to put that belief, and training data, that these models are trained on the internet, which was authored by a bunch of different people at different moments who believe different things about the state of the world. Therefore, it’s difficult to expect models to represent those things coherently.

All that being said, I don’t think this is a fundamental limitation of neural language models, or even of language models more generally, but something that’s true about today’s language models. We’re already seeing that models are approaching being able to build representations of facts, representations of the state of the world, and I think there’s room to improve further.

Q: The pace of progress from GPT-2 to GPT-3 to GPT-4 has been dizzying. What does the pace of the trajectory look like from here? Will it be exponential, or an S-curve that will diminish in progress in the near term? If so, are there limiting factors in terms of scale, compute, data, or architecture?

A: Certainly in the short term, the thing that I’m most scared about has to do with these truthfulness and coherence issues that I was mentioning before, that even the best models that we have today do generate incorrect facts. They generate code with bugs, and because of the way these models work, they do so in a way that’s particularly difficult for humans to spot because the model output has all the right surface statistics. When we think about code, it’s still an open question whether it’s actually less work for somebody to write a function by hand or to ask a language model to generate that function and then have the person go through and verify that the implementation of that function was actually correct.

There’s a little danger in rushing to deploy these tools right away: that we’ll wind up in a world where everything’s a little bit worse, but where it’s very difficult for people to reliably check the outputs of these models. That being said, these are problems that can be overcome. Especially given the pace at which things are moving, there’s a lot of room to address these issues of factuality and coherence and correctness of generated code in the long term. These really are tools, tools that we can use to free ourselves up as a society from a lot of unpleasant tasks, chores, or drudge work that has been difficult to automate, and that’s something to be excited about.
