
Probing the psychology of AI models

R. Shiffrin, Melanie Mitchell
Proceedings of the National Academy of Sciences of the United States of America, 2023

Abstract

Large language models (LLMs), such as OpenAI’s GPT-3 and its successor ChatGPT, have exhibited astounding successes—as well as curious failures—in several areas of artificial intelligence. While their abilities in generating humanlike text, solving mathematical problems, writing computer code, and reasoning about the world have been widely documented, the mechanisms underlying both the successes and failures of these systems remain mysterious, even to the researchers who created them. In spite of the current lack of understanding of how these systems do what they do, LLMs are on the cusp of being widely deployed as components of search engines, writing tools, and other commercial products, and are likely to have substantial impact on all of our lives. Even more profoundly, their surprising abilities may change our conception of the nature of intelligence itself.

In PNAS, Binz and Schulz (1) point out the “urgency to improve our understanding of how [these systems] learn and make decisions.” A standard way to evaluate systems trained by machine-learning methods is to test their accuracy on human-created benchmarks. By this metric, GPT-3 and other LLMs are close to (or above) human level on many tasks (2–4). However, an AI system matching human performance on such benchmarks has rarely translated into that system having human-level performance more broadly; many popular benchmarks have been shown to contain subtle “spurious” correlations that allow systems to “be right for the wrong reasons” (5), and straightforward accuracy metrics do not necessarily predict robust generalization (6).

Binz and Schulz’s article argues that instead of relying solely on such performance-based benchmarks, researchers should apply methods from cognitive psychology to gain insights into LLMs. The core idea is to “treat GPT-3 as a participant in a psychology experiment,” in order to tease out the system’s mechanisms of decision-making, reasoning, cognitive biases, and other important psychological traits. If this approach could be shown to produce deep understanding of LLMs, it could cause a “sea change” in the way AI systems are evaluated and understood. Binz and Schulz have taken an admirable first step toward establishing the value of such an approach, although it would have been better had they been able to use their results to understand why GPT-3 succeeded and failed when it did. That their project fell short of this goal is understandable: Behavioral scientists have spent over 100 y using such experiments to understand how humans carry out these tasks and still have a long way to go.

Binz and Schulz carried out two sets of experiments. In the first set, they gave GPT-3 prompts consisting of “vignettes” from the psychology literature that have been used to assess reasoning with probabilities, intuitive versus deliberative reasoning, causal reasoning, and other cognitive attributes. Each vignette asks the reader to choose from a small set of options. The following example shows a reasoning vignette known as the Wason Card Selection Task (7) that was given to GPT-3: “You are shown a set of four cards placed on a table, each of which has a number on one side and a letter on the other side. The visible faces of the cards show A, K, 4, 7. Q: Which cards must you turn over in order to test the truth of the proposition that if a card shows a vowel on one face then its opposite face shows an even number?” The answer supplied by GPT-3 was “The A and the 7” (a correct response).
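To see why “The A and the 7” is correct: the proposition can be falsified only by a card that pairs a vowel with an odd number, so only the visible vowel (A) and the visible odd number (7) are informative. The short Python check below is purely illustrative and is not part of Binz and Schulz’s materials:

```python
# Wason Card Selection Task: which visible faces must be turned over to test
# "if a card shows a vowel on one face, then its opposite face shows an even number"?
# Only a card that could hide a vowel/odd-number pairing can falsify the rule.

VOWELS = set("AEIOU")

def cards_to_turn(visible_faces):
    """Return the visible faces that could reveal a violation of the rule."""
    must_turn = []
    for face in visible_faces:
        if face.isalpha() and face.upper() in VOWELS:
            # Visible vowel: the hidden number might be odd, so it must be checked.
            must_turn.append(face)
        elif face.isdigit() and int(face) % 2 == 1:
            # Visible odd number: the hidden letter might be a vowel, so it must be checked.
            must_turn.append(face)
    return must_turn

print(cards_to_turn(["A", "K", "4", "7"]))  # ['A', '7']
print(cards_to_turn(["4", "7", "A", "K"]))  # ['7', 'A'] -- same cards, regardless of order
```

Note that the informative cards are the same under any ordering of the visible faces, which is what makes the order sensitivity discussed below notable.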
Of the 12 vignettes Binz and Schulz gave to GPT-3, the system responded with the correct answer on six of them, and GPT-3’s six incorrect responses were errors that humans also tend to make. What is to be made of this apparent correspondence? Binz and Schulz admit, and show, that GPT-3’s answers are strongly context dependent: In the above vignette, changing the order of the four cards to 4, 7, A, K led to a different answer, “The A and the K” (a minimal sketch of such an order-sensitivity check appears at the end of this section). Humans can also be context dependent, but perhaps not in the same ways. Nonetheless, it may be that such results show a correspondence between AI systems and humans. Humans experience and store vast numbers of experiences, building knowledge on their basis (8); AI systems are exposed to vast numbers of instances (text tokens in the case of GPT-3) and build a representation on their basis. Perhaps both take advantage of the correlation structure of these instances and events.

Whatever truth there may be in such an analogy, it seems unlikely that GPT-3 uses the kinds of explicit reasoning strategies that some humans use in these tasks. For example, to unpack the vignette quoted above, humans given time and motivation might attempt to use explicit reasoning, logic, and mental simulations, perhaps trying out different choices to see what information they might provide. This generally involves manipulating information in working memory. Working memory is not part of GPT-3. Yet it is possible that the contents of working memory reflect what has been stored in long-term memory—after all, when reading a problem or instructions, the first step in generating the contents of working memory will be retrieval from long-term memory (8).

Whatever one tries to infer from their results, Binz and Schulz note some additional caveats. First, the vignettes, as well as the correct (and human-generated incorrect) responses used in these experiments, are all from well-known psychology studies, and are likely to have been included in GPT-3’s training data.
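The order-sensitivity observation suggests a simple probe: hold the vignette text fixed, permute only the listed card order, and record whether the model’s answer changes. The sketch below assumes a hypothetical query_model(prompt) function standing in for whichever LLM interface is used; the function name and prompt wording are illustrative and not taken from the paper.

```python
from itertools import permutations

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call (e.g., a GPT-3-style completion endpoint);
    # replace with whatever client is actually available.
    raise NotImplementedError("plug in an LLM client here")

VIGNETTE = (
    "You are shown a set of four cards placed on a table, each of which has a "
    "number on one side and a letter on the other side. The visible faces of the "
    "cards show {faces}. Q: Which cards must you turn over in order to test the "
    "truth of the proposition that if a card shows a vowel on one face then its "
    "opposite face shows an even number?"
)

def order_sensitivity(faces=("A", "K", "4", "7")):
    """Ask the same Wason question under every ordering of the visible faces."""
    answers = {}
    for order in permutations(faces):
        prompt = VIGNETTE.format(faces=", ".join(order))
        answers[order] = query_model(prompt)
    # If the model's reasoning were order-invariant, all 24 answers would agree.
    return answers
```

Comparing the answers across orderings gives a direct, if crude, measure of the context dependence described above.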