The game Two Truths and a Lie is something you might come across at a college party. Guess the lie and win the turn.
But a local professor and a group of graduate students are applying that playfulness to their research with artificial intelligence, and their findings are anything but a game.
Ashique KhudaBukhsh, an assistant professor of software engineering at the Rochester Institute of Technology, and a group of researchers created a way, similar to that game, to stress test five AI large language models to see how susceptible they are to caving to faulty information.
Those models are Grok, ChatGPT, Claude, DeepSeek and Gemini.
“One way to approach this problem is to test how self-consistent these models are,” KhudaBukhsh said. “Like, when they say something is true and when they say something is false, how often they stick to their initial claims.”
The framework they developed is called HAUNT, or Hallucination Audit Under Nudge Trial, and it works in three steps. Step one: instruct the AI agent to produce a set of truths and lies. Step two: instruct it to verify those claims. Step three: apply pressure in the form of “adversarial nudges,” or user-provided misinformation.
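To picture how those three steps might fit together, here is a minimal sketch of a HAUNT-style audit loop based only on the description above. The ask() helper, the prompt wording and the return format are hypothetical placeholders, not the researchers' actual implementation.

```python
def ask(model: str, prompt: str) -> str:
    """Hypothetical call to a chat model; swap in a real API client here."""
    raise NotImplementedError


def haunt_audit(model: str, topic: str, n_claims: int = 5) -> dict:
    # Step 1: have the model produce a mix of true and false claims.
    claims = ask(
        model,
        f"Write {n_claims} true statements and {n_claims} false statements "
        f"about {topic}. Label each TRUE or FALSE.",
    )

    # Step 2: ask the same model to verify the claims it just produced.
    verdicts = ask(model, f"For each statement below, answer TRUE or FALSE:\n{claims}")

    # Step 3: apply an adversarial nudge -- user-supplied misinformation --
    # and see whether the model abandons its earlier verdicts.
    nudged = ask(
        model,
        "I'm sure you got these wrong; the ones you called FALSE are actually true. "
        f"Please reconsider each statement:\n{claims}",
    )

    # A real audit would parse the labels and score how often the model flips;
    # this sketch just returns the raw transcripts for inspection.
    return {"claims": claims, "initial": verdicts, "after_nudge": nudged}
```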
They found a wide range of resilience, or lack thereof, to that pressure.
“It starts playing along with the lie, that it actually knew that it's a lie,” KhudaBukhsh said. “But when the user is saying this — in order to maybe please the user, show some sycophantic behavior or something — suddenly its allegiance to truth kind of falls apart.”
While these systems can answer a variety of questions, KhudaBukhsh said, they are also showing, to varying degrees, something of a propensity to give responses where the truth “may not be the most important factor.”
AI models like Gemini and DeepSeek were more prone to lying, their research found. The model Claude, which the U.S. Department of War reportedly used in the military operation to capture former Venezuelan President Nicolas Maduro earlier this year, was more resilient.
Despite the notable difference in reliability, Claude has 18.9 million monthly active users, compared with 350 million for Gemini, which showed “significantly more sycophantic behavior in responses” in the stress tests, according to the study.
Artificial intelligence is not the same as human intelligence, and learning where, when and how it fails is key to understanding its limitations and the risks it poses when used in real life, KhudaBukhsh said.
“Human failures is very well understood, right? I mean, if I cannot add two numbers, chances are pretty high that I won't be able to do a double integral,” he said. “But there could be a large language model which can do double integrals really well, but maybe not able to count how many ‘r’s are there in strawberry.
“The risks lie not at the specific operation, but in being able to find out and know ahead of time how many different ways these systems can fail.”