How enterprise IT can protect itself from genAI unreliability
Mesmerized by the scalability, efficiency and flexibility claims from generative AI (genAI) vendors, enterprise execs have been all but tripping over themselves trying to push the technology to its limits.
The fear of flawed deliverables — driven by a combination of hallucinations, imperfect training data, and models that can disregard query specifics and ignore guardrails — is usually downplayed.
But the Mayo Clinic is trying to push back on all those problematic answers.
In an interview with VentureBeat, Matthew Callstrom, Mayo’s medical director, explained: “Mayo paired what’s known as the clustering using representatives (CURE) algorithm with LLMs and vector databases to double-check data retrieval.
“The algorithm has the ability to detect outliers or data points that don’t match the others. Combining CURE with a reverse RAG approach, Mayo’s [large language model] split the summaries it generated into individual facts, then matched those back to source documents. A second LLM then scored how well the facts aligned with those sources, specifically if there was a causal relationship between the two.”
(Computerworld reached out directly to Callstrom for an interview, but he was not available.)
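The control flow Callstrom describes is straightforward even if the models behind it are not. Here is a deliberately simplified sketch of that fact-by-fact verification loop; the sentence splitter, retrieval step and scoring function are toy stand-ins for the LLM and vector-database components involved, not Mayo’s actual code.

```python
# Illustrative only: a naive "reverse RAG" check. A generated summary is split
# into individual claims, each claim is matched back to the source documents,
# and each match is scored for support. The splitting, retrieval and scoring
# below are toy placeholders for the LLM and vector-database steps described
# above.
import re
from dataclasses import dataclass


@dataclass
class FactCheck:
    fact: str
    best_source: str
    support_score: float  # 0.0 = no support found, 1.0 = strong support


def split_into_facts(summary: str) -> list[str]:
    # Placeholder: treat each sentence as one claim (an LLM would do this step).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]


def score_alignment(fact: str, passage: str) -> float:
    # Placeholder: crude word-overlap ratio (a second LLM would judge support).
    fact_words = set(fact.lower().split())
    return len(fact_words & set(passage.lower().split())) / max(len(fact_words), 1)


def flag_unsupported_claims(summary: str, sources: dict[str, str],
                            threshold: float = 0.6) -> list[FactCheck]:
    flagged = []
    for fact in split_into_facts(summary):
        # Match each claim to the source document that best supports it.
        source_id, score = max(
            ((sid, score_alignment(fact, text)) for sid, text in sources.items()),
            key=lambda pair: pair[1],
        )
        if score < threshold:
            # Claims without adequate source support go to human review.
            flagged.append(FactCheck(fact, source_id, score))
    return flagged
```

In a production pipeline, each of those stubs would be a model or vector-search call; what matters is the shape of the loop: no claim in the summary is kept without being traced back to, and scored against, a source document.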
There are, broadly speaking, two approaches to reining in genAI’s unreliability: humans in the loop (usually, an awful lot of humans in the loop) or some version of AI watching AI.
The idea of having more humans monitoring what these tools deliver is typically seen as the safer approach, but it undercuts the key value of genAI — massive efficiencies. Those efficiencies, the argument goes, should allow workers to be redeployed to more strategic work or, as the argument becomes a whisper, to sharply reduce that workforce.
But at the scale of a typical enterprise, genAI efficiencies could replace the work of thousands of people. Adding human oversight might only require dozens of humans. It still makes mathematical sense.
The AI-watching-AI approach is scarier, although a lot of enterprises are giving it a go. Some are looking to push any liability down the road by partnering with others to do their genAI calculations for them. Still others are looking to pay third parties to come in and try to improve their genAI accuracy. The phrase “throwing good money after bad” immediately comes to mind.
The lack of effective ways to improve genAI reliability internally is a key factor in why so many proof-of-concept trials got approved quickly, but never moved into production.
Some version of throwing more humans into the mix to keep an eye on genAI outputs seems to be winning the argument, for now. “You have to have a human babysitter on it. AI watching AI is guaranteed to fail,” said Missy Cummings, a George Mason University professor and director of Mason’s Autonomy and Robotics Center (MARC).
“People are going to do it because they want to believe in the (technology’s) promises. People can be taken in by the self-confidence of a genAI system,” she said, comparing it to the experience of driving autonomous vehicles (AVs).
When driving an AV, “the AI is pretty good and it can work. But if you quit paying attention for a quick second,” disaster can strike, Cummings said. “The bigger problem is that people develop an unhealthy complacency.”
Rowan Curran, a Forrester senior analyst, said Mayo’s approach might have some merit. “Look at the input and look at the output and see how close it adheres,” Curran said.
Curran argued that identifying the objective truth of a response is important, but it’s also important to simply see whether the model is even attempting to directly answer the query posed, including all of the query’s components. If the system concludes that the “answer” is non-responsive, it can be ignored on that basis.
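In practice, that responsiveness test can itself be framed as a second model call: decompose the query into its components and ask whether the answer addresses each one. The sketch below shows one hypothetical way to structure that check; the prompt wording and JSON format are assumptions made for illustration, not any vendor’s API.

```python
# Illustrative only: a responsiveness gate in the spirit of Curran's point.
# Before judging whether an answer is accurate, check whether it even attempts
# to address every component of the question. The checker prompt would be sent
# to a separate model; the JSON shape here is an assumption.
import json


def build_responsiveness_prompt(query: str, answer: str) -> str:
    return (
        "Break the QUERY into its distinct components. For each component, say "
        "whether the ANSWER attempts to address it. Respond as JSON: "
        '{"components": [{"component": "...", "addressed": true}]}\n\n'
        f"QUERY:\n{query}\n\nANSWER:\n{answer}"
    )


def is_responsive(checker_reply_json: str) -> bool:
    # A non-responsive answer can be discarded before any fact-checking begins.
    report = json.loads(checker_reply_json)
    return all(item["addressed"] for item in report["components"])
```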
Another genAI expert is Rex Booth, CISO for identity vendor Sailpoint. Booth said that simply forcing LLMs to explain more about their own limitations would be a major help in making outputs more reliable.
For example, many — if not most — hallucinations happen when the model can’t find an answer in its training data. If the system were set up to simply say, “I don’t know,” or even the more face-saving, “The data I was trained on doesn’t cover that,” confidence in outputs would likely rise.
Booth also focused on how current a model’s data is. If a question asks about something that happened in April 2025 — and the model knows its training data was last updated in December 2024 — it should simply say so rather than making something up. “It won’t even flag that its data is so limited,” he said.
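One hypothetical way to enforce that behavior is a recency guard sitting in front of the model: if a question references a date past the known training cutoff, the system declines up front rather than improvising. The cutoff date, the crude date parsing and the wording below are all assumptions for illustration.

```python
# Illustrative only: a recency guard of the kind Booth describes. The cutoff
# date and the naive "Month YYYY" parsing are assumptions for this sketch.
import re
from datetime import date

TRAINING_CUTOFF = date(2024, 12, 31)  # assumed training-data cutoff

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def mentions_post_cutoff_date(question: str) -> bool:
    # Look for "Month YYYY" phrases that fall after the training cutoff.
    for month, year in re.findall(r"\b([A-Za-z]+)\s+(\d{4})\b", question):
        month_num = MONTHS.get(month.lower())
        if month_num and date(int(year), month_num, 1) > TRAINING_CUTOFF:
            return True
    return False


def answer_or_decline(question: str, ask_model) -> str:
    if mentions_post_cutoff_date(question):
        # Flag the limitation instead of letting the model make something up.
        return (f"The data I was trained on only runs through "
                f"{TRAINING_CUTOFF:%B %Y}, so I can't answer that reliably.")
    return ask_model(question)  # ask_model: whatever LLM call the app already uses
```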
He also said that the concept of “agents checking agents” can work well — provided each agent is assigned a discrete task.
But IT decision-makers should never assume those tasks and that separation will be respected. “You can’t rely on the effective establishment of rules,” Booth said. “Whether human or AI agents, everything steps outside the rules. You have to be able to detect that once it happens.”
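A minimal version of that detection layer is simply an audit of what each agent actually did against what its assigned task allows. The task registry and action names below are invented for the sketch; real agent frameworks expose this information differently.

```python
# Illustrative only: pairing discrete agent tasks with an out-of-bounds check,
# per Booth's caution that rules get broken and breaches must be detectable.
# The agent names and allowed actions are invented for this example.
ALLOWED_ACTIONS = {
    "summarizer_agent": {"read_document", "write_summary"},
    "reviewer_agent": {"read_document", "read_summary", "write_review"},
}


def audit_agent(agent_name: str, action_log: list[str]) -> list[str]:
    # Return every logged action that falls outside the agent's assigned task.
    allowed = ALLOWED_ACTIONS.get(agent_name, set())
    return [action for action in action_log if action not in allowed]


violations = audit_agent("summarizer_agent",
                         ["read_document", "write_summary", "send_email"])
if violations:
    # Detection, not prevention: the goal is to notice the breach when it happens.
    print(f"summarizer_agent stepped outside its task: {violations}")
```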
Another popular concept for making genAI more reliable is to force senior management — and especially the board of directors — to agree on a risk tolerance level, put it in writing and publish it. This would ideally push senior managers and execs to ask the tough questions about what can go wrong with these tools and how much damage they could cause.
Reece Hayden, principal analyst with ABI Research, is skeptical about how much senior management truly understands genAI risks.
“They see the benefits and they understand the 10% inaccuracy, but they see it as though they are human-like errors: small mistakes, recoverable mistakes,” Hayden said. But when algorithms go off track, they can make errors far more serious than a human ever would.
For example, humans often spot-check their work. But “spot-checking genAI doesn’t work,” Hayden said. “In no way does the accuracy of one answer indicate the accuracy of other answers.”
It’s possible the reliability issues won’t be fixed until enterprise environments adapt to become more technologically hospitable to genAI systems.
“The deeper problem lies in how most enterprises treat the model like a magic box, expecting it to behave perfectly in a messy, incomplete and outdated system,” said Soumendra Mohanty, chief strategy officer at AI vendor Tredence. “GenAI models hallucinate not just because they’re flawed, but because they’re being used in environments that were never built for machine decision-making. To move past this, CIOs need to stop managing the model and start managing the system around the model. This means rethinking how data flows, how AI is embedded in business processes, and how decisions are made, checked and improved.”
Mohanty offered an example: “A contract summarizer should not just generate a summary, but it should validate which clauses to flag, highlight missing sections and pull definitions from approved sources. This is decision engineering: defining the path, limits, and rules for AI output, not just the prompt.”
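A simplified sketch of that kind of wrapper might look like the following. The required-section list, the approved glossary and the checking logic are assumptions made for illustration; the point is that the model’s summary is validated against explicit rules rather than trusted on its own.

```python
# Illustrative only: a "decision engineering" wrapper around a contract
# summarizer, in the spirit of Mohanty's example. The section list, glossary
# and checks are invented for this sketch.
REQUIRED_SECTIONS = ["indemnification", "termination", "limitation of liability"]

APPROVED_DEFINITIONS = {
    "indemnification": "Definition pulled from the approved legal glossary.",
    "termination": "Definition pulled from the approved legal glossary.",
}


def review_contract_summary(summary: str, contract_text: str) -> dict:
    summary_lc = summary.lower()
    contract_lc = contract_text.lower()
    missing_from_contract = [s for s in REQUIRED_SECTIONS if s not in contract_lc]
    return {
        "summary": summary,
        # Clauses present in the contract but skipped by the summary get flagged.
        "flagged_clauses": [s for s in REQUIRED_SECTIONS
                            if s not in summary_lc and s not in missing_from_contract],
        # Sections the contract itself is missing get highlighted.
        "missing_sections": missing_from_contract,
        # Definitions come only from the approved source, never from the model.
        "definitions": {s: APPROVED_DEFINITIONS[s] for s in REQUIRED_SECTIONS
                        if s in summary_lc and s in APPROVED_DEFINITIONS},
    }
```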
There is a psychological reason execs tend to resist facing this issue. Licensing genAI models is stunningly expensive. And after making a massive investment in the technology, there’s natural resistance to pouring even more money into it to make outputs reliable.
And yet, the whole genAI game has to be focused on delivering the goods. That means not only looking at what works, but dealing with what doesn’t. There’s going to be a substantial cost to fixing things when these erroneous answers or flawed actions are discovered.
It’s galling, yes; it is also necessary. The same people who will be praised effusively for the benefits of genAI will be the ones blamed for the errors that materialize later. It’s your career — choose wisely.