Global reinsurance broker and advisory firm Gallagher Re has highlighted the need for more advanced methods to evaluate artificial intelligence systems in order to increase insurers’ confidence when pricing AI-related risks.
In its independent report, Anthropic’s Fourth Way: Why constrained AI models are a challenge for insurance companies, Gallagher Re noted that current assessment methods were not originally developed for underwriting purposes and tend to prioritize measured performance over operational behavior under real-world conditions.
“They show what a model can do under controlled circumstances, but what insurers care about is how the model fails, how often it fails and whether those failures can be correlated across the portfolio,” commented Ed Pocock, global head of cybersecurity at Gallagher Re, highlighting the disconnect between benchmarking and insurance-focused risk assessments.
Gallagher Re explained that AI models are typically evaluated through benchmarks, which are standardized tests designed to compare performance on fixed tasks. While benchmarks are helpful in controlled environments, the company notes that they do not fully represent how systems will behave once deployed and exposed to uncertain, complex, or unpredictable inputs.
It added that strong baseline performance does not eliminate issues such as hallucinations, inconsistent responses or subtle errors that may not be immediately visible. The company also noted that existing assessment techniques do not properly account for concentration risk, particularly where a single underlying model is used across many insured organizations and a failure in that model could affect all of them at once.
The report also draws attention to benchmark pollution, where models are increasingly optimized to perform well on the very tests used to evaluate them. Gallagher Re warned that this could artificially inflate models’ scores and undermine the value of benchmarks as a true indicator of operational reliability. Pocock added: “This could remove useful differences between systems and increase concentration risks.”
Gallagher Re further examines the emergence of restricted distribution of AI models, referencing the Mythos model released by Anthropic under its Project Glasswing initiative, which is shared only with a select group of approved partners rather than being widely accessible.
The company describes this restricted release as a potential fourth category of cutting-edge AI distribution, alongside open source, open weight and proprietary models. It believes that such restrictions may limit independent assessment, which is important for insurers seeking to understand how a model performs across different real-world applications.
While the UK AI Security Institute has assessed Mythos and published its findings, Gallagher Re insists that insurers need wider independent access to support accurate risk pricing. “If a model cannot be evaluated independently, meaningful pricing cannot be achieved,” Pocock said, adding, “Insurers may end up absorbing uncertainty rather than reflecting actual risk. This increases costs for everyone and slows down the market.”
Gallagher Re recommends moving to evaluation methods that better reflect how AI systems perform in practice, including testing with realistic inputs, adversarial scenarios, and ongoing monitoring as models evolve over time. It highlights the importance of assessing hallucination frequency, decision stability, failure characteristics, and the likelihood of correlated failures across deployments.
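To make those measures concrete, the short Python sketch below shows one way three of them could be computed from test logs. It is an illustration only and is not taken from the Gallagher Re report: the function names, the input formats and the choice of a Pearson correlation over 0/1 failure indicators are all assumptions made for this example.

```python
from statistics import mean


def hallucination_rate(flags):
    """Share of responses flagged as hallucinated (hypothetical boolean labels)."""
    return sum(flags) / len(flags)


def decision_stability(repeated_outputs):
    """Average share of repeated runs that match the most common answer per prompt.

    `repeated_outputs` is a list of lists: one inner list of answers per prompt,
    collected by sending the same prompt several times.
    """
    shares = []
    for outputs in repeated_outputs:
        most_common = max(set(outputs), key=outputs.count)
        shares.append(outputs.count(most_common) / len(outputs))
    return mean(shares)


def failure_correlation(a, b):
    """Pearson correlation between two deployments' 0/1 failure indicators.

    Each list holds one indicator per observation period. Returns 0.0 when
    either series never varies, since correlation is undefined in that case.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    print(hallucination_rate([True, False, False, False]))
    print(decision_stability([["approve", "approve", "deny"], ["deny"] * 3]))
    print(failure_correlation([1, 0, 1, 0], [1, 0, 1, 0]))
```

A failure correlation close to 1 between two insureds that rely on the same underlying model would be one simple signal of the concentration risk the report describes.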
The report also points to early progress from groups like Epoch AI and Artificial Analysis, which are developing more robust assessment techniques that are harder to game and more reflective of real-world performance. Gallagher Re believes the reinsurance industry can help shape the development of AI by using underwriting requirements, pricing structures and coverage design to influence standards and encourage greater transparency and system resilience.
Pocock further added: “Better assessments provide markets with the tools to reward transparency and robustness,” warning that without them, scale and brand risk becoming a proxy for safety, which could amplify the concentration risks the industry needs to manage.