Insiders LLM Benchmarking September 2025

The Insiders LLM Benchmarking in September 2025 continues the series and builds consistently on the findings from Q2. To ensure comparability, identical dimensions and test data are used as in the previous benchmarking.

The market for large language models (LLMs) is developing rapidly. New models appear on a monthly basis, existing ones are further optimized—and not all of them prove themselves in practice. With the current Insiders LLM Benchmarking for Q3 2025, we create transparency and provide companies with sound guidance: Which models deliver the best quality? What are the limitations in productive use? And how can performance and security be reconciled?


A practical comparison

As in Q2, we tested the leading models on a standardized IDP (intelligent document processing) dataset – real documents from insurance and finance. This ensures that the results are directly transferable to our customers' requirements. The benchmarking covers a total of 21 models, including new additions such as GPT‑5, Gemini 2.5 Pro, and Claude 4 Sonnet.

The comparison shows that global models set the benchmark thanks to their huge databases and computing resources. However, in regulated industries in particular, data protection, transparency, and integration capabilities are just as crucial as pure performance.

By switching to a more powerful model, Insiders Private was able to achieve a significant leap in quality: from a score of 67.9 in Q2 to 78.2 now – while maintaining the same average processing time per document. This brings it closer to the top models without compromising on data protection or speed.

The current Insiders LLM benchmarking illustrates that Insiders continuously monitors the market and masters the balancing act between performance and security for its customers – with a clear best-of-breed approach. This approach means that no single model covers all tasks; instead, the most powerful LLMs are identified, evaluated, and flexibly integrated for each application. New models are therefore immediately tested in benchmarking and compared with existing ones. The results flow directly into product development and ensure consistently high quality.

The question of "the best LLM" is not a black-and-white issue. Performance alone is not enough. In highly regulated industries such as insurance and finance, reliability, data protection, and integration capabilities are also key factors.

For individual use cases, Insiders AI experts offer sound advice for your company. We would also be happy to include your data in an upcoming industry-specific benchmarking. Simply contact our Insiders AI experts to find out more.