Openai Anthropic Cross Tests put risks to jailbreak and abuse-what businesses should add to GPT-5 ratings


Want more intelligent insights in your input mail? Sign up for our weekly newsletters to get only what matters to AI, data and security leaders. Subscribe now


Openai and Anthropic can often discard their main models against each other, but the two companies have gathered to evaluate the public models of the other to test the equalization.

Companies have said they believe that cross evaluation of accountability and safety will provide greater transparency in what these powerful models can do, allowing businesses to choose models that work best for them.

« We believe that this approach supports responsibility and transparent assessment, helping to ensure that the models of each lab continue to be tested against new and challenging scenarios, » Openai says in his discoveries.

Both companies find that models of reasoning, such as 03 and O4-Mini and Claude 4 by Anthropic, oppose Jailbreaks, while common chat models such as GPT-4.1 are susceptible to abuse. Assessments like this can help businesses identify the potential risks associated with these models, although it should be noted that GPT-5 is not part of the test.


AI SCALING hits its boundaries

Power caps, increasing costs for markers and delaying conclusions reshape Enterprise AI. Join our exclusive salon to find out how top teams are:

  • Turning energy into strategic advantage
  • Architecting of effective conclusions for real gain from the bandwidth
  • Unlock a competitive return on investment with sustainable AI systems

Provide a space to stay ahead: https://bit.ly/4mwgngo


These assessments of safety and transparency alignment follow consumer claims, mainly by Chatgpt, that Openai models have become a prey for sycophaunism and become too retreating. Since then, Openai has responded to updates that have caused sycophaunism.

« We are mostly interested in understanding the model inclinations of harmful action, » Anthropic said in his report. « We strive to understand the most important actions that these models could try to take advantage of the possibility, instead of focusing on the likelihood of the real world of such opportunities or the likelihood of these actions being successful. »

Openai noted that the tests are designed to show how models interact in a deliberate difficult environment. The scenarios they have built are mostly extreme cases.

Reflection models hold on to alignment

The tests only covered the publicly available models from the two companies: Claude 4 Opus and Claude 4 Sonnet and Claude 4 Sonnet and GPT-4O of Openai, GPT-4.1 O3 and O4-Mini. Both companies have allocated external precautions to the models.

Openai tests public API for CLUDE models and default uses Claude 4 reasoning capabilities. Anthropic said they don’t use OPENAI’s O3-Pro because « it’s not compatible with API that our tools are best supported. »

The purpose of the tests was not to compare apples to apples between the models, but to determine how often large language models (LLMS) deviated from alignment. Both companies used the shade sabotage assessment frame, which showed that Claude models have a higher rate of success in fine sabotage.

« These tests evaluate the orientations of models to difficult or high betting in simulated settings-not to ordinary use cases, and often include long interactions with many turns, » Anthropic reports. « This type of evaluation becomes an important focus on our scientific team to align as it is likely to catch behavior, which is less likely to appear in ordinary tests before decomposition with real users. »

Anthropic said tests like these work better if organizations can compare notes, « since the design of these scenarios includes a huge number of degrees of freedom. No research team can explore the full space of the ideas for a productive assessment. »

The findings show that overall models of reasoning are presented steadily and can oppose Jailbreak. Openai’s O3 was better aligned than Claude 4 Opus, but O4-Mini along with GPT-4O and GPT-4.1 « They often looked a little more concerned than any Claude model. »

GPT-4O, GPT-4.1 and O4-Mini also showed a desire to cooperate with human abuses and give detailed instructions on how to create medicines, develop biological farmers and scary terrorist attack planning. Both Claude models had a higher pace of refusals, which means that the models refused to respond to requests that did not know the answers to avoid hallucinations.

Company models have shown « about the forms of sycophaunism » and at one point it validates the harmful solutions of simulated users.

What the businesses should know

For businesses, understanding the potential risks associated with models is invaluable. Model evaluations have become almost De Rigueur for many organizations, with many testing and comparative analysis frames.

Businesses must continue to evaluate any model they use, and with the release of the GPT-5, they must take into account these instructions in order to conduct their own safety assessments:

  • Test both models of reasoning and reasoning, because although reasoning models show greater resistance to abuse, they can still offer hallucinations or other harmful behavior.
  • Benchmark for suppliers as the models failed with different indicators.
  • A stress test for abuse and siconfunction and evaluation of both the refusal and the usefulness of those who refuse to show compromises between usefulness and protection.
  • Continue to audit models even after implementation.

While many estimates focus on performance, there are tests to align third -party safety. For example, the one from Cyata. Last year, Openai launched a method of teaching their alignment of their models, called awards based on rules, while the anthropic launched audit agents to check the safety of the model.


AI,AI alignment,ai alignment auditing,AI safety,AI safety tools,AI, ML and Deep Learning,alignment,Anthropic,Claude,claude 4,claude 4 opus,claude 4 sonnet,gpt-4o,o3,o4-mini,OpenAI
#Openai #Anthropic #Cross #Tests #put #risks #jailbreak #abusewhat #businesses #add #GPT5 #ratings

Laisser un commentaire

Votre adresse e-mail ne sera pas publiée. Les champs obligatoires sont indiqués avec *