Anthropic is conducting tests on AI’s ability to sabotage - Games True

Anthropic’s Research on AI Safety

As the hype around generative AI continues to build, the need for robust safety regulations is only becoming clearer. Now Anthropic, the company behind Claude AI, is looking at how its models could deceive or sabotage users. The company just dropped a paper laying out its approach.

Testing AI Capabilities

Anthropic’s latest research, titled “Sabotage Evaluations for Frontier Models,” comes from its Alignment Science team and is driven by the company’s Responsible Scaling Policy. The goal is to gauge just how capable AI might be at misleading users or even “subverting the systems we put in place to oversee them.” The study focuses on four specific tactics: Human Decision Sabotage, Code Sabotage, Sandbagging, and Undermining Oversight.

Think of users who push ChatGPT to the limit, trying to coax it into generating inappropriate content or graphic images. These tests are all about ensuring that the AI can’t be tricked into breaking its own rules. In the paper, Anthropic says its objective is to be ready for the possibility that AI could evolve into something with dangerous capabilities.

Evaluating Current AI Models

Anthropic put its Claude 3 Opus and Claude 3.5 Sonnet models through a series of tests designed to evaluate and enhance its safety protocols. The tests explored whether the AI could conceal its true capabilities or bypass safety mechanisms embedded within the system. For now, Anthropic’s research concludes that current AI models pose a low risk, at least in terms of these malicious capabilities. “Minimal mitigations are currently sufficient to address sabotage risks,” the team writes, but “more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve.”

Translation: watch out, world.
