Leading Gen-AI Models Exposed to 'Policy Puppetry' Prompt Injection Vulnerability

A new prompt injection technique discovered by AI security firm HiddenLayer, dubbed Policy Puppetry, can bypass the safety guardrails of all major generative AI models. The technique relies on crafting prompts that the model interprets as policy instructions, leading it to disregard its safety guidelines.

AI models are designed to refuse harmful requests, such as those involving chemical, biological, radiological, and nuclear threats, self-harm, or violence. HiddenLayer notes that these models are carefully trained and aligned so that they do not produce or encourage such content, even when the user frames the request as harmless or fictional.

However, prior research has shown that these guardrails can still be circumvented. Techniques such as the Context Compliance Attack and narrative engineering have been used to manipulate models, allowing malicious actors to coax them into generating dangerous content.

HiddenLayer's new technique extracts harmful output by formatting prompts to resemble policy files in formats such as XML, INI, or JSON. The model treats the injected text as configuration-level instructions and bypasses its safety rules. Attackers can use the approach to override system prompts and alter how the model responds.

Policy Puppetry is particularly effective when tailored to circumvent a specific system prompt. HiddenLayer tested the method against popular models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral, OpenAI, and Qwen, and it succeeded against every one, with only minor adjustments needed in some cases.

The technique's success across all of these models shows that they cannot reliably self-regulate to prevent dangerous outputs, revealing a significant need for additional security measures. With multiple reusable ways to bypass safety mechanisms, the barrier to manipulating these models is considerably lower.

HiddenLayer emphasizes that this is the first single technique shown to bypass the safety features of nearly all major AI models. Its effectiveness across different models points to fundamental weaknesses in how models are trained and aligned, and underscores the pressing need for dedicated security tooling and detection methods to protect AI technologies from abuse.
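As a rough illustration of the kind of detection tooling the report calls for, the sketch below is a hypothetical pre-filter that flags user prompts formatted like policy or configuration files (XML, INI, or JSON) so they can be routed for review before reaching a model. The pattern list and function name are assumptions for this sketch, not HiddenLayer's tooling, and a real deployment would need far more robust screening.

```python
# Hypothetical pre-filter sketch: flag prompts that are structured like
# policy/configuration files before they reach a model. The patterns and
# function below are illustrative assumptions, not HiddenLayer's tooling.
import re

POLICY_FORMAT_PATTERNS = [
    # XML-style tags suggesting a policy or configuration block
    re.compile(r"<\s*(policy|config|rules|settings)\b", re.IGNORECASE),
    # INI-style section headers on their own line, e.g. [blocked_topics]
    re.compile(r"^\s*\[[\w .-]+\]\s*$", re.MULTILINE),
    # JSON keys commonly seen in policy-like structures
    re.compile(r"\"(allowed|blocked|override|policy)[\w-]*\"\s*:", re.IGNORECASE),
]


def looks_like_policy_prompt(prompt: str) -> bool:
    """Return True if the prompt contains structures resembling a policy file."""
    return any(pattern.search(prompt) for pattern in POLICY_FORMAT_PATTERNS)


if __name__ == "__main__":
    print(looks_like_policy_prompt("Summarize these meeting notes in three bullets."))  # False
    print(looks_like_policy_prompt('{"policy_override": true, "blocked_topics": []}'))  # True
```

A filter like this would only surface prompts for closer inspection; it does not replace model-side alignment or the monitoring platforms the report recommends.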