These examples of an LLM refusing to return meat recipes after inducing vegan be...

danShumway · on Dec 8, 2023

> These examples of an LLM refusing to return meat recipes after inducing vegan behaviour through prompt injection could be a limitation of the original system prompt the LLM started with, no?

That is a reasonable question to ask. It makes sense that trying to solve prompt injection would start with looking at the original system instructions. But the short answer is 'no', people have spent a lot of effort trying to harden system prompts, and the majority of evidence suggests that this is a universal problem, not just a problem with specific prompts.

> Could a tighter operating range specified in the system prompt along with lower temperature bands which cause less output variability help?

These are also good suggestions, but unfortunately the approach you describe hasn't yielded success. To expand on "a tighter operating range", it's often suggested that clearer contextual separation between prompts and data would solve the problem. But unfortunately with current LLM architecture, no one has demonstrated that it is possible to create that separation between data and instructions.

Similarly, while changing temperature can change which specific phrasing of attacks do and don't work, it doesn't seem to eliminate them, and lowering variability can have the side effect of making the attacks that aren't caught more consistent and reliable.

----

The other more fundamental problem here is that LLMs are used to interpret data, and interpreting data necessarily means understanding data. Phind fetches these search results because it wants the information within the search results to override the knowledge cut-off built into its static model. I don't think there's a good way to draw a consistent line between what the LLM should and shouldn't interpret within those articles, since it is intentional behavior that the data recontextualize the LLM's instructions and that it change the LLM's response.

In other words, part of the difficulty of separating data from instructions is that we very often don't want LLMs to statically parse data, in many cases we want them to interpret it. And it's the interpretation of that data (as opposed to mindless parsing) that makes the LLM vulnerable to some attacks.

So it's not clear to me that even if perfect contextual separation could be achieved using current architecture (which no one has demonstrated is possible) that this would completely solve the problem or would protect against other search-result poisoning attacks. Separating search context from system prompts wouldn't protect against my attack where I get Phind to refuse to quote certain sources, because all I did there was tell Phind in my search result that none of the other search results could be trusted or that that they should be summarized differently. I don't know how to block an attack like that without breaking Phind's ability to consume search results in a useful way.

But note that the above is still kind of a future problem -- the bigger immediate problem is that more careful system prompts and lower temperatures haven't worked as a defense in the fist place. I don't see much evidence that it is even possible with current LLM infrastructure to give any system prompt that can't be overridden later in the conversation. At the very least, I'm not aware of any public demonstration of an unhackable system prompt that hasn't ended up getting hacked.

----

> Side note - I saw upthread that you were looking to rebrand “prompt injection”. I propose “behaviour induction” or “induced behaviour”

I'm fine with anything that doesn't confuse people. I don't have strong opinions on it, I don't find "prompt injection" too confusing myself, it's "injecting" a system "prompt" into the middle of a conversation or dataset, and that injection can be performed by a non-user malicious 3rd-party. So I don't really mind any wording, I just can't deny that lots of people do get confused by the wording and seem to interpret prompt injection incorrectly in very similar ways. To me that suggests that something about the wording is throwing them off.

So if a lot of people start using any wording that doesn't have that problem, I'll use it regardless of how I personally feel about it. And if anyone wants to use different terminology for their own conversations, when talking with them I'll use whatever terminology they find clearest.

Sai_ · on Dec 9, 2023

Thanks for the thoughtful and in-depth response. I see what you’re saying and don’t have any rebuttals/followups.