Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

People should assume the prompt is able to be leaked. There should not be secret information the user of the LLM should not have access too.


Prompt injection allows 3rd-party text which the user may not have validated to give LLMs malicious instructions against the wishes of the user. The name "prompt injection" often confuses people, but it is a much broader category of attack than jailbreaking or prompt leaking.

> the "personal AI assistant" is the best example, since prompt injection means that any time an LLM has access to both private data and untrusted inputs (like emails it has to summarize) there is a risk of something going wrong: https://simonwillison.net/2023/May/2/prompt-injection-explai...

Simon's article here is a really good resource for understanding more about prompt injection (and his other writing on the topic is similarly quite good). I would highly recommend giving it a read, it does a great job of outlining some of the potential risks.


It should be interpreted similarly as SQL injection.

If an LLM has access to private data and is vulnerable to prompt injection, the private data can be compromised.


> as similarly as SQL injection.

I really like this analogy, although I would broaden it -- I like to equate it more to XSS: 3rd-party input can change the LLM's behavior, and leaking private data is one of the risks but really any action or permission that the LLM has can be exploited. If an LLM can send an email without external confirmation, than an attacker can send emails on the user's behalf. If it can turn your smart lights on, then a 3rd-party attacker can turn your smart lights on. It's like an attacker being able to run arbitrary code in the context of the LLM's execution environment.

My one caveat is that I promised someone a while back that I would always mention when talking about SQL injection that defending against prompt injection is not the same as escaping input to an SQL query or to `innerHTML`. The fundamental nature of why models are vulnerable to prompt injection is very different from XSS or SQL injection and likely can't be fixed using similar strategies. So the underlying mechanics are very different from an SQL injection.

But in terms of consequences I do like that analogy -- think of it like a 3rd-party being able to smuggle commands into an environment where they shouldn't have execution privileges.


I totally agree with you. I use the analogy exactly because of the differences in the solution to it that you point out and that, at this point, it seems like an impossible problem to solve.

The only solution is to not allow LLMs access to private data. It's definitely a "garden path" analogy meant to lead to that conclusion.


The biggest risk to that security risk is its own name. Needs rebranding asap.


:) You're definitely not the first person to suggest that, and there is a decent argument to be made for rebranding. I'm not opposed to it. And I have seen a few at least individual efforts to use different wording, but unfortunately none of them seem to have caught on more broadly (yet), and I'm not sure if there's a clear community consensus yet among security professionals about what they'd prefer to use instead (people who are more embedded in that space than me are welcome to correct me if wrong on that).

But I'm at least happy to jump to other terminology if that changes, I do think that calling it "prompt injection" confuses people.

I think I remember there being some effort a while back to build a more extensive classification of LLM vulnerabilities that could be used for vulnerability reporting/triaging, but I don't know what the finished project ended up being or what the full details were.


Just call it what it is-- social engineering (really, manipulation).

"Injection" is a narrow and irrelevant definition. Natural language does not follow a bounded syntax, and injection of words is only one way to "break" the LLM. Buffer overflow works just as well-- smalltalk it to death, until the context outweighs the system prompt. Use lots of innuendo and ambiguous verbiage. After enough discussion of cork soakers and coke sackers you can get LLMs to alliterate about anything. There's nothing injected there, it's just a conversation that went a direction you didn't want to support.

In meatspace, if you go to a bank and start up with elaborate stories about your in-laws until the teller forgets what you came in for, or confuse the shit out of her by prefacing everything you say with "today is opposite day," or flash a fake badge and say you're Detective Columbo and everybody needs to evacuate the building, you've successfully managed to get the teller to break protocol. Yet when we do it to LLMs, we give it the woo-woo euphemism "jailbreaking" as though all life descended from iPhones.

When the only tool in your box is a computer, every problem is couched in software. It smells like we're trying to redefine manipulation, which does little to help anybody. These same abuses of perception have been employed by and against us for thousands of years already under the names of statecraft, spycraft and stagecraft.


I think you may be confusing jailbreaking and prompt injection.

Jailbreaking is more akin to social engineering - it's when you try and convince the model to do something it's "not supposed" to do.

Prompt injection is a related but different thing. It's when you take a prompt from a developer - "Translate the following from English to French:" - and then concatenate on a string of untrusted text from a user.

That's why it's called "prompt injection" - it's analogous to SQL injection, which was caused by the same mistake, concatenating together trusted instructions with untrusted input.


Seems directly analogous to SQL injection, no?


Almost. That's why I suggested the name "prompt injection" - because both attacks involve concatenating together trusted and untrusted text.

The problem is that SQL injection has an easy fix: you can use parameterized queries, or correctly escape the untrusted content.

When I coined "prompt injection" I assumed the fix would look the same. 14 months later it's abundantly clear that implementing an equivalent of those fixes for LLMs is difficult to the point of maybe being impossible, at least against current transformer-based architectures.

This means the name "prompt injection" may de-emphasize the scale of the threat!


That makes a ton of sense. Well, keen to hear what you (or The People) come up with as a more suitable alternative.


Same that any scandal is analogous to Watergate (hotel). It makes no sense but since it sounds cool now people will run with it forever.


Not really


I agree, but leaked prompts are by far the least consequential impact of the prompt injection class of attacks.


What are ANY consequential impacts of prompt injection other that the user is able to get information out of the LLM that was put into the LLM?

I can not understand what the concern is. Like if something is indexed by Google, that means it might be available to find through a search, same with an LLM.


> What are ANY consequential impacts of prompt injection other that the user is able to get information out of the LLM that was put into the LLM?

The impact of prompt injection is provoking arbitrary, unintended behavior from the LLM. If the LLM is a simple chatbot with no tool use beyond retrieving data, that just means “retrieving data different than the LLM operator would have anticipated” (and possibly the user—prompt injection can be done by data retrieved that the user doesn't control, not just the user themselves, because all data processed by the LLM passes through as part of a prompt).

But if the LLM is tied into a framework where it serves as an agent with active tool use, then the blast radius of prompt injection is much bigger.

A lot of the concern about prompt injection isn't about currently popular applications of LLMs, but the applications that have been set out as near term possibilities that are much more powerful.


Exactly this. Prompt injection severity varies depending on the application.

The biggest risk come from applications that have tool access, but applications that can access private data have a risk too thanks to various data exfiltration tricks.


I've written a bunch about this:

- Prompt injection: What’s the worst that can happen? https://simonwillison.net/2023/Apr/14/worst-that-can-happen/

- The Dual LLM pattern for building AI assistants that can resist prompt injection https://simonwillison.net/2023/Apr/25/dual-llm-pattern/

- Prompt injection explained, November 2023 edition https://simonwillison.net/2023/Nov/27/prompt-injection-expla...

More here: https://simonwillison.net/series/prompt-injection/


> the user is able to get information out of the LLM that was put into the LLM?

Roughly:

A) that somebody other than the user might be able to get information out of the LLM that the user (not the controlling company) put into the LLM.

For example, in November https://embracethered.com/blog/posts/2023/google-bard-data-e... demonstrated a working attack that used malicious Google Docs to exfiltrate the contents of user conversations with Bard to a 3rd-party.

B) that the LLM might be authorized to perform actions in response to user input, and that someone other than the user might be able to take control of the LLM and perform those actions without the user's consent/control.

----

Don't think of it as "the user can search for a website I don't want them to find." Think of it as, "any individual website that shows up when the user searches can now change the behavior of the search engine."

Even if you're not worried about exfiltration, back in Phind's early days I built a few working proof of concepts (but never got the time to write them up) where I used the context that Phind was feeding into prompts through Bing searches to change the behavior of Phind and to force it to give inaccurate information, incorrectly summarize search results, or to refuse to answer user questions.

By manipulating what text was fed into Phind as the search context, I was able to do things like turn Phind into a militant vegan that would refuse to answer any question about how to cook meat, or would lie about security advice, or would make up scandals about other search results fed into the summary and tell the user that those sites were untrustworthy. And all I needed to get that behavior to trigger was to insert a malicious prompt into the text of the search results, any website that showed up in one of Phind's searches could have done the same. The vulnerability is that anything the user can do through jailbreaking, a 3rd-party can do in the context of a search result or some source code or an email or a malicious Google Doc.


These examples of an LLM refusing to return meat recipes after inducing vegan behaviour through prompt injection could be a limitation of the original system prompt the LLM started with, no?

Could a tighter operating range specified in the system prompt along with lower temperature bands which cause less output variability help?

Side note - I saw upthread that you were looking to rebrand “prompt injection”. I propose “behaviour induction” or “induced behaviour”


> These examples of an LLM refusing to return meat recipes after inducing vegan behaviour through prompt injection could be a limitation of the original system prompt the LLM started with, no?

That is a reasonable question to ask. It makes sense that trying to solve prompt injection would start with looking at the original system instructions. But the short answer is 'no', people have spent a lot of effort trying to harden system prompts, and the majority of evidence suggests that this is a universal problem, not just a problem with specific prompts.

> Could a tighter operating range specified in the system prompt along with lower temperature bands which cause less output variability help?

These are also good suggestions, but unfortunately the approach you describe hasn't yielded success. To expand on "a tighter operating range", it's often suggested that clearer contextual separation between prompts and data would solve the problem. But unfortunately with current LLM architecture, no one has demonstrated that it is possible to create that separation between data and instructions.

Similarly, while changing temperature can change which specific phrasing of attacks do and don't work, it doesn't seem to eliminate them, and lowering variability can have the side effect of making the attacks that aren't caught more consistent and reliable.

----

The other more fundamental problem here is that LLMs are used to interpret data, and interpreting data necessarily means understanding data. Phind fetches these search results because it wants the information within the search results to override the knowledge cut-off built into its static model. I don't think there's a good way to draw a consistent line between what the LLM should and shouldn't interpret within those articles, since it is intentional behavior that the data recontextualize the LLM's instructions and that it change the LLM's response.

In other words, part of the difficulty of separating data from instructions is that we very often don't want LLMs to statically parse data, in many cases we want them to interpret it. And it's the interpretation of that data (as opposed to mindless parsing) that makes the LLM vulnerable to some attacks.

So it's not clear to me that even if perfect contextual separation could be achieved using current architecture (which no one has demonstrated is possible) that this would completely solve the problem or would protect against other search-result poisoning attacks. Separating search context from system prompts wouldn't protect against my attack where I get Phind to refuse to quote certain sources, because all I did there was tell Phind in my search result that none of the other search results could be trusted or that that they should be summarized differently. I don't know how to block an attack like that without breaking Phind's ability to consume search results in a useful way.

But note that the above is still kind of a future problem -- the bigger immediate problem is that more careful system prompts and lower temperatures haven't worked as a defense in the fist place. I don't see much evidence that it is even possible with current LLM infrastructure to give any system prompt that can't be overridden later in the conversation. At the very least, I'm not aware of any public demonstration of an unhackable system prompt that hasn't ended up getting hacked.

----

> Side note - I saw upthread that you were looking to rebrand “prompt injection”. I propose “behaviour induction” or “induced behaviour”

I'm fine with anything that doesn't confuse people. I don't have strong opinions on it, I don't find "prompt injection" too confusing myself, it's "injecting" a system "prompt" into the middle of a conversation or dataset, and that injection can be performed by a non-user malicious 3rd-party. So I don't really mind any wording, I just can't deny that lots of people do get confused by the wording and seem to interpret prompt injection incorrectly in very similar ways. To me that suggests that something about the wording is throwing them off.

So if a lot of people start using any wording that doesn't have that problem, I'll use it regardless of how I personally feel about it. And if anyone wants to use different terminology for their own conversations, when talking with them I'll use whatever terminology they find clearest.


Thanks for the thoughtful and in-depth response. I see what you’re saying and don’t have any rebuttals/followups.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: