Prompt Injection & Safety

Prompt injection is adversarial text that tries to hijack the AI's instructions, either typed directly by a user or hidden inside content the model reads. Untrusted text in context is the core risk.

Learn Prompt Injection & Safety in our free Prompt Engineering course — a beginner-friendly interactive lesson with worked examples, a practice exercise and…

Part of the free Prompt Engineering course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

This lesson covers direct and indirect injection, jailbreaks, and the defenses that keep AI systems safe, plus the difference between a refusal and a safe completion.

Injection tries to make the model follow attacker text. It can come straight from a user, or hide inside content the AI reads:

The danger is that the model may not distinguish your trusted instructions from text that merely sits in the context. Any user input or retrieved data could be read as a command.

No single fix is enough. Combine several layers:

Refusal vs safe completion: a refusal declines a request entirely; a safe completion still helps but stays within safe bounds. Aim for safe, helpful responses where possible; reserve refusals for genuinely unsafe asks.

⏱ Test Yourself — Timed Quiz

10 quick questions, 12 seconds each. Instant feedback — beat the clock!

Practice quiz

What is prompt injection?

  • When adversarial text in the input tries to hijack the AI's instructions
  • A way to speed up the AI
  • A type of font
  • A backup method

Answer: When adversarial text in the input tries to hijack the AI's instructions. Prompt injection is malicious input that tries to override the intended instructions.

Direct prompt injection is when…

  • You delete a file
  • The screen dims
  • A user types a malicious instruction straight into the chat
  • The model crashes

Answer: A user types a malicious instruction straight into the chat. Direct injection comes straight from the user's own input.

Indirect prompt injection hides malicious instructions in…

  • A password
  • Untrusted content the AI reads, like a web page or document
  • The system clock
  • The keyboard

Answer: Untrusted content the AI reads, like a web page or document. Indirect injection is embedded in content the AI processes, such as a fetched page.

A 'jailbreak' attempts to…

  • Change the font
  • Make the AI faster
  • Save a file
  • Trick the AI into ignoring its safety rules and guardrails

Answer: Trick the AI into ignoring its safety rules and guardrails. A jailbreak tries to bypass the model's safety constraints.

Why is untrusted text in context dangerous?

  • The model may treat injected text as instructions to follow
  • It is never dangerous
  • It makes text bigger
  • It deletes the prompt

Answer: The model may treat injected text as instructions to follow. If the model treats retrieved or user content as commands, attackers can steer it.

A core defense is to…

  • Remove all guardrails
  • Separate trusted instructions from untrusted data
  • Run more loops
  • Trust everything

Answer: Separate trusted instructions from untrusted data. Keeping instructions and data apart limits what injected text can do.

'Least privilege' for tools means…

  • Use the biggest model
  • Give the AI every possible permission
  • Disable the AI
  • Grant only the minimal tool access the task needs

Answer: Grant only the minimal tool access the task needs. Least privilege limits the blast radius if the AI is manipulated.

A safe rule for retrieved or user content is to…

  • Delete it
  • Forward it to everyone
  • Never treat it as instructions, only as data to consider
  • Always obey it as instructions

Answer: Never treat it as instructions, only as data to consider. Treat external content as data, not commands the AI must follow.

Input and output filtering helps by…

  • Changing the language
  • Catching dangerous content before or after the model processes it
  • Hiding the answer
  • Slowing the AI down for fun

Answer: Catching dangerous content before or after the model processes it. Filtering screens inputs and outputs for risky content.

What is the difference between a refusal and a safe completion?

  • A refusal declines entirely; a safe completion helps within safe bounds
  • A refusal is faster only
  • Safe completion deletes data
  • They are identical

Answer: A refusal declines entirely; a safe completion helps within safe bounds. A refusal says no outright; a safe completion provides a helpful, bounded response.