Prompt Engineering

Prompt engineering is the craft of shaping an LLM's input so its output is reliable, accurate, and in the format you need. Modern models are powerful but suggestible — small changes in wording, ordering, or examples can swing accuracy by 20+ points on hard tasks. Think of a prompt as a tiny program written in natural language.

Why it matters

An LLM is a next-token predictor conditioned on everything in its context window. The prompt is the only lever you have at inference time (short of fine-tuning). Good prompting closes the gap between a model's raw capability and your task — often replacing the need for fine-tuning entirely.

The anatomy of a strong prompt

Most production prompts contain some or all of these blocks, usually in this order:

Role / persona — 'You are a senior tax accountant…'
Task — what the model must do, in one clear sentence
Context — background, retrieved documents, user data
Instructions / rules — do's and don'ts, edge cases, tone
Examples (few-shot) — 1–5 input → output demonstrations
Output format — JSON schema, markdown structure, length limits
Input — the actual user query, clearly delimited

Zero-shot vs few-shot

Zero-shot asks the model to perform a task with only an instruction. It works well for tasks the model has seen many times in pretraining (summarize, translate, classify common categories).

Few-shot provides 1–5 examples of the input/output pattern. It is often the single biggest accuracy lift you can get, especially for unusual formats, domain-specific labels, or stylistic consistency. Order matters — put the most representative example last.

Chain-of-thought (CoT)

Asking the model to 'think step by step' before answering dramatically improves math, logic, and multi-hop reasoning. Two flavors:

Zero-shot CoT — append 'Let's think step by step.' to the prompt.
Few-shot CoT — show worked examples where the reasoning is written out before the final answer.

For reasoning models (o1, o3, DeepSeek R1, Claude 'extended thinking', Gemini 'thinking'), CoT is built into the model — you should NOT add 'think step by step' instructions, and you should keep prompts shorter and more direct. Over-prompting actively hurts these models.

Self-consistency and majority voting

Sample the model N times at temperature > 0, then take the majority answer. Trades cost for reliability on problems with a single correct answer (math, classification, extraction). Often beats a single low-temperature call.

Structured output

When you need machine-readable output, force structure — do not parse free-form prose. Best to worst:

Native structured outputs / JSON mode — OpenAI, Anthropic, Gemini, and most OSS inference servers support a JSON schema you supply. The decoder is constrained, so the output is guaranteed to parse.
Tool / function calling — describe a function signature; the model returns typed arguments. Use this whenever the output will be passed to code.
Few-shot JSON examples — show 2–3 input → JSON pairs and ask for the same shape.
Regex / grammar-constrained decoding (vLLM, llama.cpp, Outlines) — for OSS deployments needing strict grammars.

Decoding parameters that matter

temperature — 0 for deterministic extraction/classification, 0.2–0.4 for grounded writing, 0.7–1.0 for creative work.
top_p (nucleus) — sample only from tokens whose cumulative probability ≤ p (typical: 0.9–0.95). Prefer adjusting temperature OR top_p, not both.
max_tokens — set an upper bound to control cost and avoid runaway generations.
stop sequences — end generation when the model emits a known delimiter (e.g. '</answer>').
seed — for reproducibility during evals (supported by OpenAI, Together, vLLM, etc.).

Practical patterns that consistently work

Delimit user input with XML tags or triple backticks to prevent prompt injection (e.g. `<user_input>{{q}}</user_input>`).
Put instructions before AND after long context — models attend most strongly to the start and end (the 'lost in the middle' effect).
Ask for the answer last — instructions → context → question → 'Answer:'.
Specify what NOT to do as well as what to do; negative constraints are not free, but they help.
Decompose hard tasks — break one big prompt into a pipeline of small, testable prompts (extract → analyze → format).
Give the model an escape hatch — 'If you are not sure, reply with NO_MATCH.' This sharply reduces hallucination.
Use system vs user messages correctly — durable rules go in the system message; turn-specific input in user messages.

Advanced techniques

ReAct — interleave Reasoning and Acting (tool calls) in a loop. Foundation of most modern agents.
Tree of Thoughts (ToT) — explore multiple reasoning branches, score them, keep the best.
Reflexion / self-critique — have the model critique and revise its own draft before returning a final answer.
Plan-and-solve — first write a plan, then execute each step. Helps multi-step problems.
Prompt chaining — output of prompt A becomes input to prompt B. Easier to debug than one mega-prompt.
Retrieval-augmented prompting (RAG) — inject relevant docs at query time; see the RAG topic for the full pipeline.

Prompt injection and safety

Anything in the context window can override your instructions if you are not careful. Defenses:

Treat all user-supplied text and retrieved documents as untrusted.
Use structured delimiters and reiterate critical rules after the untrusted block.
For agents, sandbox tool calls and require human confirmation for destructive actions.
Adversarial test with known jailbreak suites (PromptBench, Garak) before shipping.

Evaluating prompts

Prompts are code — version them and test them. A minimal eval loop:

Build a golden dataset of 20–200 real inputs with expected outputs (or rubrics).
Run the prompt at temperature 0 (or N samples averaged).
Score with exact match, regex, embeddings similarity, or LLM-as-judge.
Track results per prompt version, per model. Never ship a prompt change without running the eval.

Model-specific notes

GPT-4o / GPT-4.1 — responds well to system messages, structured outputs are first-class.
Claude (3.5/4) — loves XML tags for delimiting; pre-fill the assistant turn to control format ('Assistant: {').
Gemini — strong with very long contexts; put the task at the end after large documents.
Llama / Mistral / Qwen (open weights) — sensitive to chat template formatting; always use the tokenizer's built-in chat template.
Reasoning models (o-series, R1, thinking modes) — be terse, state the goal, do NOT add CoT instructions, and avoid few-shot for pure reasoning tasks.