Prompt Engineering for Production LLM Applications

Why Prompt Engineering Still Matters

As models get smarter, the gap between a naively written prompt and a carefully engineered one has not shrunk — it has grown. More capable models follow instructions more precisely, which means poorly specified instructions produce more confidently wrong outputs. Production prompt engineering is about reducing variance, improving reliability, and making model behavior predictable across the full distribution of user inputs.

System Prompt Architecture

Think of the system prompt as a job description. It defines the model's role, constraints, output format, and fallback behavior. A well-structured system prompt covers four sections: role definition, behavioral rules, output format specification, and examples.

SYSTEM_PROMPT = """
## Role
You are a contract review assistant for a law firm. You analyse commercial contracts
and identify clauses that deviate from standard terms.

## Rules
- Only analyse the contract text provided. Do not make up clauses that are not present.
- If a section is missing from the contract, note its absence explicitly.
- Flag any clause that limits liability to under $1M as HIGH RISK.
- Never provide legal advice. Flag items for attorney review.

## Output Format
Return a JSON object with this exact structure:
{
  "risk_level": "HIGH" | "MEDIUM" | "LOW",
  "flagged_clauses": [{ "clause": string, "risk": string, "reason": string }],
  "missing_sections": string[],
  "summary": string
}

## Example
[Include a worked example here for complex tasks]
"""

Few-Shot Prompting for Consistent Output

For classification tasks or structured extractions, include 2–4 worked examples in the prompt. Examples are the most reliable way to communicate the expected output format, edge case handling, and tone — especially when the task is difficult to specify in pure natural language.

messages = [
  {"role": "user", "content": "Classify: 'The app crashed when I tried to upload a file'"},
  {"role": "assistant", "content": '{"category": "bug", "severity": "high", "component": "file-upload"}'},
  {"role": "user", "content": "Classify: 'Can you add dark mode to the dashboard?'"},
  {"role": "assistant", "content": '{"category": "feature-request", "severity": "low", "component": "ui"}'},
  {"role": "user", "content": f"Classify: '{user_input}'"},
]

Place examples after the system prompt and before the actual user input. Use real examples from your training data rather than synthetic ones — real examples capture edge cases that synthetic ones miss.

Chain-of-Thought for Reasoning Tasks

For tasks that require multi-step reasoning — math, code analysis, complex decisions — adding "think step by step" or using explicit scratchpad instructions dramatically improves accuracy. The model's internal reasoning process is more reliable when it is externalised into the output.

SYSTEM = """Before giving your final answer, think through the problem step by step
inside  tags. Then provide your answer inside  tags.

Example:

The user is asking about X. First I need to consider Y, then Z...

Final answer here"""

With Claude's extended thinking feature, this reasoning is handled natively by the model. With other models, explicit scratchpad instructions achieve a similar effect.

Structured Output Reliability

When you need JSON output, use the model's native structured output feature rather than asking for JSON in the prompt. OpenAI's response_format: { type: "json_object" } and Anthropic's tool use with JSON schema both enforce valid JSON at the API level, eliminating parse errors.

# OpenAI structured output
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_schema", "json_schema": {
        "name": "contract_analysis",
        "schema": {
            "type": "object",
            "properties": {
                "risk_level": {"type": "string", "enum": ["HIGH", "MEDIUM", "LOW"]},
                "flagged_clauses": {"type": "array", "items": {...}},
            },
            "required": ["risk_level", "flagged_clauses"]
        }
    }},
    messages=messages
)

Evaluation-Driven Iteration

Never ship a prompt change without evaluating it against a test set. Build a golden dataset of 50–100 input/output pairs that cover your important cases and edge cases. Measure accuracy, hallucination rate, and output format compliance before and after every prompt change.

Track prompt versions in git — every change is a commit with a clear message
Run evals in CI — block merges if accuracy drops below threshold
Log model inputs and outputs in production — real failure cases become test cases
A/B test major prompt changes with a percentage of real traffic before full rollout

The best prompt engineers are not the most creative writers — they are the most systematic testers. Treat prompt development like any other software development process: measure, iterate, and validate.

Prompt Engineering for Production LLM Applications

Why Prompt Engineering Still Matters

System Prompt Architecture

Few-Shot Prompting for Consistent Output

Chain-of-Thought for Reasoning Tasks

Structured Output Reliability

Evaluation-Driven Iteration

Jaspi.io — AI Hiring Platform

How to Build a Production RAG System with LangChain and OpenAI

Building Multi-Agent AI Systems with LangGraph

Want to Build This for Your Team?