Back to Blog
AI & LLM11 min readMarch 20, 2025

Prompt Engineering for Production LLM Applications

Advanced prompt engineering techniques for production systems: few-shot prompting, chain-of-thought, structured output, prompt versioning, and evaluation-driven iteration.

Prompt EngineeringLLMOpenAIClaudeAI
A

Azam

DevOps & AI Consultant

Why Prompt Engineering Still Matters

As models get smarter, the gap between a naively written prompt and a carefully engineered one has not shrunk — it has grown. More capable models follow instructions more precisely, which means poorly specified instructions produce more confidently wrong outputs. Production prompt engineering is about reducing variance, improving reliability, and making model behavior predictable across the full distribution of user inputs.

System Prompt Architecture

Think of the system prompt as a job description. It defines the model's role, constraints, output format, and fallback behavior. A well-structured system prompt covers four sections: role definition, behavioral rules, output format specification, and examples.

SYSTEM_PROMPT = """
## Role
You are a contract review assistant for a law firm. You analyse commercial contracts
and identify clauses that deviate from standard terms.

## Rules
- Only analyse the contract text provided. Do not make up clauses that are not present.
- If a section is missing from the contract, note its absence explicitly.
- Flag any clause that limits liability to under $1M as HIGH RISK.
- Never provide legal advice. Flag items for attorney review.

## Output Format
Return a JSON object with this exact structure:
{
  "risk_level": "HIGH" | "MEDIUM" | "LOW",
  "flagged_clauses": [{ "clause": string, "risk": string, "reason": string }],
  "missing_sections": string[],
  "summary": string
}

## Example
[Include a worked example here for complex tasks]
"""

Few-Shot Prompting for Consistent Output

For classification tasks or structured extractions, include 2–4 worked examples in the prompt. Examples are the most reliable way to communicate the expected output format, edge case handling, and tone — especially when the task is difficult to specify in pure natural language.

messages = [
  {"role": "user", "content": "Classify: 'The app crashed when I tried to upload a file'"},
  {"role": "assistant", "content": '{"category": "bug", "severity": "high", "component": "file-upload"}'},
  {"role": "user", "content": "Classify: 'Can you add dark mode to the dashboard?'"},
  {"role": "assistant", "content": '{"category": "feature-request", "severity": "low", "component": "ui"}'},
  {"role": "user", "content": f"Classify: '{user_input}'"},
]

Place examples after the system prompt and before the actual user input. Use real examples from your training data rather than synthetic ones — real examples capture edge cases that synthetic ones miss.

Chain-of-Thought for Reasoning Tasks

For tasks that require multi-step reasoning — math, code analysis, complex decisions — adding "think step by step" or using explicit scratchpad instructions dramatically improves accuracy. The model's internal reasoning process is more reliable when it is externalised into the output.

SYSTEM = """Before giving your final answer, think through the problem step by step
inside  tags. Then provide your answer inside  tags.

Example:

The user is asking about X. First I need to consider Y, then Z...

Final answer here"""

With Claude's extended thinking feature, this reasoning is handled natively by the model. With other models, explicit scratchpad instructions achieve a similar effect.

Structured Output Reliability

When you need JSON output, use the model's native structured output feature rather than asking for JSON in the prompt. OpenAI's response_format: { type: "json_object" } and Anthropic's tool use with JSON schema both enforce valid JSON at the API level, eliminating parse errors.

# OpenAI structured output
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_schema", "json_schema": {
        "name": "contract_analysis",
        "schema": {
            "type": "object",
            "properties": {
                "risk_level": {"type": "string", "enum": ["HIGH", "MEDIUM", "LOW"]},
                "flagged_clauses": {"type": "array", "items": {...}},
            },
            "required": ["risk_level", "flagged_clauses"]
        }
    }},
    messages=messages
)

Evaluation-Driven Iteration

Never ship a prompt change without evaluating it against a test set. Build a golden dataset of 50–100 input/output pairs that cover your important cases and edge cases. Measure accuracy, hallucination rate, and output format compliance before and after every prompt change.

  • Track prompt versions in git — every change is a commit with a clear message
  • Run evals in CI — block merges if accuracy drops below threshold
  • Log model inputs and outputs in production — real failure cases become test cases
  • A/B test major prompt changes with a percentage of real traffic before full rollout

The best prompt engineers are not the most creative writers — they are the most systematic testers. Treat prompt development like any other software development process: measure, iterate, and validate.

Want to Build This for Your Team?

I help teams implement the patterns and architectures described in these articles. Let's talk about your project.

Book a Free Call