LLM Observability in Production with Langfuse

The Observability Gap in LLM Applications

Standard APM tools — Datadog, New Relic, Sentry — tell you when requests fail and how long they take. They cannot tell you why an LLM gave a bad answer, which step in a multi-agent chain caused a quality regression, or how your cost per user request has changed after a prompt update. Langfuse fills this gap: it is purpose-built tracing for LLM applications, open-source, and self-hostable.

This guide covers instrumenting a production LLM application from basic call tracing through evaluation pipelines and cost dashboards.

Basic Setup and Tracing

npm install langfuse

# .env
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASEURL=https://cloud.langfuse.com  # or your self-hosted URL (the JS/TS SDK reads LANGFUSE_BASEURL; the Python SDK reads LANGFUSE_HOST)

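import { Langfuse } from 'langfuse'

// Reads the credentials and base URL from the environment variables above.
// retrieveDocuments, buildMessages, and callClaude below are your own application helpers.
const langfuse = new Langfuse()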

async function answerQuestion(question: string, userId: string) {
  const trace = langfuse.trace({
    name: 'answer-question',
    userId,
    input: { question },
    metadata: { feature: 'support-chat' },
  })

  // Trace the retrieval step
  const retrievalSpan = trace.span({ name: 'retrieval', input: { question } })
  const docs = await retrieveDocuments(question)
  retrievalSpan.end({ output: { docCount: docs.length } })

  // Trace the generation step
  const generation = trace.generation({
    name: 'generate-answer',
    model: 'claude-3-5-sonnet-20241022',
    input: buildMessages(question, docs),
  })

  const response = await callClaude(question, docs)

  generation.end({
    output: response.content,
    usage: {
      input: response.usage.input_tokens,
      output: response.usage.output_tokens,
    },
  })

  trace.update({ output: { answer: response.content } })
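  // Flush before returning in serverless or other short-lived processes;
  // a long-running server can rely on the SDK's background batching instead.
  await langfuse.flushAsync()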

  return response.content
}
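One gap in the function above: if retrieveDocuments throws, retrievalSpan.end is never called and the span is left open. Below is a minimal sketch of a wrapper that ends the span on both success and failure. The names withSpan, SpanLike, and TraceLike are hypothetical and not part of the Langfuse SDK; the interfaces are stand-ins that cover only the calls used in this guide, so the sketch runs without the SDK installed. The level and statusMessage fields are assumed to match what your SDK version accepts on span.end, so check them before relying on this.

```typescript
// Minimal structural types covering only the calls used in this guide.
// They are stand-ins, not Langfuse SDK types.
interface SpanLike {
  end(body?: { output?: unknown; level?: string; statusMessage?: string }): unknown
}
interface TraceLike {
  span(body: { name: string; input?: unknown }): SpanLike
}

// Hypothetical helper: runs `fn` inside a span and ends the span whether
// `fn` resolves or throws, then returns the result or rethrows the error.
async function withSpan<T>(
  trace: TraceLike,
  name: string,
  input: unknown,
  fn: () => Promise<T>,
  summarize: (result: T) => unknown = (result) => result,
): Promise<T> {
  const span = trace.span({ name, input })
  try {
    const result = await fn()
    // Record a summary rather than the full result to keep traces small.
    span.end({ output: summarize(result) })
    return result
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    span.end({ level: 'ERROR', statusMessage: message })
    throw err
  }
}
```

With a helper like this, the retrieval step could be written as withSpan(trace, 'retrieval', { question }, () => retrieveDocuments(question), (docs) => ({ docCount: docs.length })).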

Tracing LangChain Calls

If you use LangChain, Langfuse provides a callback handler that instruments all LLM calls, chain executions, and tool invocations automatically — no manual span creation needed.

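import { CallbackHandler } from 'langfuse-langchain'
import { RunnableSequence } from '@langchain/core/runnables'

// prompt, model, outputParser, currentUser, sessionId, and question are
// assumed to be defined elsewhere in your application.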

const handler = new CallbackHandler({
  userId: currentUser.id,
  sessionId: sessionId,
  metadata: { environment: 'production' },
})

const chain = RunnableSequence.from([prompt, model, outputParser])
const result = await chain.invoke(
  { question },
  { callbacks: [handler] }
)

Scoring and Evaluation

Traces are useful for debugging individual failures. Scores make quality trends visible across thousands of traces. Add scores from user feedback, automated evals, or human review.

// User thumbs-up / thumbs-down feedback (1 = positive, 0 = negative)
async function submitFeedback(traceId: string, isPositive: boolean) {
  await langfuse.score({
    traceId,
    name: 'user-feedback',
    value: isPositive ? 1 : 0,
    comment: 'User explicit feedback',
  })
}

// Automated faithfulness scoring
async function scoreTrace(traceId: string, answer: string, context: string) {
  const faithfulness = await evaluateFaithfulness(answer, context)
  await langfuse.score({
    traceId,
    name: 'faithfulness',
    value: faithfulness,  // 0-1
  })
}

Track scores over time by prompt version and model. When a score drops after a deployment, Langfuse's comparison view shows exactly which traces degraded — giving you the inputs that revealed the regression.
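The same comparison can be approximated offline once scores are exported. The sketch below is a minimal illustration, assuming each exported score carries the prompt version it was produced under. ScoreRecord, promptVersion, meanScoreByVersion, and lowScoringTraces are hypothetical names, not Langfuse API objects.

```typescript
// Hypothetical shape for an exported score; not a Langfuse SDK type.
interface ScoreRecord {
  traceId: string
  name: string          // e.g. 'faithfulness'
  value: number         // 0-1
  promptVersion: string // e.g. 'v1'
}

// Mean value of one named score, grouped by prompt version.
function meanScoreByVersion(scores: ScoreRecord[], name: string): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {}
  for (const s of scores) {
    if (s.name !== name) continue
    const acc = sums[s.promptVersion] ?? { total: 0, count: 0 }
    acc.total += s.value
    acc.count += 1
    sums[s.promptVersion] = acc
  }
  const means: Record<string, number> = {}
  for (const version of Object.keys(sums)) {
    means[version] = sums[version].total / sums[version].count
  }
  return means
}

// IDs of traces on `version` that scored below `threshold`:
// these are the inputs worth inspecting first after a drop.
function lowScoringTraces(
  scores: ScoreRecord[],
  name: string,
  version: string,
  threshold: number,
): string[] {
  return scores
    .filter((s) => s.name === name && s.promptVersion === version && s.value < threshold)
    .map((s) => s.traceId)
}
```

If the mean for the newest version is clearly below the previous one, lowScoringTraces gives a short list of trace IDs to open in the Langfuse UI.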

Cost Tracking and Dashboards

Langfuse calculates token costs per trace using model-specific pricing. The built-in dashboard shows cost by user, by feature, by model, and over time — without any additional instrumentation beyond passing usage in your generation calls.
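The arithmetic behind a per-generation cost is simple, and it helps to see it when sanity-checking dashboard numbers. This is a sketch of the calculation only, not Langfuse's implementation. The prices in PRICES_PER_MILLION are placeholder values for illustration rather than real vendor pricing, and costForUsage is a hypothetical function name.

```typescript
interface Usage { input: number; output: number }
interface ModelPrice { inputPerMillion: number; outputPerMillion: number }

// Placeholder prices in USD per million tokens. Assumed values, not real pricing.
const PRICES_PER_MILLION: Record<string, ModelPrice> = {
  'example-model': { inputPerMillion: 3, outputPerMillion: 15 },
}

// Cost in USD for one generation: tokens times the per-token price,
// computed separately for input and output.
function costForUsage(model: string, usage: Usage): number {
  const price = PRICES_PER_MILLION[model]
  if (!price) throw new Error(`No price configured for model: ${model}`)
  return (
    (usage.input * price.inputPerMillion + usage.output * price.outputPerMillion) / 1_000_000
  )
}

// 2,000 input tokens and 500 output tokens at the placeholder prices:
// (2000 * 3 + 500 * 15) / 1,000,000 = 0.0135 USD
const exampleCost = costForUsage('example-model', { input: 2000, output: 500 })
```

Because output tokens are usually priced higher than input tokens, a prompt change that lengthens answers can raise cost more than one that lengthens the prompt itself.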

Set up a cost per user per day dashboard to spot abusive usage early
Track cost per successful trace as your key efficiency metric — cheaper is only good if quality holds
Compare cost before and after prompt changes — more verbose prompts that improve quality may still cost more overall
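The first two practices above reduce to small aggregations over trace data. The sketch below assumes you have exported per-trace cost records; TraceCost, costPerUserPerDay, and costPerSuccessfulTrace are hypothetical names, and the boolean success field is an assumption that you would replace with your own quality threshold (for example, a minimum faithfulness score).

```typescript
// Hypothetical shape for an exported trace; not a Langfuse SDK type.
interface TraceCost {
  userId: string
  timestamp: string // ISO 8601 in UTC, e.g. '2024-11-02T10:15:00Z'
  costUsd: number
  success: boolean
}

// Total cost keyed by "userId|YYYY-MM-DD". The day is read from the first
// 10 characters of the timestamp, so timestamps must already be in UTC.
function costPerUserPerDay(traces: TraceCost[]): Record<string, number> {
  const totals: Record<string, number> = {}
  for (const t of traces) {
    const key = `${t.userId}|${t.timestamp.slice(0, 10)}`
    totals[key] = (totals[key] ?? 0) + t.costUsd
  }
  return totals
}

// Total spend divided by the number of successful traces. Failed traces
// still cost money, so their cost stays in the numerator.
function costPerSuccessfulTrace(traces: TraceCost[]): number {
  const successes = traces.filter((t) => t.success).length
  if (successes === 0) return Infinity
  const total = traces.reduce((sum, t) => sum + t.costUsd, 0)
  return total / successes
}
```

Running costPerSuccessfulTrace on the traces before and after a prompt change gives a single number that moves when either cost or quality shifts, which is why it works as an efficiency metric.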

Self-Hosting Langfuse

For applications where production prompts and user inputs are sensitive, self-host Langfuse. The Docker Compose setup runs in under 10 minutes on any VM with 4GB RAM.

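# docker-compose.yml (from langfuse/langfuse)
# This Postgres-only layout matches the Langfuse v2 image. Langfuse v3 also
# needs ClickHouse, Redis, and S3-compatible storage, so pin the major version.
services:
  langfuse-server:
    image: langfuse/langfuse:2
    depends_on: [db]
    ports: ["3000:3000"]
    environment:
      DATABASE_URL: postgresql://postgres:${POSTGRES_PASSWORD}@db:5432/langfuse
      NEXTAUTH_URL: http://localhost:3000
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      SALT: ${SALT}

  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_db:/var/lib/postgresql/data  # persist data across restarts

volumes:
  langfuse_db: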

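Update your SDK configuration to point to your self-hosted instance by setting the base URL (LANGFUSE_BASEURL in the JS/TS SDK, LANGFUSE_HOST in the Python SDK) to https://langfuse.yourdomain.com. Traces, prompts, and user inputs then stay within your infrastructure.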
