AI & LLM | 9 min read | October 20, 2024

LLM Observability in Production with Langfuse

How to implement full LLM observability using Langfuse. Covers tracing multi-step chains, tracking token costs, evaluating output quality, and debugging agent failures in production.

Tags: Langfuse, Observability, LLM, Monitoring, MLOps

Azam

DevOps & AI Consultant

The Observability Gap in LLM Applications

Standard APM tools — Datadog, New Relic, Sentry — tell you when requests fail and how long they take. They cannot tell you why an LLM gave a bad answer, which step in a multi-agent chain caused a quality regression, or how your cost per user request has changed after a prompt update. Langfuse fills this gap: it is purpose-built tracing for LLM applications, open-source, and self-hostable.

This guide covers instrumenting a production LLM application from basic call tracing through evaluation pipelines and cost dashboards.

Basic Setup and Tracing

npm install langfuse

# .env
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
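# The JS/TS SDK reads LANGFUSE_BASEURL (LANGFUSE_HOST is the Python SDK's variable)
LANGFUSE_BASEURL=https://cloud.langfuse.com  # or your self-hosted URL

import { Langfuse } from 'langfuse'

// Reads the keys and base URL from the environment variables above
const langfuse = new Langfuse()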

async function answerQuestion(question: string, userId: string) {
  const trace = langfuse.trace({
    name: 'answer-question',
    userId,
    input: { question },
    metadata: { feature: 'support-chat' },
  })

  // Trace the retrieval step
  const retrievalSpan = trace.span({ name: 'retrieval', input: { question } })
  const docs = await retrieveDocuments(question)
  retrievalSpan.end({ output: { docCount: docs.length } })

  // Trace the generation step
  const generation = trace.generation({
    name: 'generate-answer',
    model: 'claude-3-5-sonnet-20241022',
    input: buildMessages(question, docs),
  })

  const response = await callClaude(question, docs)

  generation.end({
    output: response.content,
    usage: {
      input: response.usage.input_tokens,
      output: response.usage.output_tokens,
    },
  })

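  trace.update({ output: { answer: response.content } })
  // Flush before returning in short-lived runtimes (serverless, scripts);
  // long-running servers can rely on the SDK's background batching instead.
  await langfuse.flushAsync()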

  return response.content
}
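
One gap in the function above: if retrieveDocuments throws, the retrieval span is never ended and the failure is not recorded on the trace. A minimal sketch of a wrapper that closes the span on both paths follows. The withSpan helper and the SpanLike interface are hypothetical names introduced here, not part of the Langfuse SDK, and the sketch assumes the SDK's span.end() accepts level and statusMessage fields.

```typescript
// Minimal shape of the object returned by trace.span(). Declared locally so
// the sketch is self-contained; Langfuse span objects are assumed to fit it.
interface SpanLike {
  end(body?: { output?: unknown; level?: 'DEFAULT' | 'ERROR'; statusMessage?: string }): unknown
}

// Hypothetical helper: runs `work`, then ends the span on the success path
// with a summarized output, or on the failure path with the error message.
async function withSpan<T>(
  span: SpanLike,
  work: () => Promise<T>,
  summarize: (result: T) => unknown = (result) => result,
): Promise<T> {
  try {
    const result = await work()
    span.end({ output: summarize(result) })
    return result
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    span.end({ level: 'ERROR', statusMessage: message })
    throw err // rethrow so the caller still sees the failure
  }
}

// Usage inside answerQuestion (retrieveDocuments is the article's helper):
// const docs = await withSpan(
//   trace.span({ name: 'retrieval', input: { question } }),
//   () => retrieveDocuments(question),
//   (d) => ({ docCount: d.length }),
// )
```

The summarize callback keeps large payloads out of the trace: the span stores only the document count, as in the original example, while the caller still receives the full result.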

Tracing LangChain Calls

If you use LangChain, Langfuse provides a callback handler that instruments all LLM calls, chain executions, and tool invocations automatically — no manual span creation needed.

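import { CallbackHandler } from 'langfuse-langchain'
import { RunnableSequence } from '@langchain/core/runnables'

// Keys and base URL are read from the LANGFUSE_* environment variables
const handler = new CallbackHandler({
  userId: currentUser.id,
  sessionId,
  metadata: { environment: 'production' },
})

// prompt, model, and outputParser are your existing LangChain components
const chain = RunnableSequence.from([prompt, model, outputParser])
const result = await chain.invoke(
  { question },
  { callbacks: [handler] }
)

// Send buffered events before a short-lived process exits
await handler.flushAsync()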

Scoring and Evaluation

Traces are useful for debugging individual failures. Scores make quality trends visible across thousands of traces. Add scores from user feedback, automated evals, or human review.

// User thumbs-up / thumbs-down feedback (1 = positive, 0 = negative)
async function submitFeedback(traceId: string, isPositive: boolean) {
  await langfuse.score({
    traceId,
    name: 'user-feedback',
    value: isPositive ? 1 : 0,
    comment: 'User explicit feedback',
  })
}

// Automated faithfulness scoring
async function scoreTrace(traceId: string, answer: string, context: string) {
  const faithfulness = await evaluateFaithfulness(answer, context)
  await langfuse.score({
    traceId,
    name: 'faithfulness',
    value: faithfulness,  // 0-1
  })
}
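
The evaluateFaithfulness helper is not defined in the snippet above. In practice it is often an LLM-as-judge call; the sketch below swaps in a much cruder lexical-overlap heuristic, purely to show the shape of a function that returns a 0-1 value. The names contentWords and lexicalFaithfulness are hypothetical, and word overlap is only a rough proxy for faithfulness.

```typescript
// Lowercase the text, split on non-alphanumeric characters, and keep only
// words longer than 3 characters to drop most filler words.
function contentWords(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 3)
}

// Stand-in for evaluateFaithfulness: the share of the answer's content words
// that also appear in the retrieved context. Returns a value between 0 and 1.
function lexicalFaithfulness(answer: string, context: string): number {
  const contextWords = new Set(contentWords(context))
  const answerWords = contentWords(answer)
  if (answerWords.length === 0) return 0 // nothing to support
  const supported = answerWords.filter((w) => contextWords.has(w)).length
  return supported / answerWords.length
}
```

A heuristic like this is cheap enough to run on every trace, which makes it a reasonable first filter before sending a sample of low-scoring traces to a more expensive evaluator.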

Track scores over time by prompt version and model. When a score drops after a deployment, filter traces by that score to see which ones degraded; those traces give you the exact inputs that exposed the regression.
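
To make the comparison concrete, here is a minimal sketch of averaging one score per prompt version. It assumes you have exported score records into the hypothetical ScoreRecord shape below; the field names are illustrative and not a Langfuse export format.

```typescript
// Hypothetical record shape: one row per score, tagged with the prompt
// version that produced the scored trace.
interface ScoreRecord {
  promptVersion: string
  name: string
  value: number
}

// Mean value of one named score, grouped by prompt version.
function meanScoreByVersion(records: ScoreRecord[], scoreName: string): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>()
  for (const record of records) {
    if (record.name !== scoreName) continue // ignore other score types
    const entry = totals.get(record.promptVersion) ?? { sum: 0, count: 0 }
    entry.sum += record.value
    entry.count += 1
    totals.set(record.promptVersion, entry)
  }
  const means = new Map<string, number>()
  totals.forEach(({ sum, count }, version) => means.set(version, sum / count))
  return means
}
```

Comparing the mean for the newly deployed version against the previous one gives a simple regression signal that can run after each deployment.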

Cost Tracking and Dashboards

Langfuse calculates token costs per trace using model-specific pricing. The built-in dashboard shows cost by user, by feature, by model, and over time, with no instrumentation beyond passing usage in your generation calls. For models Langfuse does not recognize, add a custom model definition with its prices so that costs can be computed.

  • Set up a cost per user per day dashboard to spot abusive usage early
  • Track cost per successful trace as your key efficiency metric: cheaper is only good if quality holds
  • Compare cost before and after prompt changes: a more verbose prompt can improve quality and still raise total cost, so weigh both
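
As a rough illustration of the second metric, the sketch below computes cost per successful trace from token usage. It takes one possible reading of the metric: total spend divided by the number of successful traces, so failed traces still count toward the cost. The interfaces and function names are hypothetical, and prices are passed in by the caller rather than hard-coded, since real per-model prices change.

```typescript
interface Usage {
  input: number  // input tokens
  output: number // output tokens
}

// Prices in USD per million tokens, supplied by the caller
interface Pricing {
  inputPerMTok: number
  outputPerMTok: number
}

interface TraceRecord {
  usage: Usage
  success: boolean
}

// Cost of a single trace from its token usage
function traceCostUsd(usage: Usage, pricing: Pricing): number {
  return (usage.input * pricing.inputPerMTok + usage.output * pricing.outputPerMTok) / 1e6
}

// Total spend divided by successful traces. Failed traces still cost money,
// so a rising failure rate pushes this number up even if prices are flat.
function costPerSuccessfulTrace(traces: TraceRecord[], pricing: Pricing): number {
  const successes = traces.filter((t) => t.success).length
  if (successes === 0) return Infinity
  const total = traces.reduce((sum, t) => sum + traceCostUsd(t.usage, pricing), 0)
  return total / successes
}
```

What counts as a successful trace is up to you; a natural choice is a trace whose user-feedback or faithfulness score clears a threshold.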

Self-Hosting Langfuse

For applications where production prompts and user inputs are sensitive, self-host Langfuse. The Docker Compose setup can be up and running in under 10 minutes on a VM with 4 GB of RAM.

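# docker-compose.yml, adapted from langfuse/langfuse (v2, Postgres-only layout)
services:
  langfuse-server:
    # Pinned to v2: Langfuse v3 also requires ClickHouse, Redis, and blob storage
    image: langfuse/langfuse:2
    depends_on: [db]
    ports: ["3000:3000"]
    environment:
      DATABASE_URL: postgresql://postgres:password@db:5432/langfuse
      NEXTAUTH_URL: http://localhost:3000  # public URL of this instance
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      SALT: ${SALT}

  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: password  # change for anything beyond local testing
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_db:/var/lib/postgresql/data  # persist data across restarts

volumes:
  langfuse_db: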

Update your SDK configuration to point to your self-hosted instance, for example LANGFUSE_BASEURL=https://langfuse.yourdomain.com for the JS/TS SDK. All trace data then stays within your infrastructure.
