The Observability Gap in LLM Applications
Standard APM tools — Datadog, New Relic, Sentry — tell you when requests fail and how long they take. They cannot tell you why an LLM gave a bad answer, which step in a multi-agent chain caused a quality regression, or how your cost per user request has changed after a prompt update. Langfuse fills this gap: it is purpose-built tracing for LLM applications, open-source, and self-hostable.
This guide covers instrumenting a production LLM application from basic call tracing through evaluation pipelines and cost dashboards.
Basic Setup and Tracing
npm install langfuse
# .env
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASEURL=https://cloud.langfuse.com # or your self-hosted URL (the JS SDK reads LANGFUSE_BASEURL; LANGFUSE_HOST is the Python SDK's name)
import { Langfuse } from 'langfuse'

const langfuse = new Langfuse()

async function answerQuestion(question: string, userId: string) {
  const trace = langfuse.trace({
    name: 'answer-question',
    userId,
    input: { question },
    metadata: { feature: 'support-chat' },
  })

  // Trace the retrieval step
  const retrievalSpan = trace.span({ name: 'retrieval', input: { question } })
  const docs = await retrieveDocuments(question)
  retrievalSpan.end({ output: { docCount: docs.length } })

  // Trace the generation step
  const generation = trace.generation({
    name: 'generate-answer',
    model: 'claude-3-5-sonnet-20241022',
    input: buildMessages(question, docs),
  })
  const response = await callClaude(question, docs)
  generation.end({
    output: response.content,
    usage: {
      input: response.usage.input_tokens,
      output: response.usage.output_tokens,
    },
  })

  trace.update({ output: { answer: response.content } })
  await langfuse.flushAsync()
  return response.content
}
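The function above closes each span only on the success path: if `retrieveDocuments` or `callClaude` throws, the span is never ended. One way to guard against that is a small wrapper, sketched here against a minimal structural interface rather than the SDK's own types. The names `withSpan` and `SpanLike` are hypothetical, and the `level` / `statusMessage` fields are assumed to match what your SDK version accepts on `end()`.

```typescript
// Minimal shape of the span object this sketch relies on (an assumption,
// not the SDK's full type): anything with an end() that accepts a body.
interface SpanLike {
  end(body?: Record<string, unknown>): unknown
}

// Hypothetical helper: run `work`, then close the span exactly once,
// recording either a summary of the result or the error message.
async function withSpan<T>(
  span: SpanLike,
  work: () => Promise<T>,
  summarize: (result: T) => unknown = (result) => result,
): Promise<T> {
  try {
    const result = await work()
    span.end({ output: summarize(result) })
    return result
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    // Assumed field names; verify against your SDK version.
    span.end({ level: 'ERROR', statusMessage: message })
    throw err // let the caller decide how to handle the failure
  }
}
```

With this in place, the retrieval step could be written as `await withSpan(trace.span({ name: 'retrieval', input: { question } }), () => retrieveDocuments(question), (docs) => ({ docCount: docs.length }))`, and a failed retrieval would still show up as a closed, error-marked span.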
Tracing LangChain Calls
If you use LangChain, Langfuse provides a callback handler that instruments all LLM calls, chain executions, and tool invocations automatically — no manual span creation needed.
import { CallbackHandler } from 'langfuse-langchain'
import { RunnableSequence } from '@langchain/core/runnables'

// currentUser, sessionId, prompt, model, and outputParser are defined elsewhere in your app
const handler = new CallbackHandler({
  userId: currentUser.id,
  sessionId,
  metadata: { environment: 'production' },
})

const chain = RunnableSequence.from([prompt, model, outputParser])
const result = await chain.invoke(
  { question },
  { callbacks: [handler] }
)
Scoring and Evaluation
Traces are useful for debugging individual failures. Scores make quality trends visible across thousands of traces. Add scores from user feedback, automated evals, or human review.
// User thumbs-up / thumbs-down feedback
async function submitFeedback(traceId: string, isPositive: boolean) {
  await langfuse.score({
    traceId,
    name: 'user-feedback',
    value: isPositive ? 1 : 0,
    comment: 'User explicit feedback',
  })
}

// Automated faithfulness scoring
async function scoreTrace(traceId: string, answer: string, context: string) {
  const faithfulness = await evaluateFaithfulness(answer, context)
  await langfuse.score({
    traceId,
    name: 'faithfulness',
    value: faithfulness, // 0-1
  })
}
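`evaluateFaithfulness` is not defined in the snippet above. In practice it is often an LLM-as-judge call; the sketch below is a deliberately crude lexical stand-in, included only to show the contract the scoring code expects: answer and context in, a number between 0 and 1 out. The name `lexicalFaithfulness` and the overlap heuristic are assumptions for illustration, not a Langfuse feature.

```typescript
// Lowercase, split on non-alphanumerics, and drop very short tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 2)
}

// Crude stand-in for a faithfulness evaluator: the fraction of answer
// tokens that also occur somewhere in the retrieved context.
function lexicalFaithfulness(answer: string, context: string): number {
  const answerTokens = tokenize(answer)
  if (answerTokens.length === 0) return 0 // nothing to support
  const contextTokens = new Set(tokenize(context))
  const supported = answerTokens.filter((t) => contextTokens.has(t)).length
  return supported / answerTokens.length // always within [0, 1]
}
```

Because `await` on a plain number is harmless, a function like this can stand in for `evaluateFaithfulness` while the pipeline is being wired up, then be swapped for a stronger evaluator without changing the scoring call.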
Track scores over time by prompt version and model. When a score drops after a deployment, Langfuse's comparison view shows exactly which traces degraded — giving you the inputs that revealed the regression.
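The same before-and-after comparison can be reproduced offline on exported score data. The sketch below assumes records shaped as `{ promptVersion, name, value }`; that shape and the function name `meanScoreByVersion` are hypothetical, not Langfuse's export schema.

```typescript
// Hypothetical export shape: one row per score attached to a trace.
interface ScoreRecord {
  promptVersion: string
  name: string // e.g. 'faithfulness' or 'user-feedback'
  value: number // 0-1, as in the scoring functions above
}

// Average one named score per prompt version.
function meanScoreByVersion(records: ScoreRecord[], scoreName: string): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>()
  for (const r of records) {
    if (r.name !== scoreName) continue
    const entry = sums.get(r.promptVersion) ?? { total: 0, count: 0 }
    entry.total += r.value
    entry.count += 1
    sums.set(r.promptVersion, entry)
  }
  const means = new Map<string, number>()
  sums.forEach((entry, version) => means.set(version, entry.total / entry.count))
  return means
}
```

A drop in the mean between two adjacent versions is the signal to go back to the individual traces for that version and inspect their inputs.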
Cost Tracking and Dashboards
Langfuse calculates token costs per trace using model-specific pricing. The built-in dashboard shows cost by user, by feature, by model, and over time — without any additional instrumentation beyond passing usage in your generation calls.
- Set up a cost per user per day dashboard to spot abusive usage early
- Track cost per successful trace as your key efficiency metric — cheaper is only good if quality holds
- Compare cost before and after prompt changes — more verbose prompts that improve quality may still cost more overall
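The arithmetic behind those metrics is simple enough to sketch. The example below assumes a single model with per-million-token prices and a per-trace `succeeded` flag that you define yourself (for instance, a positive user-feedback score). All type and function names here are hypothetical, and any prices you pass in should come from your provider's current price list.

```typescript
// USD per million tokens; supply real values from your provider.
interface ModelPrice {
  inputPerMTok: number
  outputPerMTok: number
}

interface TraceUsage {
  userId: string
  day: string // e.g. '2025-01-31'
  inputTokens: number
  outputTokens: number
  succeeded: boolean // your own success criterion
}

// Cost of one trace from its token counts.
function traceCost(t: TraceUsage, price: ModelPrice): number {
  return (t.inputTokens * price.inputPerMTok + t.outputTokens * price.outputPerMTok) / 1e6
}

// Total spend keyed by 'userId|day', for spotting unusually heavy users.
function costPerUserPerDay(traces: TraceUsage[], price: ModelPrice): Map<string, number> {
  const totals = new Map<string, number>()
  for (const t of traces) {
    const key = `${t.userId}|${t.day}`
    totals.set(key, (totals.get(key) ?? 0) + traceCost(t, price))
  }
  return totals
}

// Total spend (failed traces included) divided by the number of successes.
function costPerSuccessfulTrace(traces: TraceUsage[], price: ModelPrice): number {
  const total = traces.reduce((sum, t) => sum + traceCost(t, price), 0)
  const successes = traces.filter((t) => t.succeeded).length
  return successes === 0 ? Infinity : total / successes
}
```

Counting failed traces in the numerator is a deliberate choice in this sketch: a cheaper prompt that fails more often then shows up as a higher cost per success rather than a lower one.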
Self-Hosting Langfuse
For applications where production prompts and user inputs are sensitive, self-host Langfuse. The Docker Compose setup can be up and running in under 10 minutes on a VM with 4 GB of RAM.
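# docker-compose.yml, adapted from langfuse/langfuse
# This Postgres-only layout matches the Langfuse v2 image; v3 also expects
# ClickHouse, Redis, and blob storage, so use the upstream compose file for v3.
services:
  langfuse-server:
    image: langfuse/langfuse:2
    depends_on: [db]
    ports: ["3000:3000"]
    environment:
      DATABASE_URL: postgresql://postgres:password@db:5432/langfuse
      NEXTAUTH_URL: http://localhost:3000
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      SALT: ${SALT}
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: password  # change for anything beyond local testing
      POSTGRES_DB: langfuse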
Update your SDK configuration to point to your self-hosted instance. For the JS SDK, set LANGFUSE_BASEURL=https://langfuse.yourdomain.com or pass baseUrl to the Langfuse constructor. All data stays within your infrastructure.
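The compose file reads `NEXTAUTH_SECRET` and `SALT` from the environment, so both need values before the first start. A minimal sketch for generating them with Node's built-in `crypto` module follows; the helper names are hypothetical, and 32 random bytes is an assumed length, so check the Langfuse self-hosting documentation for any stricter requirements.

```typescript
import { randomBytes } from 'node:crypto'

// Base64-encode `bytes` bytes of cryptographically secure randomness.
function generateSecret(bytes = 32): string {
  return randomBytes(bytes).toString('base64')
}

// Render .env lines for the two variables the compose file interpolates.
function renderEnvLines(): string {
  return [
    `NEXTAUTH_SECRET=${generateSecret()}`,
    `SALT=${generateSecret()}`,
  ].join('\n')
}
```

Writing the output of `renderEnvLines()` to a `.env` file next to `docker-compose.yml` lets Docker Compose pick the values up during variable interpolation. Generate them once and keep them stable across restarts.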