DevOps · 9 min read · July 5, 2024

Monitoring and Alerting for AI Applications with Datadog

How to set up comprehensive monitoring for AI-powered applications using Datadog. Covers custom LLM metrics, distributed tracing, cost dashboards, anomaly detection, and on-call workflows.

Datadog · Monitoring · AI · Observability · DevOps
Azam

DevOps & AI Consultant

Why Standard Monitoring Falls Short for AI Apps

Traditional APM catches infrastructure failures and latency spikes. AI applications introduce a new failure mode: the system is up and requests return 200, but the quality of the AI responses has degraded. A prompt regression, a model version update, or a shift in the distribution of user inputs can silently degrade the user experience without triggering any standard alert. Monitoring AI applications therefore means tracking quality metrics alongside the standard operational metrics.

Instrumenting Your Application with DogStatsD

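Send custom metrics from your application to the DogStatsD server built into the Datadog Agent, which listens on UDP port 8125 by default. Use consistent naming conventions: ai.llm.{metric} for LLM-specific metrics and app.{feature}.{metric} for feature-level business metrics. With the myapp. prefix configured below, ai.llm.latency appears in Datadog as myapp.ai.llm.latency.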

import { StatsD } from 'hot-shots'

// The prefix is prepended to every metric name sent through this client.
const dogstatsd = new StatsD({ host: 'localhost', port: 8125, prefix: 'myapp.' })

// llmClient stands for your application's own LLM client; it is not defined in this snippet.
async function callLLM(prompt: string, model: string, feature: string) {
  const start = Date.now()
  try {
    const response = await llmClient.complete(prompt, { model })
    const duration = Date.now() - start

    // Latency in milliseconds, tagged so it can be split by model and feature.
    dogstatsd.timing('ai.llm.latency', duration, [`model:${model}`, `feature:${feature}`])
    // Token counters feed the cost dashboard.
    dogstatsd.increment('ai.llm.tokens.input', response.usage.inputTokens, [`model:${model}`])
    dogstatsd.increment('ai.llm.tokens.output', response.usage.outputTokens, [`model:${model}`])
    dogstatsd.increment('ai.llm.requests', 1, [`model:${model}`, `feature:${feature}`, 'status:success'])

    return response
  } catch (error) {
    // Count the failure, then rethrow so callers still see the error.
    dogstatsd.increment('ai.llm.requests', 1, [`model:${model}`, `feature:${feature}`, 'status:error'])
    throw error
  }
}
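Tag values such as model names often come from configuration or provider responses and may contain spaces or uppercase characters. Datadog lowercases tags, converts characters outside its allowed set to underscores, and limits tags to 200 characters, so normalizing tags in the application keeps them predictable. The sketch below is one way to do that; toTag and llmTags are hypothetical helper names, not part of hot-shots.

```typescript
// Hypothetical helper: normalizes a key/value pair into a Datadog-friendly tag.
function toTag(key: string, value: string): string {
  const clean = (s: string): string =>
    s
      .trim()
      .toLowerCase()
      // Keep letters, digits, underscores, minuses, colons, periods, slashes.
      .replace(/[^a-z0-9_\-:.\/]/g, '_')
  // Tags longer than 200 characters are cut to the limit.
  return `${clean(key)}:${clean(value)}`.slice(0, 200)
}

// Hypothetical helper: builds the tag set used by the LLM metrics.
function llmTags(model: string, feature: string, status?: 'success' | 'error'): string[] {
  const tags = [toTag('model', model), toTag('feature', feature)]
  if (status) tags.push(toTag('status', status))
  return tags
}
```

A call such as llmTags(model, feature, 'success') could replace the inline template strings in callLLM. Unbounded values such as user IDs are better kept out of metric tags, since every distinct tag combination creates a separate time series.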

Distributed Tracing for Multi-Step AI Pipelines

AI applications often involve multiple sequential steps: retrieval, reranking, generation, and post-processing. Distributed tracing with Datadog APM shows the duration and errors of each step within a single trace, which is critical for finding the part of a RAG pipeline that is slow or failing.

import tracer, { Span } from 'dd-trace'

// Initialize the tracer once, as early as possible in the process.
tracer.init()

async function processRAGQuery(query: string, userId: string) {
  return tracer.trace('rag.query', { resource: 'answer_question' }, async (span: Span) => {
    span.setTag('user.id', userId)
    span.setTag('query.length', query.length)

    const docs = await tracer.trace('rag.retrieval', {}, async () => {
      return retrieveDocuments(query)
    })
    span.setTag('retrieval.doc_count', docs.length)

    const answer = await tracer.trace('rag.generation', {}, async (genSpan: Span) => {
      const result = await generateAnswer(query, docs)
      genSpan.setTag('generation.tokens', result.usage.outputTokens)
      return result
    })

    return answer
  })
}

Cost Monitoring Dashboard

LLM costs are metered by token. Build a real-time cost dashboard from your token metrics so unexpected cost spikes are visible before they hit your invoice.

# Datadog formula for cost tracking:
# input_tokens * input_price + output_tokens * output_price, summed per hour
#
# Claude 3.5 Sonnet pricing as of 2024:
# Input: $3 per million tokens ($0.000003 per token)
# Output: $15 per million tokens ($0.000015 per token)
#
# Build the formula in a dashboard widget or monitor from two queries:
# a = sum:myapp.ai.llm.tokens.input{model:claude-3-5-sonnet}.as_count().rollup(sum, 3600)
# b = sum:myapp.ai.llm.tokens.output{model:claude-3-5-sonnet}.as_count().rollup(sum, 3600)
# (a * 0.000003) + (b * 0.000015) = hourly cost in USD
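The same arithmetic can be done in application code, for example to log an estimated cost per request. This is a minimal sketch using the prices quoted above; PRICES and estimateCostUsd are hypothetical names, and the price table would need to be kept in sync with your provider's current pricing.

```typescript
interface ModelPrice {
  inputPerMillion: number // USD per million input tokens
  outputPerMillion: number // USD per million output tokens
}

// Assumed price table; the single entry mirrors the figures quoted above.
const PRICES: Record<string, ModelPrice> = {
  'claude-3-5-sonnet': { inputPerMillion: 3, outputPerMillion: 15 },
}

// Estimated cost in USD for one request or for an aggregated token count.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model]
  if (!price) throw new Error(`No price configured for model: ${model}`)
  return (inputTokens * price.inputPerMillion + outputTokens * price.outputPerMillion) / 1_000_000
}
```

For example, 1,000,000 input tokens and 200,000 output tokens come to $3 + $3 = $6.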

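Set up cost alerts at 50%, 80%, and 100% of your daily budget using Datadog monitors on the computed cost metric. The 50% alert works as a pace signal: if it fires before midday, you are on track to exceed the budget, and if it fires in the first quarter of the day, you are on pace to double it.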
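The pace reasoning behind these thresholds can be sketched as follows. The function names are hypothetical, and the projection is deliberately naive: it assumes spend continues at the average rate seen so far today, which will not hold for traffic with strong daily patterns.

```typescript
// Budget fractions at which alerts fire, matching the 50%, 80%, and 100% tiers.
const THRESHOLDS = [0.5, 0.8, 1.0]

// Returns the budget fractions already reached by the current spend.
function crossedThresholds(spendUsd: number, dailyBudgetUsd: number): number[] {
  return THRESHOLDS.filter((t) => spendUsd >= t * dailyBudgetUsd)
}

// Linear projection of end-of-day spend from the spend so far.
function projectedDailySpend(spendUsd: number, hoursElapsed: number): number {
  if (hoursElapsed <= 0) throw new Error('hoursElapsed must be positive')
  return (spendUsd * 24) / hoursElapsed
}
```

With a $100 daily budget, $50 spent after 6 hours crosses only the 50% tier but projects to $200 for the day, twice the budget.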

Anomaly Detection for Quality Metrics

User feedback scores and automated quality metrics (faithfulness, relevance) fluctuate naturally. Use Datadog's anomaly detection monitors to alert only when a metric deviates significantly from its historical baseline, which avoids alert fatigue from normal variation.

# Datadog monitor configuration (via API or Terraform)
{
  "name": "AI Quality Regression: User Satisfaction",
  "type": "query alert",
  "query": "avg(last_4h):anomalies(avg:myapp.ai.user_satisfaction{*}, 'basic', 3, direction='below', interval=300, alert_window='last_30m', count_default_zero='true') >= 1",
  "message": "User satisfaction score is anomalously low. Check recent prompt deployments and model version. @pagerduty-ai-oncall",
  "options": {
    "thresholds": {
      "critical": 1
    },
    "threshold_windows": {
      "trigger_window": "last_30m",
      "recovery_window": "last_30m"
    },
    "notify_no_data": false,
    "evaluation_delay": 300
  }
}
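The monitor above queries myapp.ai.user_satisfaction, which the earlier instrumentation does not emit. One possible way to produce it, assuming your product collects thumbs-up/thumbs-down feedback, is to submit each rating as 1 or 0 to a distribution metric so that its average is the share of positive ratings. DistributionClient and recordSatisfaction are hypothetical names; the interface is written to be compatible with the distribution method on a hot-shots StatsD client.

```typescript
// Minimal structural type for a client that can submit distribution metrics.
interface DistributionClient {
  distribution(stat: string, value: number, tags?: string[]): void
}

type Feedback = 'up' | 'down'

// Maps a rating to 1 (positive) or 0 (negative) and submits it.
// With the 'myapp.' prefix the metric arrives as myapp.ai.user_satisfaction.
function recordSatisfaction(client: DistributionClient, feedback: Feedback, feature: string): void {
  const score = feedback === 'up' ? 1 : 0
  client.distribution('ai.user_satisfaction', score, [`feature:${feature}`])
}
```

Calling recordSatisfaction(dogstatsd, 'up', 'search') from the feedback endpoint would then feed the anomaly monitor. Low-traffic features may need a longer evaluation window for the average to be meaningful.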

On-Call Runbook Integration

Every Datadog monitor alert should link to a runbook. For AI-specific incidents, runbooks need to cover the scenarios that don't exist in traditional software: model provider outages, prompt regression detection, and quality degradation response procedures.

  • Model provider outage: Check provider status page → switch traffic to fallback provider → page on-call if fallback also unavailable
  • Cost spike: Identify which feature/user is consuming tokens → apply per-feature rate limit → investigate root cause
  • Quality regression: Check recent prompt deployments → roll back last prompt change → compare before/after quality metrics in Langfuse
  • Latency spike: Check provider latency dashboard → enable caching for repeated queries → consider switching to a faster/cheaper model for the affected feature
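The per-feature rate limit from the cost-spike runbook can be prepared ahead of an incident. Below is a minimal in-memory sketch using a fixed time window; FeatureTokenLimiter is a hypothetical class, and a deployment with several application instances would need shared state (for example in a cache or database) instead of a local Map.

```typescript
// Fixed-window token budget per feature, held in process memory.
class FeatureTokenLimiter {
  private usage = new Map<string, { windowStart: number; tokens: number }>()
  private limitPerWindow: number
  private windowMs: number
  private now: () => number

  // now is injectable so the limiter can be tested with a fake clock.
  constructor(limitPerWindow: number, windowMs: number, now: () => number = Date.now) {
    this.limitPerWindow = limitPerWindow
    this.windowMs = windowMs
    this.now = now
  }

  // Returns true and records the usage if the feature stays within its budget.
  tryConsume(feature: string, tokens: number): boolean {
    const t = this.now()
    let entry = this.usage.get(feature)
    // Start a fresh window on first use or once the previous window has expired.
    if (!entry || t - entry.windowStart >= this.windowMs) {
      entry = { windowStart: t, tokens: 0 }
      this.usage.set(feature, entry)
    }
    if (entry.tokens + tokens > this.limitPerWindow) return false
    entry.tokens += tokens
    return true
  }
}
```

A caller would check tryConsume(feature, estimatedTokens) before each LLM request and return a degraded response or an error when it returns false.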
