DevOps · 12 min read · February 20, 2025

Building a Full Observability Stack with Prometheus and Grafana

How to set up a production observability stack using Prometheus, Grafana, Loki, and Alertmanager. Covers metric collection, log aggregation, dashboards, and on-call alerting.

Prometheus, Grafana, Monitoring, Observability, DevOps
Azam

DevOps & AI Consultant

The Three Pillars: Metrics, Logs, and Traces

Observability is not the same as monitoring. Monitoring tells you when something is broken; observability lets you understand why. A complete observability stack covers three pillars: metrics (what is happening), logs (what happened in detail), and traces (how a request moved through the system). This guide covers the first two with an open-source stack: Prometheus for metrics, Loki for logs, Grafana for visualisation, and Alertmanager for alert routing. Tracing is not covered here. The stack is production-grade and runs entirely on your own infrastructure.

Prometheus Setup and Service Discovery

Prometheus scrapes metrics from HTTP endpoints on a configurable interval. In Kubernetes, its service discovery lists every pod through the API server, and relabel rules keep only the pods that carry the right annotations, so you never maintain a target list by hand.

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
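    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use prometheus.io/path, when set, as the metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use prometheus.io/port, when set, as the scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__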

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

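To opt a pod into scraping, annotate it with prometheus.io/scrape: "true" and set prometheus.io/port to the port that serves metrics, for example "3000". Add prometheus.io/path only if the endpoint is not /metrics.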
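
To make the relabel rules concrete, here is a rough JavaScript sketch of the decision they encode for a single discovered pod. It is not how Prometheus is implemented: `resolveScrapeTarget` and the shape of the `pod` object are hypothetical, and the sketch assumes the port annotation is applied by a relabel rule on `__address__`.

```javascript
// Illustrative only: mimics the keep/replace decisions of the relabel rules,
// not Prometheus internals. The pod object shape is a hypothetical stand-in.
function resolveScrapeTarget(pod) {
  const ann = pod.annotations ?? {}

  // action: keep, regex: true -> drop pods that have not opted in
  if (ann['prometheus.io/scrape'] !== 'true') return null

  // A non-empty path annotation replaces __metrics_path__; default is /metrics
  const path = ann['prometheus.io/path'] || '/metrics'

  // Assumption: a port annotation, when present, overrides the discovered port
  const port = ann['prometheus.io/port'] || String(pod.port)

  return { address: `${pod.ip}:${port}`, metricsPath: path }
}

const target = resolveScrapeTarget({
  ip: '10.0.0.7',
  port: 8080,
  annotations: { 'prometheus.io/scrape': 'true', 'prometheus.io/port': '3000' },
})
console.log(target) // { address: '10.0.0.7:3000', metricsPath: '/metrics' }
```

A pod without the scrape annotation returns null here, which corresponds to the target being dropped before any scrape happens.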

Instrumenting Your Application

Expose a /metrics endpoint from your Node.js API (Express in this example) using prom-client. Track the RED metrics (Rate, Errors, Duration) for every service; a single duration histogram labelled by status code covers all three.

import express from 'express'
import client from 'prom-client'

const app = express()

const registry = new client.Registry()
client.collectDefaultMetrics({ register: registry })

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
  registers: [registry],
})

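// Middleware: time every request and record it when the response finishes
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer()
  res.on('finish', () => {
    // req.route is undefined for unmatched paths (404s); use a fixed label
    // value so arbitrary URLs cannot inflate label cardinality
    const route = req.route?.path ?? 'unmatched'
    end({ method: req.method, route, status_code: res.statusCode })
  })
  next()
})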

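// Expose all registered metrics in the Prometheus text format
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType)
  res.send(await registry.metrics())
})

// Listen on the port named in the prometheus.io/port annotation
app.listen(3000)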
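
Choosing bucket bounds is easier once you see how a histogram counts. The sketch below reuses the bounds from the histogram above and shows, in plain JavaScript, that every bucket is cumulative: an observation increments each bucket whose upper bound it does not exceed. `bucketize` is a hypothetical helper for illustration, not part of prom-client.

```javascript
// Illustrative only: shows how cumulative histogram buckets are counted.
// prom-client does this internally; this is not its implementation.
const bounds = [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5]

function bucketize(durations) {
  // One counter per upper bound, plus the implicit +Inf bucket
  const counts = new Map([...bounds, Infinity].map((le) => [le, 0]))
  for (const d of durations) {
    for (const le of counts.keys()) {
      // Cumulative: increment every bucket whose bound is >= the observation
      if (d <= le) counts.set(le, counts.get(le) + 1)
    }
  }
  return counts
}

const counts = bucketize([0.02, 0.08, 0.4, 7])
console.log(counts.get(0.05), counts.get(0.5), counts.get(Infinity)) // 1 3 4
```

The 7-second request lands only in the +Inf bucket, so anything slower than the top bound is counted but cannot be placed more precisely. Set the top bound above the slowest request you expect to care about.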

Log Aggregation with Loki and Promtail

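Loki indexes logs only by their labels, not by their content, which usually makes it much cheaper to store and run than a full-text index such as Elasticsearch. The trade-off is that a query filtering on message text has to scan the streams its labels select. Promtail ships logs from your pods to Loki and uses Kubernetes metadata to label them automatically.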

# promtail-config.yml
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
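    pipeline_stages:
      # Unwrap the container runtime's log format before parsing the app's JSON
      # (use docker: {} instead on Docker-based nodes)
      - cri: {}
      - json:
          expressions:
            level: level
            msg: message
      # Promote only low-cardinality fields such as level to labels
      - labels:
          level:
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Tell Promtail which log files on the node belong to this pod
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log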
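
The json stage above only works if your application writes one JSON object per line with `level` and `message` keys. A minimal sketch of such a logger follows; `formatLogLine`, `logJson`, and the `orderId` field are hypothetical names for illustration, and in practice an established logging library configured for JSON output would do the same job.

```javascript
// Illustrative only: a minimal structured logger whose output matches the
// `level` and `message` keys the json stage above extracts.
function formatLogLine(level, message, fields = {}) {
  // Spread extra fields first so they cannot overwrite level or message
  return JSON.stringify({ ...fields, level, message })
}

function logJson(level, message, fields) {
  process.stdout.write(formatLogLine(level, message, fields) + '\n')
}

logJson('error', 'payment failed', { orderId: 'hypothetical-123' })
// prints {"orderId":"hypothetical-123","level":"error","message":"payment failed"}
```

Keep `level` to a small fixed set of values, because it becomes a Loki label. High-cardinality values such as an order ID belong in the log body, where you can still filter on them at query time.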

Grafana Dashboards: The Metrics That Matter

Build one dashboard per service, and one infrastructure overview dashboard. Every service dashboard should answer four questions at a glance: Is request rate normal? Is error rate acceptable? Is latency within SLA? Are resources (CPU, memory) being exhausted?

# Key PromQL queries for your dashboards

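# Request rate (requests per second, all routes)
sum(rate(http_request_duration_seconds_count[5m]))

# Error rate (share of requests answered with a 5xx)
# sum() both sides: without it, each 5xx series is divided by itself
sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
/ sum(rate(http_request_duration_seconds_count[5m]))

# P99 latency
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# CPU usage per pod (cores)
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))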
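
It helps to know what the P99 query actually computes. histogram_quantile does not see individual requests; it finds the bucket that contains the target rank and interpolates linearly inside it. The sketch below shows that idea in simplified form (no +Inf handling, and `estimateQuantile` plus the sample counts are made up for illustration).

```javascript
// Illustrative only: the linear interpolation idea behind histogram_quantile,
// simplified. buckets: [{ le, count }] sorted by le, with cumulative counts.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count
  const rank = q * total
  let prevLe = 0
  let prevCount = 0
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // Assume observations are spread evenly inside this bucket
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount))
    }
    prevLe = le
    prevCount = count
  }
  return NaN
}

const p99 = estimateQuantile(0.99, [
  { le: 0.1, count: 900 },
  { le: 0.5, count: 980 },
  { le: 1, count: 1000 },
])
console.log(p99) // 0.75
```

The estimate can never be more precise than the bucket it falls in: here the true P99 could be anywhere between 0.5 and 1 second. Place bucket bounds close to your SLA threshold if you need a sharper answer there.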

Alertmanager: On-Call Alerting

Define alerts in Prometheus rule files. Prometheus evaluates them and hands firing alerts to Alertmanager, which groups, deduplicates, and routes them to PagerDuty, Slack, or email. Page people only when human action is actually required; alert fatigue from noisy alerts is a serious operational hazard.

# alert_rules.yml
groups:
  - name: api.rules
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
          / sum(rate(http_request_duration_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

Every alert should have a runbook link. An alert without a runbook is a page that will be ignored or resolved incorrectly at 3am.
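
The `for: 5m` clause is what keeps a brief spike from paging anyone. A rough sketch of the behaviour: the alert sits in a pending state while the expression stays true, fires only once it has held for the full duration, and resets as soon as one evaluation comes back false. `alertStates` is a hypothetical helper, and the sketch evaluates once per minute for brevity rather than every 15 seconds.

```javascript
// Illustrative only: how a `for:` clause delays firing until the condition
// has held continuously. Not Prometheus code; names are hypothetical.
function alertStates(samples, threshold, forSeconds, intervalSeconds) {
  let activeSince = null // evaluation time at which the condition became true
  return samples.map((value, i) => {
    const now = i * intervalSeconds
    if (!(value > threshold)) {
      activeSince = null // any false evaluation resets the timer
      return 'inactive'
    }
    if (activeSince === null) activeSince = now
    return now - activeSince >= forSeconds ? 'firing' : 'pending'
  })
}

// Error ratio sampled once per minute, 1% threshold, for: 5m
const states = alertStates(
  [0.02, 0.03, 0, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02],
  0.01, 300, 60
)
console.log(states.join(' '))
// pending pending inactive pending pending pending pending pending firing
```

The dip to zero at the third sample restarts the clock, so the alert fires five minutes after the error rate rises for the second time, not five minutes after the first spike.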
