Design an Observability Stack for 50+ Microservices on EKS

Build a production observability platform using OpenTelemetry, Prometheus, Grafana, Loki, and Tempo to cover metrics, logs, and traces for a 50-service EKS environment.

January 20, 2025 5 min read ~40 min to complete DB
The Situation

You've joined as the SRE lead for a 50-service e-commerce platform on EKS. The team currently has no central observability: each service logs to its own CloudWatch log group, there are no distributed traces, and alerting is ad hoc. When a checkout failure occurs, it takes 45 minutes to find the root cause because engineers are grepping logs across 12 different log groups. Your task is to build an observability platform.

6 Steps
7 Services Used
~40 min Duration
Advanced Difficulty

The Problem

Without centralized observability, you’re flying blind. Debugging means exec-ing into pods, grepping logs, and guessing at root causes. The three pillars (metrics, logs, and traces) must be collected centrally and correlated by request or trace ID so you can jump from a slow trace to the logs that explain why.
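
Correlation only works if every log line carries the same ID as the trace it belongs to. The sketch below shows one way a service might emit structured log lines with that ID attached. The helper names (`buildLogLine`, `traceIdFromLogLine`) and the field names (`trace_id`, `span_id`) are assumptions for illustration, not part of any SDK.

```javascript
// Hypothetical helper: serialize one log event as JSON, carrying the IDs
// that let a log line be matched to a trace.
function buildLogLine({ level, service, message, traceId, spanId }, now = new Date()) {
  return JSON.stringify({
    ts: now.toISOString(),
    level,
    service,
    trace_id: traceId, // 32 hex characters in W3C Trace Context
    span_id: spanId,   // 16 hex characters
    msg: message,
  });
}

// Recover the trace ID from a stored line, as a log-to-trace link would.
function traceIdFromLogLine(line) {
  const parsed = JSON.parse(line);
  return parsed.trace_id ?? null;
}

const line = buildLogLine({
  level: 'error',
  service: 'payment-service',
  message: 'card authorization timed out',
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
});
console.log(line);
console.log(traceIdFromLogLine(line));
```

With a field like this in every line, a log query on `trace_id` returns exactly the lines written while that request was in flight.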

Observability Architecture

                    ┌─────────────────────────────────┐
                    │     GRAFANA (Unified UI)         │
                    └──────────┬──────────┬────────────┘
                               │          │
              ┌────────────────┼──────────┼──────────────┐
              │                │          │              │
        METRICS            LOGS       TRACES          COST
       Prometheus        Loki         Tempo         Kubecost
       + Thanos         (Fluent     (Jaeger         (per ns
       (long-term)       Bit)        OTEL)         /service)
              │                │          │
              └────────────────┴──────────┘
                               │
                    OpenTelemetry Collector
                    (DaemonSet on every node)
                               │
                    ┌──────────┴──────────┐
               Application Pods    Node Metrics
               (OTLP SDK)         (node-exporter)

Step 1: Deploy OpenTelemetry Collector

The OpenTelemetry Collector runs as a DaemonSet — one instance per node — collecting telemetry from all pods on that node:

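# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector   # must match spec.selector.matchLabels
    spec:
      # Assumed ServiceAccount; kubeletstats needs RBAC to read node stats
      serviceAccountName: otel-collector
      containers:
      - name: otel-collector
        # Pin a specific release in production instead of :latest
        image: otel/opentelemetry-collector-contrib:latest
        args: ["--config=/etc/otel/config.yaml"]   # use the mounted ConfigMap
        env:
        - name: K8S_NODE_NAME   # referenced by the kubeletstats receiver
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        volumeMounts:
        - name: config
          mountPath: /etc/otel
      volumes:
      - name: config
        configMap:
          name: otel-collector-config
---
# otel-collector-config ConfigMap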
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "${env:K8S_NODE_NAME}:10250"
        insecure_skip_verify: true

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      resource:
        attributes:
        - action: insert
          key: cluster.name
          value: prod-eks-us-east-1
        - action: insert
          key: environment
          value: production

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
        namespace: otel
      otlp/tempo:
        endpoint: http://tempo.monitoring:4317
        tls:
          insecure: true
      loki:
        endpoint: http://loki.monitoring:3100/loki/api/v1/push
        default_labels_enabled:
          exporter: false
          job: true

    service:
      pipelines:
        metrics:
          receivers: [otlp, kubeletstats]
          processors: [resource, batch]   # batch runs last, after attributes are set
          exporters: [prometheus]
        traces:
          receivers: [otlp]
          processors: [resource, batch]
          exporters: [otlp/tempo]
        logs:
          receivers: [otlp]
          processors: [resource, batch]
          exporters: [loki]
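
The `batch` processor settings in the config (`timeout: 5s`, `send_batch_size: 1000`) mean a batch is sent when either limit is reached first. The sketch below models that rule with a hypothetical `Batcher` class and explicit timestamps. It simplifies the behaviour and is not the Collector's implementation.

```javascript
// Simplified model of size-or-timeout batching. Time is passed in explicitly
// (milliseconds) so the behaviour is deterministic and easy to follow.
class Batcher {
  constructor({ sendBatchSize, timeoutMs, exporter }) {
    this.sendBatchSize = sendBatchSize;
    this.timeoutMs = timeoutMs;
    this.exporter = exporter; // callback that receives each finished batch
    this.pending = [];
    this.lastFlushMs = 0;
  }

  // Queue one item; flush immediately once the size limit is reached.
  add(item, nowMs) {
    this.pending.push(item);
    if (this.pending.length >= this.sendBatchSize) this.flush(nowMs);
  }

  // Called periodically; flushes a partial batch once the timeout has elapsed.
  tick(nowMs) {
    if (this.pending.length > 0 && nowMs - this.lastFlushMs >= this.timeoutMs) {
      this.flush(nowMs);
    }
  }

  flush(nowMs) {
    this.exporter(this.pending);
    this.pending = [];
    this.lastFlushMs = nowMs;
  }
}

const sentSizes = [];
const batcher = new Batcher({
  sendBatchSize: 1000,
  timeoutMs: 5000,
  exporter: (batch) => sentSizes.push(batch.length),
});
for (let i = 0; i < 1000; i++) batcher.add({ id: i }, 100); // size limit reached
batcher.add({ id: 1000 }, 200);
batcher.tick(5200); // timeout reached, partial batch goes out
console.log(sentSizes);
```

Larger batches cut the number of export requests. The timeout caps how long a span or log line waits on a quiet node.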

Step 2: Instrument Your Microservices

Add the OpenTelemetry SDK to each service. Here’s a Java Spring Boot example:

// build.gradle — add OTEL dependencies
dependencies {
    implementation 'io.opentelemetry:opentelemetry-api:1.32.0'
    implementation 'io.opentelemetry:opentelemetry-sdk:1.32.0'
    implementation 'io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter:1.32.0'
}
# application.yml — configure OTEL export
management:
  tracing:
    sampling:
      probability: 0.1   # Sample 10% of traces (adjust based on volume)

otel:
  exporter:
    otlp:
      endpoint: http://otel-collector.monitoring:4317
  resource:
    attributes:
      service.name: payment-service
      service.version: "${BUILD_VERSION}"
      deployment.environment: production
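
A sampling probability of 0.1 is only useful if every service reaches the same keep-or-drop decision for a given trace. Otherwise traces arrive with missing spans. Ratio samplers usually manage this by deriving the decision from the trace ID itself. The sketch below is a simplified illustration of that idea. `shouldSample` is a hypothetical function, and the exact algorithm in any given SDK may differ.

```javascript
// Simplified ratio sampler: maps the first 32 bits of the trace ID onto
// [0, 1) and keeps the trace when that value falls below the ratio.
// Because the input is the trace ID, every service makes the same decision.
function shouldSample(traceId, ratio) {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error('traceId must be 32 lowercase hex characters');
  }
  const position = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return position < ratio;
}

const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(exampleTraceId, 0.1)); // same answer in every service
console.log(shouldSample(exampleTraceId, 1.0)); // ratio 1.0 keeps everything
```

Real deployments often wrap a sampler like this in a parent-based one, so that a downstream service follows the decision already recorded in the incoming trace context.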

For Node.js:

// tracing.js — loaded before app.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector.monitoring:4317'
  }),
  serviceName: 'order-service',
});
sdk.start();

Step 3: Deploy Prometheus + Grafana With Helm

# Deploy kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="$GRAFANA_PASS" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi

# Deploy Loki for log aggregation, with a Fluent Bit DaemonSet for log collection
# (comments cannot follow a trailing backslash, so they stay on their own lines)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set fluent-bit.enabled=true \
  --set loki.storage.type=s3 \
  --set loki.storage.s3.bucketnames=prod-loki-logs \
  --set loki.storage.s3.region=us-east-1

# Deploy Tempo for distributed tracing
# (in the single-binary grafana/tempo chart, storage settings sit under the "tempo" key)
helm install tempo grafana/tempo \
  --namespace monitoring \
  --set tempo.storage.trace.backend=s3 \
  --set tempo.storage.trace.s3.bucket=prod-tempo-traces \
  --set tempo.storage.trace.s3.region=us-east-1

Step 4: SLO Alerting — The Right Way to Alert

Stop alerting on causes such as CPU > 80%. Alert on user-facing impact using SLO burn rate:

# Prometheus recording rules — precompute SLI
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-recording-rules
  namespace: monitoring
spec:
  groups:
  - name: payment-service.slo
    rules:
    # SLI: fraction of requests that are successful
    - record: job:http_request_success_rate:rate5m
      expr: |
        sum(rate(http_requests_total{job="payment-service",code!~"5.."}[5m]))
        /
        sum(rate(http_requests_total{job="payment-service"}[5m]))

    # Page: burn rate > 14.4x (error rate above 1.44% against a 99.9% SLO).
    # Sustained, this spends 2% of a 30-day error budget per hour and
    # exhausts the whole budget in about 50 hours.
    - alert: PaymentServiceSLOBudgetBurnHigh
      expr: |
        (1 - job:http_request_success_rate:rate5m) > (14.4 * 0.001)
      for: 2m
      labels:
        severity: critical
        service: payment-service
      annotations:
        summary: "Payment service burning error budget 14x too fast"
        description: |
          Error rate {{ $value | humanizePercentage }}: at this rate the monthly
          error budget will be exhausted in about 2 days.
          Runbook: https://wiki.company.com/runbooks/payment-service-slo

    # Warn at 6x burn rate (exhausts a 30-day budget in about 5 days)
    - alert: PaymentServiceSLOBudgetBurnMedium
      expr: |
        (1 - job:http_request_success_rate:rate5m) > (6 * 0.001)
      for: 15m
      labels:
        severity: warning
        service: payment-service
Step 5: RED Method Dashboards

Every service gets a RED dashboard: Rate, Errors, Duration.

{
  "panels": [
    {
      "title": "Request Rate (req/s)",
      "type": "graph",
      "targets": [{
        "expr": "sum(rate(http_requests_total{job=\"$service\"}[5m]))",
        "legendFormat": "RPS"
      }]
    },
    {
      "title": "Error Rate (%)",
      "type": "graph",
      "targets": [{
        "expr": "sum(rate(http_requests_total{job=\"$service\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"$service\"}[5m])) * 100",
        "legendFormat": "Error %"
      }]
    },
    {
      "title": "p50 / p95 / p99 Latency",
      "type": "graph",
      "targets": [
        {"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job=\"$service\"}[5m])) by (le))", "legendFormat": "p50"},
        {"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"$service\"}[5m])) by (le))", "legendFormat": "p95"},
        {"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"$service\"}[5m])) by (le))", "legendFormat": "p99"}
      ]
    }
  ]
}
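
The latency panels rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts, not from raw request timings. The sketch below is a simplified model of that estimate: linear interpolation inside the bucket that contains the target rank. It is meant to show why bucket boundaries limit the precision of p95 and p99. The bucket data is invented, and the function is not Prometheus's own code.

```javascript
// buckets: [{ le, count }] sorted by upper bound `le`, with cumulative counts
// and a final { le: Infinity } bucket, mirroring a Prometheus histogram.
// Intended for 0 < q < 1.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = buckets.findIndex((b) => b.count >= rank);

  // A rank that lands in the +Inf bucket can only be reported as the
  // highest finite upper bound.
  if (buckets[idx].le === Infinity) return buckets[buckets.length - 2].le;

  const lowerBound = idx === 0 ? 0 : buckets[idx - 1].le;
  const countBelow = idx === 0 ? 0 : buckets[idx - 1].count;
  const countInBucket = buckets[idx].count - countBelow;

  // Assume observations are spread evenly across the bucket.
  return lowerBound + (buckets[idx].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Invented example data: 100 requests, cumulative counts per bucket (seconds).
const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.5, latencyBuckets).toFixed(3));
console.log(histogramQuantile(0.95, latencyBuckets).toFixed(3));
```

Because the estimate assumes an even spread inside each bucket, a p99 that falls in a wide bucket (here 0.5 s to 1 s) can be off by a large margin. Buckets should be chosen around the latency targets that matter.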

Step 6: Kubecost — Cost Visibility Per Service

Add cost observability to complement technical metrics:

helm repo add cost-analyzer https://kubecost.github.io/cost-analyzer/
helm install kubecost cost-analyzer/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --set kubecostToken="<token>" \
  --set global.prometheus.enabled=false \
  --set global.prometheus.fqdn=http://kube-prometheus-prometheus.monitoring:9090

Kubecost shows cost per namespace, deployment, and label:

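# API example: get cost for the payment-service namespace
# (-G sends the parameters as a GET query string; filter syntax varies by Kubecost version)
curl -G http://kubecost-cost-analyzer.kubecost:9090/model/allocation \
  --data-urlencode 'window=30d' \
  --data-urlencode 'aggregate=namespace' \
  --data-urlencode 'filter=namespace:"payment-service"'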

Observability Checklist

Pillar                 Tool
Metrics                Prometheus + Grafana RED dashboards
Logs                   Loki + Fluent Bit DaemonSet
Traces                 Tempo + OpenTelemetry
SLO alerting           Prometheus rules + burn rate
Cost visibility        Kubecost per namespace
Alertmanager routing   PagerDuty (P1), Slack (P2/P3)

Interview Angle
Interviewers want to hear the correlation story: “I can start from a slow trace in Tempo, jump to the Loki logs for that specific request ID, and then correlate with Prometheus metrics at that timestamp — all from within Grafana.” That’s the difference between observability and logging. Mention exemplars (Prometheus → Tempo trace links) as the technical mechanism that enables this.
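
An exemplar is a trace ID attached to an individual metric sample. In the OpenMetrics text format it appears after a `#` on the sample line. The sketch below formats such a line so the shape is visible. `formatExemplarLine` is a hypothetical helper, the values are invented, and label values are assumed to contain no quotes or backslashes.

```javascript
// Render a label set as key="value" pairs (no escaping; illustration only).
function renderLabels(labels) {
  return Object.entries(labels)
    .map(([key, value]) => `${key}="${value}"`)
    .join(',');
}

// Build one OpenMetrics sample line with an exemplar:
//   metric{labels} value # {exemplar_labels} exemplar_value
function formatExemplarLine(metric, labels, value, exemplarLabels, exemplarValue) {
  return `${metric}{${renderLabels(labels)}} ${value} # {${renderLabels(exemplarLabels)}} ${exemplarValue}`;
}

console.log(
  formatExemplarLine(
    'http_request_duration_seconds_bucket',
    { job: 'payment-service', le: '0.5' },
    90,
    { trace_id: '4bf92f3577b34da6a3ce929d0e0e4736' },
    0.43
  )
);
```

Grafana can turn the `trace_id` in an exemplar into a link that opens the matching trace in Tempo, which is the metric-to-trace jump described above.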
Services Used
EKS, Prometheus, Grafana, OpenTelemetry, AWS CloudWatch, Fluent Bit, Kubecost
Prerequisites
  • Familiarity with Kubernetes and Helm
  • Basic understanding of Prometheus metrics format
  • Understanding of distributed tracing concepts (spans, traces)
What You Learned
  • How to instrument microservices with OpenTelemetry for unified observability
  • How to deploy Loki, Grafana, and Tempo alongside Prometheus on EKS (the LGTM stack, with Prometheus in place of Mimir)
  • How to define and alert on SLO burn rate for each service
  • How to use the RED method for service dashboards
  • How Kubecost provides cost visibility per service/namespace
