Design an Observability Stack for 50+ Microservices on EKS
Build a production observability platform using OpenTelemetry, Prometheus, Grafana, Loki, and Tempo to cover metrics, logs, and traces for a 50-service EKS environment.
You've joined as the SRE lead for a 50-service e-commerce platform on EKS. The team currently has no central observability: each service logs to its own CloudWatch log group, there are no distributed traces, and alerting is ad hoc. When a checkout failure occurs, it takes 45 minutes to find the root cause because engineers are grepping logs across 12 different log groups. Build an observability platform.
The Problem
Without centralized observability, you’re flying blind. Debugging means exec-ing into pods, grepping logs, and guessing at root causes. The three pillars (metrics, logs, traces) must be collected centrally and correlated by a shared request or trace ID so you can jump from a slow trace to the logs that explain why.
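To make the correlation idea concrete, here is a minimal sketch, assuming services propagate the W3C `traceparent` header and emit JSON logs. The helper names `parseTraceparent` and `logLine` are hypothetical; with the OpenTelemetry SDK the IDs would normally come from the active span instead of manual header parsing.

```javascript
// Hypothetical helpers: stamp every log line with the trace ID of the request.
// W3C traceparent layout: version-traceid-parentid-flags (lowercase hex).
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT.exec(header || '');
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function logLine(level, message, traceparent) {
  const ctx = parseTraceparent(traceparent);
  // JSON.stringify drops keys whose value is undefined
  return JSON.stringify({
    level,
    message,
    trace_id: ctx ? ctx.traceId : undefined,
    span_id: ctx ? ctx.spanId : undefined,
  });
}

console.log(
  logLine('error', 'charge failed', '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
);
```

With a `trace_id` field in every log line, a log query filtered on that value returns exactly the lines written while the slow trace was in flight.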
Observability Architecture
            ┌─────────────────────────────────┐
            │      GRAFANA (Unified UI)       │
            └──────────┬──────────┬───────────┘
                       │          │
      ┌────────────────┼──────────┼──────────────┐
      │                │          │              │
   METRICS            LOGS     TRACES           COST
 Prometheus           Loki      Tempo        Kubecost
  + Thanos          (Fluent    (Jaeger        (per ns
 (long-term)          Bit)      OTEL)        /service)
      │                │          │
      └────────────────┴──────────┘
                       │
            OpenTelemetry Collector
           (DaemonSet on every node)
                       │
            ┌──────────┴──────────┐
    Application Pods        Node Metrics
       (OTLP SDK)          (node-exporter)
Step 1: Deploy OpenTelemetry Collector
The OpenTelemetry Collector runs as a DaemonSet — one instance per node — collecting telemetry from all pods on that node:
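# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector          # must match spec.selector.matchLabels
    spec:
      # kubeletstats with auth_type: serviceAccount also needs a ServiceAccount
      # with RBAC access to node stats (not shown here)
      containers:
        - name: otel-collector
          # Pin a tested release instead of :latest; 0.96.0 is only an example tag
          image: otel/opentelemetry-collector-contrib:0.96.0
          args: ["--config=/etc/otel/config.yaml"]   # load the mounted ConfigMap
          env:
            - name: K8S_NODE_NAME    # referenced by the kubeletstats receiver
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          ports:
            - containerPort: 4317    # OTLP gRPC
            - containerPort: 4318    # OTLP HTTP
            - containerPort: 8889    # Prometheus exporter scrape endpoint
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config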
# otel-collector-config ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "${env:K8S_NODE_NAME}:10250"
        insecure_skip_verify: true
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      resource:
        attributes:
          - action: insert
            key: cluster.name
            value: prod-eks-us-east-1
          - action: insert
            key: environment
            value: production
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
        namespace: otel
      otlp/tempo:
        # host:port form for gRPC; plaintext is selected by tls.insecure
        endpoint: tempo.monitoring:4317
        tls:
          insecure: true
      loki:
        endpoint: http://loki.monitoring:3100/loki/api/v1/push
        default_labels_enabled:
          exporter: false
          job: true
    service:
      pipelines:
        # Add resource attributes first, then batch just before export
        metrics:
          receivers: [otlp, kubeletstats]
          processors: [resource, batch]
          exporters: [prometheus]
        traces:
          receivers: [otlp]
          processors: [resource, batch]
          exporters: [otlp/tempo]
        logs:
          receivers: [otlp]
          processors: [resource, batch]
          exporters: [loki]
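The Collector refuses to start when a pipeline names a receiver, processor, or exporter that is not declared in the matching top-level section. As a rough pre-deploy check, the sketch below walks a config object and reports such references. The function name `undeclaredComponents` is hypothetical, and the object is the Step 1 wiring reduced to component names; a real check would first parse the YAML.

```javascript
// Hypothetical lint: every component named in a pipeline must be declared.
function undeclaredComponents(config) {
  const problems = [];
  for (const [name, pipeline] of Object.entries(config.service.pipelines)) {
    for (const kind of ['receivers', 'processors', 'exporters']) {
      const declared = config[kind] || {};
      for (const id of pipeline[kind] || []) {
        if (!(id in declared)) {
          problems.push(`${name}: ${kind} "${id}" is not declared`);
        }
      }
    }
  }
  return problems;
}

// Step 1 wiring, reduced to component names
const collectorConfig = {
  receivers: { otlp: {}, kubeletstats: {} },
  processors: { batch: {}, resource: {} },
  exporters: { prometheus: {}, 'otlp/tempo': {}, loki: {} },
  service: {
    pipelines: {
      metrics: { receivers: ['otlp', 'kubeletstats'], processors: ['resource', 'batch'], exporters: ['prometheus'] },
      traces: { receivers: ['otlp'], processors: ['resource', 'batch'], exporters: ['otlp/tempo'] },
      logs: { receivers: ['otlp'], processors: ['resource', 'batch'], exporters: ['loki'] },
    },
  },
};

console.log(undeclaredComponents(collectorConfig)); // prints []
```

Renaming an exporter in one place only (for example `otlp/tempo` to `otlp`) makes the function return one message for the `traces` pipeline, which is the class of mistake this catches before a rollout.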
Step 2: Instrument Your Microservices
Add the OpenTelemetry SDK to each service. Here’s a Java Spring Boot example:
// build.gradle: add OTEL dependencies
dependencies {
    implementation 'io.opentelemetry:opentelemetry-api:1.32.0'
    implementation 'io.opentelemetry:opentelemetry-sdk:1.32.0'
    // Instrumentation artifacts in this release line are published with an -alpha suffix
    implementation 'io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter:1.32.0-alpha'
}
# application.yml: configure OTEL export
management:
  tracing:
    sampling:
      # Spring Boot (Micrometer Tracing) property: sample 10% of traces.
      # Check which sampler setting your OTel starter version reads.
      probability: 0.1
otel:
  exporter:
    otlp:
      endpoint: http://otel-collector.monitoring:4317
  resource:
    attributes:
      service.name: payment-service
      service.version: "${BUILD_VERSION}"
      deployment.environment: production
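A 10% sampling probability only yields complete traces if every service reaches the same keep-or-drop decision for a given trace. The sketch below is a simplified illustration of that idea, not the SDK's exact algorithm: it derives the decision from the trace ID itself, so the result is deterministic across services. The function name `shouldSample` is hypothetical.

```javascript
// Simplified ratio-based head sampling keyed on the trace ID.
// Assumes traceId is a 32-character lowercase hex string.
function shouldSample(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex characters as an unsigned 32-bit bucket
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace when its bucket falls in the lowest `ratio` share of the range
  return bucket < ratio * 0x100000000;
}

console.log(shouldSample('4bf92f3577b34da6a3ce929d00000000', 0.1)); // prints true
console.log(shouldSample('4bf92f3577b34da6a3ce929dffffffff', 0.1)); // prints false
```

In practice the decision is usually made once at the edge and propagated in the sampled flag of the trace context, so downstream services follow the parent instead of recomputing it.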
For Node.js:
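// tracing.js: load before the app, e.g. `node -r ./tracing.js app.js`
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector.monitoring:4317',
  }),
  // Without instrumentations, only spans you create by hand are exported
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// Flush buffered spans when Kubernetes stops the pod
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});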
Step 3: Deploy Prometheus + Grafana With Helm
# Deploy kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="$GRAFANA_PASS" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
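# Deploy Loki for log aggregation, with a Fluent Bit DaemonSet for log collection
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Storage value keys differ between chart versions; confirm them with
# `helm show values grafana/loki-stack` before installing.
# A comment after a trailing backslash breaks the command, so none appear below.
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set fluent-bit.enabled=true \
  --set promtail.enabled=false \
  --set loki.storage.type=s3 \
  --set loki.storage.s3.bucketnames=prod-loki-logs \
  --set loki.storage.s3.region=us-east-1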
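# Deploy Tempo for distributed tracing
# In the single-binary grafana/tempo chart the storage keys sit under `tempo.`;
# verify against `helm show values grafana/tempo` for your chart version.
helm install tempo grafana/tempo \
  --namespace monitoring \
  --set tempo.storage.trace.backend=s3 \
  --set tempo.storage.trace.s3.bucket=prod-tempo-traces \
  --set tempo.storage.trace.s3.region=us-east-1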
Step 4: SLO Alerting — The Right Way to Alert
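Stop alerting on resource causes such as CPU > 80%. Alert on user impact using SLO burn rate. For a 99.9% SLO the error budget is 0.1% of requests, and the burn rate is the observed error ratio divided by that 0.1%: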
# Prometheus recording and alerting rules for a 99.9% availability SLO
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-recording-rules
  namespace: monitoring
  labels:
    release: kube-prometheus   # kube-prometheus-stack selects rules by release label by default
spec:
  groups:
    - name: payment-service.slo
      rules:
        # SLI: fraction of requests that are successful
        - record: job:http_request_success_rate:rate5m
          expr: |
            sum(rate(http_requests_total{job="payment-service",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="payment-service"}[5m]))
        # Page: a 14.4× burn rate spends 2% of a 30-day error budget per hour
        # and would exhaust the whole budget in about 50 hours
        - alert: PaymentServiceSLOBudgetBurnHigh
          # The error ratio is the left-hand side, so $value is the error rate
          expr: |
            (1 - job:http_request_success_rate:rate5m) > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            service: payment-service
          annotations:
            summary: "Payment service burning error budget 14.4x too fast"
            description: |
              Error rate {{ $value | humanizePercentage }}: at this rate the 30-day
              error budget is exhausted in about 2 days.
              Runbook: https://wiki.company.com/runbooks/payment-service-slo
        # Warn at 6× burn rate (exhausts the 30-day budget in about 5 days)
        - alert: PaymentServiceSLOBudgetBurnMedium
          expr: |
            (1 - job:http_request_success_rate:rate5m) > (6 * 0.001)
          for: 15m
          labels:
            severity: warning
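The burn-rate thresholds follow from simple arithmetic. The sketch below assumes a 30-day SLO window and the 99.9% target used in the rules; the function names are hypothetical. It shows how an error ratio maps to a burn rate, how long the budget lasts at that rate, and what share of the budget one hour consumes.

```javascript
const SLO_TARGET = 0.999;      // 99.9% of requests succeed
const WINDOW_HOURS = 30 * 24;  // 30-day SLO window = 720 hours

// Error budget: the share of requests allowed to fail (0.001 for 99.9%)
function errorBudget(sloTarget) {
  return 1 - sloTarget;
}

// Burn rate: how many times faster than "exactly on budget" errors occur
function burnRate(errorRatio, sloTarget) {
  return errorRatio / errorBudget(sloTarget);
}

// Hours until the whole budget is gone if the burn rate stays constant
function hoursToExhaust(rate, windowHours) {
  return windowHours / rate;
}

// Fraction of the budget consumed after `hours` at a constant burn rate
function budgetSpent(rate, hours, windowHours) {
  return (rate * hours) / windowHours;
}

console.log(burnRate(0.0144, SLO_TARGET));          // about 14.4
console.log(hoursToExhaust(14.4, WINDOW_HOURS));    // about 50 hours
console.log(budgetSpent(14.4, 1, WINDOW_HOURS));    // about 0.02, i.e. 2% per hour
console.log(hoursToExhaust(6, WINDOW_HOURS) / 24);  // 5 days
```

A 14.4× burn therefore does not empty a monthly budget within the hour; it spends 2% of it per hour, which is why it is a common paging threshold.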
Step 5: RED Method Dashboards
Every service gets a RED dashboard: Rate, Errors, Duration.
{
  "panels": [
    {
      "title": "Request Rate (req/s)",
      "type": "timeseries",
      "targets": [{
        "expr": "sum(rate(http_requests_total{job=\"$service\"}[5m]))",
        "legendFormat": "RPS"
      }]
    },
    {
      "title": "Error Rate (%)",
      "type": "timeseries",
      "targets": [{
        "expr": "sum(rate(http_requests_total{job=\"$service\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"$service\"}[5m])) * 100",
        "legendFormat": "Error %"
      }]
    },
    {
      "title": "p50 / p95 / p99 Latency",
      "type": "timeseries",
      "targets": [
        {"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job=\"$service\"}[5m])) by (le))", "legendFormat": "p50"},
        {"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"$service\"}[5m])) by (le))", "legendFormat": "p95"},
        {"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"$service\"}[5m])) by (le))", "legendFormat": "p99"}
      ]
    }
  ]
}
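With 50 services, hand-editing one dashboard per service does not scale. One option is to generate the three RED queries from the job name, as in the sketch below. It assumes every service exposes `http_requests_total` and `http_request_duration_seconds_bucket` with a `job` label, as in the panels above; the function name `redQueries` is hypothetical.

```javascript
// Build the Rate, Errors, and Duration PromQL queries for one service.
function redQueries(job, window = '5m') {
  const selector = `job="${job}"`;
  const total = `sum(rate(http_requests_total{${selector}}[${window}]))`;
  const errors = `sum(rate(http_requests_total{${selector},code=~"5.."}[${window}]))`;
  return {
    rate: total,
    errorPercent: `${errors} / ${total} * 100`,
    // One latency query per quantile: p50, p95, p99
    duration: [0.5, 0.95, 0.99].map(
      (q) =>
        `histogram_quantile(${q}, sum(rate(http_request_duration_seconds_bucket{${selector}}[${window}])) by (le))`
    ),
  };
}

const queries = redQueries('payment-service');
console.log(queries.rate);
console.log(queries.errorPercent);
queries.duration.forEach((query) => console.log(query));
```

The same strings can be dropped into the `expr` fields of a dashboard template, or the `$service` variable can be kept and the function used only to lint that every service actually exports these metrics.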
Step 6: Kubecost — Cost Visibility Per Service
Add cost observability to complement technical metrics:
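helm repo add cost-analyzer https://kubecost.github.io/cost-analyzer/
# The Prometheus service name depends on the kube-prometheus-stack release name;
# confirm it with `kubectl get svc -n monitoring` before setting the FQDN.
helm install kubecost cost-analyzer/cost-analyzer \
  --namespace kubecost \
  --create-namespace \
  --set kubecostToken="<token>" \
  --set global.prometheus.enabled=false \
  --set global.prometheus.fqdn=http://kube-prometheus-prometheus.monitoring:9090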
Kubecost shows cost per namespace, deployment, and label:
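# API example: get cost for the payment-service namespace.
# Assumes release "kubecost" in namespace "kubecost" (service kubecost-cost-analyzer).
# -G sends the --data pairs as a GET query string instead of a POST body.
# Filter parameter names vary across Kubecost versions; check the Allocation API docs.
curl -G http://kubecost-cost-analyzer.kubecost:9090/model/allocation \
  --data 'window=30d&aggregate=namespace&namespace=payment-service'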
Observability Checklist
| Pillar | Tool | Status |
|---|---|---|
| Metrics | Prometheus + Grafana RED dashboards | ✅ |
| Logs | Loki + Fluent Bit DaemonSet | ✅ |
| Traces | Tempo + OpenTelemetry | ✅ |
| SLO alerting | Prometheus rules + burn rate | ✅ |
| Cost visibility | Kubecost per namespace | ✅ |
| Alertmanager routing | PagerDuty (P1), Slack (P2/P3) | ✅ |
- How to instrument microservices with OpenTelemetry for unified observability
- How to deploy Loki, Grafana, and Tempo on EKS, with Prometheus in the metrics role that Mimir fills in the LGTM stack
- How to define and alert on SLO burn rate for each service
- How to use the RED method for service dashboards
- How Kubecost provides cost visibility per service/namespace