📊 Monitoring & Observability Interview Questions

Scenario-based interview questions for AWS CloudWatch, Prometheus, and Grafana

Your EC2 instance CPU utilization has been above 90% for 20 minutes, but the CloudWatch alarm configured at 80% threshold has not triggered. What would you investigate?

Intermediate

Ans:

Key Discussion Points:

Check alarm state: INSUFFICIENT_DATA, OK, or ALARM
Verify the metric namespace and dimension (InstanceId must match)
Check evaluation period and datapoints-to-alarm settings (e.g., 3 out of 5)
Confirm the CloudWatch agent is installed and running if using custom metrics
Verify IAM role attached to EC2 has cloudwatch:PutMetricData permission
Check if the alarm is using Average vs Maximum statistic — CPU spikes may average out
Confirm SNS topic subscription is confirmed for notification delivery

Sample Answer: “I would first check the alarm configuration in the console to verify the threshold, evaluation period, and statistic type. Then I’d confirm the metric dimensions match the instance ID and check whether the alarm is in INSUFFICIENT_DATA state due to missing metric data points.”
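The evaluation-period and statistic points above can be illustrated with a small simulation. This is a simplified sketch, not CloudWatch's exact evaluation algorithm; the function name `evaluate_alarm` and the sample values are assumptions made for illustration.

```python
from statistics import mean


def evaluate_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified M-out-of-N check. None marks a missing datapoint."""
    window = list(datapoints)[-evaluation_periods:]
    present = [v for v in window if v is not None]
    if not present:
        # No data at all, e.g. the alarm points at the wrong InstanceId.
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for v in present if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# One 5-minute period of per-minute CPU samples containing a single spike.
samples = [40.0, 45.0, 95.0, 42.0, 41.0]
print(mean(samples) > 80)  # False: the Average statistic hides the spike
print(max(samples) > 80)   # True: the Maximum statistic exposes it
```

An alarm that stays in INSUFFICIENT_DATA usually points to a namespace or dimension mismatch rather than a threshold problem.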

Your application running on EC2 is not sending logs to CloudWatch Logs despite the CloudWatch agent being installed. How do you troubleshoot?

Intermediate

Ans:

Key Discussion Points:

Check CloudWatch agent status: systemctl status amazon-cloudwatch-agent
Review agent config at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
Verify IAM role has logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
Check agent logs at /opt/aws/amazon-cloudwatch-agent/logs/
Confirm log file path in agent config matches actual application log path
Check if log group exists in the correct AWS region
Ensure the instance has outbound internet access or VPC endpoint for CloudWatch Logs

Sample Answer: “I’d start by checking the CloudWatch agent logs for errors, then verify the IAM role permissions. Next, I’d validate the agent config to ensure the log file paths are correct and the log group/stream names match what’s expected in CloudWatch.”
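As a rough sketch of the config check described above, the following builds the `logs` section of an agent config and reports configured paths that match no file on disk. The helper names, the log path, and the log group name are assumptions; only the `logs_collected.files.collect_list` layout comes from the agent's config format.

```python
import glob
import json


def build_logs_config(file_path, log_group, log_stream="{instance_id}"):
    """Return the 'logs' section of a CloudWatch agent config as a dict."""
    return {"logs": {"logs_collected": {"files": {"collect_list": [
        {
            "file_path": file_path,
            "log_group_name": log_group,
            "log_stream_name": log_stream,
        }
    ]}}}}


def missing_log_files(config):
    """List configured file paths (wildcards allowed) that match nothing."""
    entries = config["logs"]["logs_collected"]["files"]["collect_list"]
    return [e["file_path"] for e in entries if not glob.glob(e["file_path"])]


# Hypothetical application log path and log group name.
config = build_logs_config("/var/log/myapp/app.log", "myapp-logs")
print(json.dumps(config, indent=2))
print("Paths with no matching file:", missing_log_files(config))
```

A non-empty result from `missing_log_files` suggests the agent is watching a path the application never writes to.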

Your team wants to be alerted whenever a Lambda function's error rate exceeds 5% over a 5-minute window. How would you implement this?

Intermediate

Ans:

Key Discussion Points:

Use built-in Lambda metrics: Errors, Invocations in namespace AWS/Lambda
Create a CloudWatch metric math alarm: Errors / Invocations * 100 > 5
Set evaluation period to 1 period of 5 minutes, 1 out of 1 datapoints
Configure SNS topic → Email/Slack notification via Lambda or Chatbot
Consider using Lambda Insights for enhanced metrics
Enable X-Ray tracing for deeper error analysis
Set up a CloudWatch Dashboard to visualize error trends

Sample Answer: “I’d create a CloudWatch alarm using metric math to compute the error rate as Errors / Invocations * 100, set the threshold to 5%, and configure SNS to notify the on-call team. For deeper visibility, I’d also enable Lambda Insights and X-Ray.”
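One way to express the metric math alarm is as the parameter set for the CloudWatch `PutMetricAlarm` API. The sketch below assumes boto3 is available when `create_alarm` is called; the function names, alarm name, and the choice of `notBreaching` for missing data are illustrative assumptions, not requirements.

```python
def lambda_error_rate_alarm(function_name, sns_topic_arn, threshold_pct=5.0):
    """Build PutMetricAlarm parameters for Errors / Invocations * 100."""

    def stat(metric_id, metric_name):
        # Input metrics are not returned; only the expression drives the alarm.
        return {
            "Id": metric_id,
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 300,  # 5-minute window
                "Stat": "Sum",
            },
        }

    return {
        "AlarmName": f"{function_name}-error-rate",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_pct,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "TreatMissingData": "notBreaching",  # assumption: no traffic is healthy
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            stat("errors", "Errors"),
            stat("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "errors / invocations * 100",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
    }


def create_alarm(params):
    """Send the parameters to CloudWatch (requires boto3 and credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```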

You manage 5 AWS accounts across different environments (dev, staging, prod). You want a single CloudWatch dashboard in a central monitoring account to show metrics from all accounts. How do you set this up?

Advanced

Ans:

Key Discussion Points:

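Use CloudWatch cross-account functionality: either the cross-account, cross-Region console (sharing roles) or the newer cross-account observability built on Observability Access Manager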
In source accounts: create sharing role and enable metric sharing
In monitoring account: add source accounts as linked accounts
Create a central CloudWatch dashboard referencing cross-account metrics
Use AWS Organizations for simplified cross-account trust setup
Consider EventBridge cross-account event routing for alarms

Sample Answer: “I’d configure CloudWatch cross-account observability by setting up sharing roles in each source account and linking them to the central monitoring account. Then I’d build a unified dashboard in the central account that pulls metrics from all environments.”

Your AWS bill unexpectedly doubled this month. Your manager asks you to use CloudWatch to identify the cause. What steps do you take?

Intermediate

Ans:

Key Discussion Points:

Use AWS Cost Explorer alongside CloudWatch for cost attribution
Create billing alarms using AWS/Billing namespace metric EstimatedCharges
Look at AWS/EC2, AWS/RDS, AWS/S3 metrics for resource usage spikes
Check NetworkOut, RequestCount, DataTransfer metrics for data egress costs
Use CloudWatch Logs Insights to find Lambda invocation surges
Correlate with CloudTrail events for unexpected resource creation
Set up future billing alarms with SNS to catch spikes early

Sample Answer: “I’d first check billing alarms and Cost Explorer to pinpoint which service spiked. Then I’d correlate with CloudWatch metrics for that service — for example, if EC2 cost doubled, I’d look at instance count, NetworkOut, and running hours — and cross-reference with CloudTrail for any unauthorized resource creation.”
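A billing alarm on `EstimatedCharges` could look roughly like this. The function names, alarm name, and 6-hour period are assumptions for illustration; billing metrics are published in us-east-1 and must first be enabled in the account's billing preferences.

```python
def billing_alarm(threshold_usd, sns_topic_arn):
    """Build PutMetricAlarm parameters for total estimated charges in USD."""
    return {
        "AlarmName": f"estimated-charges-over-{int(threshold_usd)}-usd",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours; the metric is only updated a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_billing_alarm(params):
    """Create the alarm in us-east-1 (requires boto3 and credentials)."""
    import boto3

    boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```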

Your e-commerce application experiences flash sales. You need CloudWatch alarms to trigger Auto Scaling to add instances when CPU > 70% and remove them when CPU < 30%. Walk through your design.

Advanced

Ans:

Key Discussion Points:

Create scale-out alarm: CPUUtilization > 70% for 2 consecutive 1-minute periods
Create scale-in alarm: CPUUtilization < 30% for 5 consecutive 1-minute periods
Link alarms to Auto Scaling Group (ASG) scaling policies
Use target tracking scaling with ASGAverageCPUUtilization for simpler setup
Consider step scaling vs simple scaling vs target tracking trade-offs
Add cooldown periods to prevent rapid scale-in/scale-out thrashing
Test with load testing tools like ab or locust to validate

Sample Answer: “I’d create two CloudWatch alarms tied to ASG scaling policies — a scale-out alarm at 70% CPU with a shorter evaluation window for quick response, and a scale-in alarm at 30% with a longer window to prevent premature scale-down. Target tracking scaling would be the simplest implementation.”
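The target tracking option mentioned above can be sketched as a single scaling policy. Target tracking takes one target value rather than separate 70% and 30% thresholds, so the 50% target here is an assumption, as are the function and policy names.

```python
def target_tracking_policy(asg_name, target_cpu=50.0):
    """Build PutScalingPolicy parameters for CPU-based target tracking."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu,
            "DisableScaleIn": False,  # allow the policy to remove instances too
        },
    }


def apply_policy(params):
    """Attach the policy to the ASG (requires boto3 and credentials)."""
    import boto3

    boto3.client("autoscaling").put_scaling_policy(**params)
```

With this policy type, Auto Scaling creates and manages the underlying CloudWatch alarms itself, which is why it is the simpler setup.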

Your application logs ERROR and CRITICAL messages to CloudWatch Logs. You need to create an alarm that fires when more than 10 errors occur within 5 minutes. How do you implement this?

Intermediate

Ans:

Key Discussion Points:

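Create a metric filter on the log group with filter pattern ERROR or "ERROR" (square brackets are reserved for space-delimited field patterns, so [ERROR] does not match the term)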
Map matches to a custom metric: namespace App/Errors, metric name ErrorCount, value 1
Create CloudWatch alarm on the custom metric: Sum > 10 over 5-minute period
Use { $.level = "ERROR" } for JSON-structured logs
Consider separate filters for ERROR vs CRITICAL severity levels
Set "treat missing data" to notBreaching (good) to avoid false alarms during low-traffic periods

Sample Answer: “I’d create a CloudWatch Log metric filter with pattern ERROR that publishes to a custom metric. Then I’d set up an alarm on that metric with a Sum statistic threshold of 10 over a 5-minute period, configured to treat missing data as not breaching.”
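The filter and the alarm can be sketched as two parameter sets, one for the CloudWatch Logs `PutMetricFilter` API and one for `PutMetricAlarm`. The function, filter, and alarm names are hypothetical; the namespace and metric name come from the discussion points above.

```python
def error_metric_filter(log_group, pattern='"ERROR"'):
    """Build PutMetricFilter parameters: each matching event counts as 1."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",
        "filterPattern": pattern,  # use '{ $.level = "ERROR" }' for JSON logs
        "metricTransformations": [{
            "metricNamespace": "App/Errors",
            "metricName": "ErrorCount",
            "metricValue": "1",
        }],
    }


def error_count_alarm(sns_topic_arn, max_errors=10):
    """Build PutMetricAlarm parameters: Sum > max_errors over 5 minutes."""
    return {
        "AlarmName": "app-error-count",
        "Namespace": "App/Errors",
        "MetricName": "ErrorCount",
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(max_errors),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def apply(filter_params, alarm_params):
    """Create both resources (requires boto3 and credentials)."""
    import boto3

    boto3.client("logs").put_metric_filter(**filter_params)
    boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```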

Users are reporting slow database queries during peak hours. How do you use CloudWatch to diagnose RDS performance issues?

Advanced

Ans:

Key Discussion Points:

Check AWS/RDS metrics: CPUUtilization, DatabaseConnections, FreeableMemory
Monitor ReadLatency, WriteLatency, DiskQueueDepth for I/O bottlenecks
Use ReadIOPS and WriteIOPS to check if hitting Provisioned IOPS limits
Enable Performance Insights for query-level analysis
Check ReplicaLag if using read replicas
Enable Enhanced Monitoring for OS-level metrics (1-second granularity)
Create CloudWatch dashboard combining all RDS metrics

Sample Answer: “I’d start by reviewing CloudWatch RDS metrics for CPU, connections, and latency. If latency is high with elevated DiskQueueDepth, it suggests an I/O bottleneck. I’d then enable Performance Insights to identify the specific slow queries and consider increasing IOPS or instance size.”

Your architecture uses ALB, ECS, RDS, and SQS. Your VP wants a single-pane-of-glass CloudWatch dashboard showing the health of the entire stack. How do you design it?

Advanced

Ans:

Key Discussion Points:

ALB widgets: RequestCount, TargetResponseTime, HTTPCode_ELB_5XX_Count, HealthyHostCount
ECS widgets: CPUUtilization, MemoryUtilization per service/cluster
RDS widgets: CPUUtilization, DatabaseConnections, ReadLatency
SQS widgets: ApproximateNumberOfMessagesVisible (queue depth), ApproximateAgeOfOldestMessage, NumberOfMessagesSent, NumberOfMessagesDeleted
Add alarm status widgets for at-a-glance health
Use CloudFormation or CDK to codify dashboards as infrastructure-as-code
Set up auto-refresh interval (1 min for production dashboards)

Sample Answer: “I’d design the dashboard in sections — one row per service. For ALB I’d show request rate and error rate, for ECS I’d show CPU/memory per service, for RDS I’d show connections and latency, and for SQS I’d show queue depth and age. I’d add alarm status widgets at the top for a quick health summary.”
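To make the layout concrete, here is a minimal sketch that builds a dashboard body as JSON, with one widget per service on a 24-unit-wide grid. The helper names, widget sizes, region, and resource identifiers are assumptions; `publish` relies on the CloudWatch `PutDashboard` API via boto3.

```python
import json


def metric_widget(title, metrics, x, y, stat="Average", region="us-east-1"):
    """One 12x6 metric widget; the dashboard grid is 24 units wide."""
    return {
        "type": "metric", "x": x, "y": y, "width": 12, "height": 6,
        "properties": {"title": title, "metrics": metrics, "stat": stat,
                       "period": 60, "region": region, "view": "timeSeries"},
    }


def build_dashboard_body(alb, cluster, service, db_instance, queue):
    """Return the dashboard body as a JSON string."""
    widgets = [
        metric_widget("ALB traffic and errors", [
            ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", alb],
            ["AWS/ApplicationELB", "HTTPCode_ELB_5XX_Count", "LoadBalancer", alb],
        ], x=0, y=0, stat="Sum"),
        metric_widget("ECS CPU and memory", [
            ["AWS/ECS", "CPUUtilization", "ClusterName", cluster, "ServiceName", service],
            ["AWS/ECS", "MemoryUtilization", "ClusterName", cluster, "ServiceName", service],
        ], x=12, y=0),
        metric_widget("RDS connections and read latency", [
            ["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", db_instance],
            ["AWS/RDS", "ReadLatency", "DBInstanceIdentifier", db_instance],
        ], x=0, y=6),
        metric_widget("SQS age of oldest message", [
            ["AWS/SQS", "ApproximateAgeOfOldestMessage", "QueueName", queue],
        ], x=12, y=6, stat="Maximum"),
    ]
    return json.dumps({"widgets": widgets})


def publish(name, body):
    """Create or update the dashboard (requires boto3 and credentials)."""
    import boto3

    boto3.client("cloudwatch").put_dashboard(DashboardName=name, DashboardBody=body)
```

Keeping the body in code like this makes it straightforward to move into CloudFormation or CDK later.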

Q10

Your marketing team needs assurance that the public website is accessible 24/7 from multiple regions. How do you implement proactive monitoring using CloudWatch Synthetics?

Advanced

Ans:

Key Discussion Points:

Create a CloudWatch Synthetics canary using the Heartbeat blueprint
Configure it to run every 5 minutes, and deploy a copy of the canary in each region you want to probe from
Set success threshold on HTTP status code 200 and response time < 3s
Create CloudWatch alarm on canary SuccessPercent metric < 100%
Store screenshots and HAR files in S3 for failure analysis
Use API Canary blueprint for multi-step API transaction testing
Set up SNS notification for canary failures

Sample Answer: “I’d create CloudWatch Synthetics canaries using the Heartbeat blueprint targeting our public URLs, scheduled to run every 5 minutes. I’d set up alarms on the SuccessPercent metric and use SNS to alert the on-call engineer. For multi-region coverage, I’d deploy canaries in us-east-1, eu-west-1, and ap-southeast-1.”

Q11

One of your ECS tasks keeps getting OOM (Out of Memory) killed and restarting. How do you diagnose and set up monitoring to prevent recurrence?

Advanced

Ans:

Key Discussion Points:

Check MemoryUtilization metric in AWS/ECS namespace for the service
Look at ECS task stopped reason in ECS console: OutOfMemoryError: Container killed due to memory usage
Enable Container Insights for task-level and container-level metrics
Review application heap settings and memory limits in task definition
Create alarm: MemoryUtilization > 85% to catch before OOM
Use CloudWatch Logs to analyze heap dumps or memory leak patterns
Consider adjusting task definition memory reservation vs hard limit

Sample Answer: “I’d enable Container Insights to get granular memory metrics per container. I’d then review the ECS task stopped reason and CloudWatch Logs for memory-related errors. To prevent future OOM events, I’d set a CloudWatch alarm at 85% memory utilization and review the application’s memory configuration.”

Q12

Your business tracks Order Success Rate defined as SuccessfulOrders / TotalOrders * 100. Both metrics are custom CloudWatch metrics. How do you create an alarm when this rate drops below 95%?

Advanced

Ans:

Key Discussion Points:

Use CloudWatch Metric Math with expression e1 = m1 / m2 * 100
m1 = SuccessfulOrders metric, m2 = TotalOrders metric
Create alarm directly on the metric math expression
Handle division by zero by using IF(m2 > 0, m1/m2*100, 100) expression
Set alarm to treat missing data as breaching or ignore based on business need
Set period to 5 minutes with Sum statistic for both metrics

Sample Answer: “I’d create a CloudWatch alarm using Metric Math. I’d define m1 as SuccessfulOrders and m2 as TotalOrders, then write the expression IF(m2 > 0, m1/m2*100, 100) to safely compute the rate. I’d set the alarm threshold at 95% and notify the business team via SNS.”
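The guarded expression can be checked locally before it goes into an alarm. The sketch below mirrors `IF(m2 > 0, m1/m2*100, 100)` in plain Python; the function names and sample counts are assumptions for illustration.

```python
def order_success_rate(successful, total):
    """Mirror IF(m2 > 0, m1/m2*100, 100): no orders counts as 100% success."""
    return successful / total * 100 if total > 0 else 100.0


def breaches(successful, total, threshold=95.0):
    """True when the success rate falls below the alarm threshold."""
    return order_success_rate(successful, total) < threshold


print(breaches(90, 100))  # True: a 90% success rate is below the threshold
print(breaches(0, 0))     # False: an idle period is not treated as a failure
```

Whether an idle period should count as healthy is a business decision; returning 100 is simply what the expression in the discussion points does.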

Q13

Your application traffic varies wildly between weekdays and weekends, making static thresholds ineffective for alerting. How do you implement intelligent alerting?

Advanced

Ans:

Key Discussion Points:

Enable CloudWatch Anomaly Detection on the relevant metric
ML model learns seasonal patterns (hourly, daily, weekly) over time
Create alarm using ANOMALY_DETECTION_BAND as threshold
Adjust the band width multiplier (e.g., 2 standard deviations)
Useful for metrics like RequestCount, Latency, ErrorCount
Takes 2 weeks of training data for reliable anomaly detection
Exclude known anomaly periods from training using exclusion windows

Sample Answer: “I’d enable CloudWatch Anomaly Detection on the RequestCount metric. After the model trains on 2 weeks of historical data, it automatically adjusts thresholds based on day-of-week and time-of-day patterns. This eliminates false alarms on weekends while still catching genuine anomalies.”
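An anomaly detection alarm has no static threshold; instead it points `ThresholdMetricId` at an `ANOMALY_DETECTION_BAND` expression. The sketch below assumes an Application Load Balancer `RequestCount` metric; the function name, alarm name, and the 2-of-3 evaluation settings are illustrative assumptions.

```python
def anomaly_alarm(load_balancer, band_width=2):
    """Build PutMetricAlarm parameters that alarm outside the expected band."""
    return {
        "AlarmName": "request-count-anomaly",
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "ThresholdMetricId": "band",  # the band replaces a static Threshold
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ApplicationELB",
                        "MetricName": "RequestCount",
                        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
                    },
                    "Period": 300,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected RequestCount",
                "ReturnData": True,
            },
        ],
    }


def create_alarm(params):
    """Send the parameters to CloudWatch (requires boto3 and credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

A larger `band_width` widens the band and makes the alarm less sensitive.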

Q14

Your CloudWatch alarm fires at 2 AM but nobody responds to the SNS email. How do you design a proper on-call escalation system?

Advanced

Ans:

Key Discussion Points:

Integrate CloudWatch alarms with PagerDuty or OpsGenie via SNS HTTP endpoint
Use AWS Chatbot to send alarm notifications to Slack with actionable links
Configure multi-level escalation in PagerDuty (L1 → L2 → Manager)
Use standard SNS topics for alarm actions (CloudWatch alarms cannot publish to FIFO topics) and rely on the paging tool to deduplicate repeated notifications
Set up alarm action to trigger a Lambda that pages on-call via SMS
Create runbooks linked in alarm descriptions for quick remediation
Use an EventBridge rule on alarm state changes, or a Lambda alarm action, to start SSM Automation for auto-remediation

Sample Answer: “I’d integrate CloudWatch alarms with PagerDuty through SNS. The alarm SNS notification triggers a PagerDuty incident, which pages the on-call engineer via mobile push notification. If unacknowledged in 15 minutes, it escalates to the secondary on-call. I’d also set up AWS Chatbot for Slack notifications with one-click runbook links.”
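The Lambda-based paging option could start from a handler like this one, which unpacks the alarm JSON that CloudWatch publishes to SNS. It is a minimal sketch: the `pages_from_event` helper is hypothetical, using the alarm description as a runbook link is an assumption, and the `print` call stands in for a real paging integration.

```python
import json


def pages_from_event(event):
    """Extract one page message per SNS record whose alarm is in ALARM state."""
    pages = []
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        if alarm.get("NewStateValue") != "ALARM":
            continue  # ignore OK and INSUFFICIENT_DATA transitions
        pages.append(
            f'{alarm["AlarmName"]}: {alarm.get("NewStateReason", "no reason given")}'
            f' | runbook: {alarm.get("AlarmDescription", "none")}'
        )
    return pages


def lambda_handler(event, context):
    """Entry point for a Lambda function subscribed to the alarm SNS topic."""
    pages = pages_from_event(event)
    for page in pages:
        print(page)  # placeholder: call the paging or SMS provider here
    return {"paged": len(pages)}
```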

Q15

Your application produced 50GB of logs last month. You need to find all requests that took longer than 5 seconds and identify the top 10 slowest API endpoints. How do you do this efficiently?

Advanced

Ans:

Key Discussion Points:

Use CloudWatch Logs Insights query language
Rely on automatic field discovery for JSON-structured logs (reference fields directly by name); use parse @message to extract fields from unstructured logs
Query: fields @timestamp, endpoint, duration | filter duration > 5000 | stats count(*) as slowCount, avg(duration) as avgDuration by endpoint | sort avgDuration desc | limit 10
Use stats aggregation to group by endpoint
Save frequently used queries to CloudWatch Logs Insights saved queries
Schedule Insights queries via Lambda for regular reporting
Export results to S3 for long-term analysis in Athena

Sample Answer: “I’d use CloudWatch Logs Insights with a query that filters for duration > 5000, groups by endpoint using stats, sorts by average duration descending, and limits to 10 results. For recurring analysis, I’d schedule the query via a Lambda function and export results to S3.”
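For the scheduled-query idea, a sketch along these lines could run inside a Lambda function. It assumes the logs expose `endpoint` and `duration` (in milliseconds) fields as in the query above; the function names and the 24-hour lookback are illustrative, and `run_query` needs boto3 and credentials.

```python
import time


def slow_endpoint_query(min_duration_ms=5000, top_n=10):
    """Build the Logs Insights query for the slowest endpoints."""
    return (
        "fields @timestamp, endpoint, duration"
        f" | filter duration > {min_duration_ms}"
        " | stats count(*) as slowCount, avg(duration) as avgDuration by endpoint"
        " | sort avgDuration desc"
        f" | limit {top_n}"
    )


def run_query(log_group, query, hours=24):
    """Start the query and poll until it reaches a terminal status."""
    import boto3

    logs = boto3.client("logs")
    end = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=end - hours * 3600,  # epoch seconds
        endTime=end,
        queryString=query,
    )["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return result
        time.sleep(1)
```

Narrowing the time range and log groups matters for cost, since Logs Insights charges by the amount of data scanned.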

🔥 Prometheus

16. Prometheus Scrape Target Down

Scenario: Prometheus shows a target as DOWN in the Targets UI. The service is running and accessible. How do you troubleshoot?

Key Discussion Points:

Check the error message in Prometheus Targets UI (/targets page)
Common errors: connection refused, timeout, TLS mismatch, 401 Unauthorized
Verify the scrape config job_name, targets, and metrics_path in prometheus.yml
Confirm the exporter is listening on the correct port: curl http://<host>:<port>/metrics
Check network policies/security groups blocking Prometheus → target communication
If using service discovery, verify labels and relabeling rules
Check scrape_timeout vs exporter response time

Sample Answer: “I’d first check the error message on the Targets page. If it’s a connection timeout, I’d verify network connectivity from Prometheus to the target. If it’s an authentication error, I’d check the scrape config for bearer_token or basic_auth settings. I’d also curl the metrics endpoint directly from the Prometheus host to isolate the issue.”
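The "curl it from the Prometheus host" step can also be scripted. The sketch below fetches a metrics endpoint with the standard library and maps common failures to the checks listed above; the function names and hint wording are assumptions, and the 10-second default mirrors Prometheus's default scrape_timeout.

```python
import socket
import ssl
import urllib.error
import urllib.request


def classify(exc):
    """Map a scrape failure to a troubleshooting hint."""
    if isinstance(exc, urllib.error.HTTPError):  # must precede URLError handling
        if exc.code in (401, 403):
            return "auth error: check bearer_token or basic_auth in the scrape config"
        return f"HTTP {exc.code}: check metrics_path and the exporter"
    reason = getattr(exc, "reason", exc)  # URLError wraps the underlying error
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout: compare exporter response time with scrape_timeout"
    if isinstance(reason, ConnectionRefusedError):
        return "connection refused: check the port and that the exporter is listening"
    if isinstance(reason, ssl.SSLError):
        return "TLS error: check scheme and tls_config"
    return f"other error: {reason}"


def probe(url, timeout=10.0):
    """Fetch the metrics endpoint and return a short diagnosis string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
        return f"ok: received {len(body.splitlines())} lines"
    except Exception as exc:
        return classify(exc)
```

Running `probe` from the Prometheus host itself, rather than from a workstation, is what isolates network policy and security group problems.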

Tool	Key Topics Covered
AWS CloudWatch	Alarms, Logs, Synthetics, Metric Math, Anomaly Detection, Cross-Account, Container Insights
Prometheus	PromQL, Recording Rules, Alertmanager, Federation, Thanos, High Cardinality, HA Setup
Grafana	Unified Alerting, Provisioning, Loki, Tempo, RBAC, Variables, Correlation, Plugins

🔥 Prometheus

17. High Cardinality Label Problem

18. Recording Rules for Performance Optimization

19. Alertmanager Routing & Silencing

20. Prometheus Federation for Multi-Cluster Monitoring

21. Memory Consumption by Prometheus Server

22. Kubernetes Pod Monitoring with Prometheus

23. Custom Application Metrics Instrumentation

24. Prometheus Storage Retention Strategy

25. Rate vs irate vs increase in PromQL

26. Prometheus Remote Write to Thanos

27. Blackbox Exporter for External Probing

28. Alerting on Absent Metrics

29. Node Exporter for Host-Level Metrics

30. Prometheus HA Setup

📈 Grafana

31. Dashboard Not Showing Data

32. Grafana Alerting with Multiple Conditions

33. Multi-Tenant Grafana Setup with Organizations

34. Variable-Based Dynamic Dashboards

35. Grafana Loki for Log Aggregation

36. Grafana Tempo for Distributed Tracing

37. Embedding Grafana Panels in External Apps

38. Annotations for Deployment Markers

39. Grafana Provisioning via Code (GitOps)

40. Grafana Unified Alerting Migration

41. High-Load Dashboard Performance

42. Role-Based Access Control in Grafana

43. Correlating Metrics, Logs, and Traces in Grafana

44. Grafana Reporting for Stakeholders

45. Custom Grafana Plugin Development

📝 Summary
