AWS VPC Scenario Based Interview Questions & Answers (2026)
14+ VPC scenario-based interview questions and answers, from basic to advanced.
Senior DevOps / Cloud Engineer Edition — Real-World Examples
Comprehensive AWS VPC scenarios asked at top tech companies. Every answer includes architecture diagrams, real CLI commands, Terraform code, and production war-story explanations.
Ans:
Architecture Overview:
Region: us-east-1
VPC CIDR: 10.0.0.0/16 (65,536 IPs)
│
├── Availability Zone A (us-east-1a)
│ ├── Public Subnet A 10.0.0.0/24 (251 usable IPs) → ALB, NAT GW, Bastion
│ ├── Private App Subnet A 10.0.10.0/24 (251 usable IPs) → EC2/ECS App Servers
│ └── Private DB Subnet A 10.0.20.0/24 (251 usable IPs) → RDS Primary, ElastiCache
│
├── Availability Zone B (us-east-1b)
│ ├── Public Subnet B 10.0.1.0/24 (251 usable IPs) → ALB, NAT GW
│ ├── Private App Subnet B 10.0.11.0/24 (251 usable IPs) → EC2/ECS App Servers
│ └── Private DB Subnet B 10.0.21.0/24 (251 usable IPs) → RDS Standby, ElastiCache
│
└── Availability Zone C (us-east-1c)
├── Public Subnet C 10.0.2.0/24 (251 usable IPs) → ALB, NAT GW
├── Private App Subnet C 10.0.12.0/24 (251 usable IPs) → EC2/ECS App Servers
└── Private DB Subnet C 10.0.22.0/24 (251 usable IPs) → RDS Read Replica
Internet Gateway (IGW) → attached to VPC
NAT Gateway × 3 → one per AZ (HA)
Traffic Flow:
Internet
│
▼
Internet Gateway (IGW)
│
▼
Application Load Balancer (ALB) ← in Public Subnets (A, B, C)
│
▼ (HTTP/HTTPS only, port 80/443)
Auto Scaling Group (ASG) ← in Private App Subnets
│
▼ (port 5432/3306 only)
RDS Multi-AZ / ElastiCache ← in Private DB Subnets
│
▼ (app-tier outbound only, via NAT; the DB tier has no internet route)
NAT Gateway ← in Public Subnets
│
▼
Internet (for patches, API calls)
Terraform Implementation:
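# Availability Zones referenced by the subnets below
data "aws_availability_zones" "available" {
state = "available"
}
# VPC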
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "prod-vpc"
Environment = "production"
ManagedBy = "terraform"
}
}
# Internet Gateway
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = { Name = "prod-igw" }
}
# Public Subnets
resource "aws_subnet" "public" {
count = 3
vpc_id = aws_vpc.main.id
cidr_block = "10.0.${count.index}.0/24"
availability_zone = data.aws_availability_zones.available.names[count.index]
map_public_ip_on_launch = true # instances get public IP automatically
tags = {
Name = "prod-public-${data.aws_availability_zones.available.names[count.index]}"
Tier = "public"
# Required for EKS auto-discovery
"kubernetes.io/role/elb" = "1"
}
}
# Private App Subnets
resource "aws_subnet" "private_app" {
count = 3
vpc_id = aws_vpc.main.id
cidr_block = "10.0.${count.index + 10}.0/24"
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "prod-private-app-${data.aws_availability_zones.available.names[count.index]}"
Tier = "private-app"
"kubernetes.io/role/internal-elb" = "1"
}
}
# Private DB Subnets
resource "aws_subnet" "private_db" {
count = 3
vpc_id = aws_vpc.main.id
cidr_block = "10.0.${count.index + 20}.0/24"
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "prod-private-db-${data.aws_availability_zones.available.names[count.index]}"
Tier = "private-db"
}
}
# Elastic IPs for NAT Gateways
resource "aws_eip" "nat" {
count = 3
domain = "vpc"
tags = { Name = "prod-nat-eip-${count.index + 1}" }
}
# NAT Gateways (one per AZ for HA)
resource "aws_nat_gateway" "main" {
count = 3
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = {
Name = "prod-nat-${data.aws_availability_zones.available.names[count.index]}"
}
depends_on = [aws_internet_gateway.main]
}
CIDR Planning Best Practices:
/16 → 65,536 addresses: good for large enterprise VPCs (usable count depends on how it is subnetted)
/24 → 251 usable IPs: good per subnet (256 minus the 5 AWS reserves)
/20 → 4,091 usable IPs: good for large EKS node subnets
AWS reserves 5 IPs per subnet:
x.x.x.0 → Network address
x.x.x.1 → VPC router
x.x.x.2 → DNS server
x.x.x.3 → Reserved for future use
x.x.x.255 → Broadcast (not supported, reserved)
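The usable-address arithmetic above can be checked with a minimal Python sketch. It assumes each CIDR is used as a single subnet; the function names `usable_ips` and `reserved_ips` are illustrative and not part of any AWS SDK.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_ips(cidr: str) -> int:
    """Addresses left for workloads if `cidr` is used as one AWS subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def reserved_ips(cidr: str) -> list:
    """The five addresses AWS reserves in a subnet, in address order."""
    net = ipaddress.ip_network(cidr)
    first_four = [str(net.network_address + offset) for offset in range(4)]
    return first_four + [str(net.broadcast_address)]


for cidr in ("10.0.0.0/24", "10.0.16.0/20"):
    print(cidr, usable_ips(cidr), reserved_ips(cidr))
```

For a /24 this gives 251 usable addresses, which is the number to use when sizing subnets for EKS pods or large Auto Scaling Groups.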
Ans: IP Addressing Strategy — Non-overlapping CIDRs:
Organization: 172.16.0.0/12 (AWS recommends RFC 1918 ranges)
Production VPC: 172.16.0.0/16 (65,536 IPs)
Staging VPC: 172.17.0.0/16 (65,536 IPs)
Dev VPC: 172.18.0.0/16 (65,536 IPs)
Shared Svcs VPC: 172.19.0.0/16 (65,536 IPs)
Sandbox VPC: 172.20.0.0/16 (65,536 IPs)
Future expansion:
172.21.0.0/16 → 172.31.0.0/16 (11 VPCs reserved)
Why non-overlapping matters:
❌ PROBLEM: Two VPCs with same CIDR 10.0.0.0/16 CANNOT be peered
VPC A: 10.0.0.0/16
VPC B: 10.0.0.0/16
→ aws ec2 create-vpc-peering-connection → FAILS
✅ SOLUTION: Always plan CIDRs at the org level before creating any VPC
Document in IPAM (IP Address Management) tool:
- AWS VPC IP Address Manager (IPAM) service
- NetBox (open source)
- Infoblox (enterprise)
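Before any VPC is created, the plan itself can be checked for overlaps. The sketch below is one possible way to do that with Python's standard `ipaddress` module; `find_overlaps` and `carve` are hypothetical helper names, and the VPC names in the example are placeholders.

```python
import ipaddress
from itertools import combinations


def find_overlaps(vpcs: dict) -> list:
    """Return every pair of VPC names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in vpcs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


def carve(parent: str, new_prefix: int) -> list:
    """Split an organization-level block into equal, non-overlapping VPC blocks."""
    parent_net = ipaddress.ip_network(parent)
    return [str(net) for net in parent_net.subnets(new_prefix=new_prefix)]


plan = carve("172.16.0.0/12", 16)
print(len(plan), plan[0], plan[-1])
print(find_overlaps({"vpc-a": "10.0.0.0/16", "vpc-b": "10.0.0.0/16"}))
```

Running a check like this in CI against the documented address plan catches a duplicate CIDR before it blocks a peering request.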
Using AWS IPAM:
# Terraform: AWS IPAM for centralized IP management
resource "aws_vpc_ipam" "main" {
operating_regions {
region_name = "us-east-1"
}
operating_regions {
region_name = "us-west-2"
}
}
resource "aws_vpc_ipam_pool" "top_level" {
address_family = "ipv4"
ipam_scope_id = aws_vpc_ipam.main.private_default_scope_id
locale = "us-east-1"
}
resource "aws_vpc_ipam_pool_cidr" "top_level" {
ipam_pool_id = aws_vpc_ipam_pool.top_level.id
cidr = "172.16.0.0/12"
}
# Production pool: carved from top-level
resource "aws_vpc_ipam_pool" "prod" {
address_family = "ipv4"
ipam_scope_id = aws_vpc_ipam.main.private_default_scope_id
locale = "us-east-1"
source_ipam_pool_id = aws_vpc_ipam_pool.top_level.id
}
resource "aws_vpc_ipam_pool_cidr" "prod" {
ipam_pool_id = aws_vpc_ipam_pool.prod.id
cidr = "172.16.0.0/16"
}
# VPC gets CIDR from IPAM automatically
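resource "aws_vpc" "prod" {
ipv4_ipam_pool_id = aws_vpc_ipam_pool.prod.id
ipv4_netmask_length = 16
# The pool must have its CIDR provisioned before the VPC can allocate from it
depends_on = [aws_vpc_ipam_pool_cidr.prod]
tags = { Name = "prod-vpc" }
}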
Ans:
Public vs Private Subnet — The KEY difference:
Public Subnet:
✅ Route table has: 0.0.0.0/0 → Internet Gateway (IGW)
✅ Instance has a public/Elastic IP
→ Instance can send/receive traffic to/from internet directly
Private Subnet:
✅ Route table has: 0.0.0.0/0 → NAT Gateway (in public subnet)
❌ Instance has NO public IP
→ Instance can initiate outbound internet connections (via NAT)
→ Internet CANNOT initiate connections to private instances
Debugging EC2 in private subnet with no internet access:
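# Step 1: Check from the instance (if you can get a shell on it)
# Connect via bastion or SSM Session Manager. Session Manager itself needs outbound HTTPS
# (through NAT or the ssm/ssmmessages/ec2messages endpoints), so fall back to the bastion if it fails.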
aws ssm start-session --target i-1234567890abcdef0
# On the instance:
curl -m 5 https://checkip.amazonaws.com # test internet
ping 8.8.8.8 # test basic connectivity
curl -m 5 https://s3.amazonaws.com # test S3 access
traceroute 8.8.8.8 # trace the path
# Step 2: Check instance subnet
aws ec2 describe-instances \
--instance-ids i-1234567890abcdef0 \
--query 'Reservations[0].Instances[0].{SubnetId:SubnetId,VpcId:VpcId,PrivateIP:PrivateIpAddress}'
# Step 3: Check the subnet's route table
SUBNET_ID=subnet-0abc123
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=${SUBNET_ID}" \
--query 'RouteTables[0].Routes'
# Look for: { "DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-xxx" }
# Missing 0.0.0.0/0 route = NO internet access
# Step 4: If NAT route exists, check NAT Gateway status
NAT_GW_ID=nat-0abc123def456
aws ec2 describe-nat-gateways --nat-gateway-ids $NAT_GW_ID \
--query 'NatGateways[0].State'
# Must be: "available"
# If "failed" or "deleted" → NAT Gateway is the problem
# Step 5: Check NAT Gateway is in a PUBLIC subnet
aws ec2 describe-nat-gateways --nat-gateway-ids $NAT_GW_ID \
--query 'NatGateways[0].SubnetId'
# Verify this subnet has route: 0.0.0.0/0 → igw-xxx
# Step 6: Check Security Group — outbound rules
aws ec2 describe-security-groups \
--group-ids sg-0abc123 \
--query 'SecurityGroups[0].IpPermissionsEgress'
# Must have: 0.0.0.0/0 port all (or at least port 80/443)
# Step 7: Check NACL — outbound and inbound rules
aws ec2 describe-network-acls \
--filters "Name=association.subnet-id,Values=${SUBNET_ID}"
# Check both ingress AND egress rules
# NACLs are stateless — need BOTH directions allowed
# Return traffic uses ephemeral ports 1024-65535
# Common Fix: Add default route to route table
aws ec2 create-route \
--route-table-id rtb-0abc123 \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-0abc123def456
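The route-table check in Step 3 can also be expressed as a small classifier over the `Routes` list that `describe-route-tables` returns. This is a sketch only: `classify_subnet` is a hypothetical helper, and it looks solely at the IPv4 default route using the `DestinationCidrBlock`, `GatewayId`, `NatGatewayId` and `State` fields of the API response.

```python
def classify_subnet(routes: list) -> str:
    """Classify a subnet by where its 0.0.0.0/0 route points."""
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            # The target (for example a deleted NAT Gateway) no longer exists.
            return "blackhole"
        if route.get("GatewayId", "").startswith("igw-"):
            return "public"
        if route.get("NatGatewayId", "").startswith("nat-"):
            return "private"
        return "other"
    return "isolated"  # no default route at all


sample_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc123", "State": "active"},
]
print(classify_subnet(sample_routes))
```

A result of "isolated" or "blackhole" for a private app subnet points at the missing or stale default route before any time is spent on security groups or NACLs.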
Route Table Configuration (Terraform):
# Public route table
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id # → IGW for public
}
tags = { Name = "prod-public-rt" }
}
# Private route tables (one per AZ for AZ-local NAT routing)
resource "aws_route_table" "private" {
count = 3
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main[count.index].id # → NAT GW
}
tags = {
Name = "prod-private-rt-${data.aws_availability_zones.available.names[count.index]}"
}
}
# Associate subnets with route tables
resource "aws_route_table_association" "public" {
count = 3
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private_app" {
count = 3
subnet_id = aws_subnet.private_app[count.index].id
route_table_id = aws_route_table.private[count.index].id
}
Ans:
Understanding NAT Gateway pricing:
NAT Gateway costs:
Hourly charge: $0.045/hour per NAT GW × 3 AZs × 730 hrs = $98.55/month
Data processing: $0.045/GB processed through NAT GW
Data transfer: Standard AWS data transfer rates
$800/month breakdown:
3× NAT GW hourly: ~$99
Data processing: ~$701 (= 15,578 GB ≈ 15.6 TB processed!)
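The breakdown above is plain arithmetic and can be reproduced with a few lines of Python. The rates are the ones quoted in this answer ($0.045 per hour and $0.045 per GB); treat them as assumptions and check current pricing for your region.

```python
NAT_HOURLY_USD = 0.045    # per NAT Gateway per hour (rate quoted above)
NAT_PER_GB_USD = 0.045    # per GB processed (rate quoted above)
HOURS_PER_MONTH = 730


def nat_hourly_cost(gateways: int) -> float:
    """Fixed monthly charge for running `gateways` NAT Gateways."""
    return round(gateways * NAT_HOURLY_USD * HOURS_PER_MONTH, 2)


def gb_processed(data_charge_usd: float) -> float:
    """Volume of traffic implied by a given data-processing charge."""
    return data_charge_usd / NAT_PER_GB_USD


print(nat_hourly_cost(3))          # fixed part of the bill
print(round(gb_processed(701)))    # GB implied by the ~$701 data charge
```

The fixed part barely moves with optimization, so the data-processing volume is the number worth investigating.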
Diagnosis — find who is using NAT:
# Enable VPC Flow Logs and query with Athena
# Find top talkers going through NAT Gateway
# CloudWatch query for NAT Gateway bytes
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--dimensions Name=NatGatewayId,Value=nat-0abc123 \
--statistics Sum \
--period 3600 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-31T00:00:00Z
# Flow logs Athena query to find top destinations
SELECT
dstaddr,
SUM(bytes) AS total_bytes,
COUNT(*) AS connections
FROM vpc_flow_logs
WHERE srcaddr LIKE '10.0.%' -- from private subnets
AND action = 'ACCEPT'
GROUP BY dstaddr
ORDER BY total_bytes DESC
LIMIT 20;
Common root causes and fixes:
Problem 1: EC2 → S3 traffic going through NAT (biggest win)
# Traffic: EC2 → NAT Gateway → Internet → S3
# Fix: Use S3 Gateway Endpoint (FREE! No data processing charges)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-0abc123 rtb-0def456 rtb-0ghi789 \
--vpc-endpoint-type Gateway
# Terraform:
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = concat(
aws_route_table.private[*].id,
[aws_route_table.public.id]
)
tags = { Name = "prod-s3-endpoint" }
}
# Result: All S3 traffic bypasses NAT GW → $0 data processing cost
Problem 2: EC2 → DynamoDB through NAT
# Same fix: DynamoDB Gateway Endpoint (also FREE)
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.dynamodb"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
}
Problem 3: EC2 → other AWS services (CloudWatch, ECR, SSM, etc.) through NAT
# Fix: Interface VPC Endpoints for AWS services
# Cost: $0.01/hr per endpoint per AZ, plus a per-GB data processing charge (still well below NAT data charges for heavy traffic)
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private_app[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
resource "aws_vpc_endpoint" "cloudwatch_logs" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.logs"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private_app[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
Problem 4: Data transfer between AZs through NAT
# EC2 in AZ-A sending data through NAT GW in AZ-B
# Fix: Use AZ-local NAT Gateway routing (each AZ has its own NAT GW)
# Each private subnet route table points to NAT GW in SAME AZ
# Avoids cross-AZ data transfer charges ($0.01/GB)
Cost savings summary:
Before optimization: $800/month
After:
S3 Gateway Endpoint: -$300/month (removed NAT S3 traffic)
DynamoDB Endpoint: -$150/month
ECR/CloudWatch Endpoints: -$100/month
AZ-local NAT routing: -$50/month
─────────────
Total savings: -$600/month
New cost: $200/month (75% reduction)
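The summary does not show what the interface endpoints themselves cost. A rough break-even can be sketched as follows, assuming $0.01 per hour per endpoint per AZ (the figure used in this guide), an assumed $0.01 per GB of endpoint data processing, and the $0.045 per GB NAT rate quoted earlier. All three rates are assumptions to verify against current pricing.

```python
HOURS_PER_MONTH = 730


def endpoint_monthly_cost(azs: int, gb: float,
                          hourly: float = 0.01, per_gb: float = 0.01) -> float:
    """Monthly cost of one interface endpoint deployed in `azs` AZs."""
    return azs * hourly * HOURS_PER_MONTH + gb * per_gb


def nat_data_cost(gb: float, per_gb: float = 0.045) -> float:
    """NAT data-processing cost for the same traffic."""
    return gb * per_gb


def break_even_gb(azs: int, hourly: float = 0.01,
                  endpoint_per_gb: float = 0.01, nat_per_gb: float = 0.045) -> float:
    """Monthly traffic above which the endpoint is cheaper than NAT."""
    return azs * hourly * HOURS_PER_MONTH / (nat_per_gb - endpoint_per_gb)


print(round(break_even_gb(3), 1))
```

Under these assumptions, a service that moves less than roughly 626 GB per month through a three-AZ endpoint is cheaper to leave on NAT, so interface endpoints are best added for the heavy talkers first.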
Ans:
┌─────────────────────────────────────────────────────────────────────┐
│ SECURITY GROUP vs NACL │
├───────────────────────────┬─────────────────────────────────────────┤
│ SECURITY GROUP │ NACL │
├───────────────────────────┼─────────────────────────────────────────┤
│ Instance-level firewall │ Subnet-level firewall │
│ Stateful (tracks conn.) │ Stateless (each direction explicit) │
│ Allow rules only │ Allow AND Deny rules │
│ All rules evaluated │ Rules processed in number order │
│ Applied to ENI │ Applied to subnet boundary │
│ Default: deny all inbound │ Default VPC NACL: allow all │
│ Default: allow all out │ Custom NACL: deny all (both ways) │
└───────────────────────────┴─────────────────────────────────────────┘
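The evaluation difference in the table can be illustrated with a toy model. This sketch covers a single inbound direction only and does not model statefulness; `Rule`, `nacl_decision` and `sg_decision` are hypothetical names for illustration, not AWS APIs.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_no: int      # used by NACLs only; security groups ignore ordering
    action: str       # "allow" or "deny" (security groups only have "allow")
    cidr: str
    from_port: int
    to_port: int


def _matches(rule: Rule, src_ip: str, port: int) -> bool:
    in_cidr = ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.cidr)
    return in_cidr and rule.from_port <= port <= rule.to_port


def nacl_decision(rules: list, src_ip: str, port: int) -> str:
    """NACL: lowest rule number that matches wins; otherwise implicit deny."""
    for rule in sorted(rules, key=lambda r: r.rule_no):
        if _matches(rule, src_ip, port):
            return rule.action
    return "deny"


def sg_decision(rules: list, src_ip: str, port: int) -> str:
    """Security group: allowed if any rule matches; there is no deny rule."""
    return "allow" if any(_matches(r, src_ip, port) for r in rules) else "deny"


NACL_INGRESS = [
    Rule(100, "allow", "0.0.0.0/0", 443, 443),
    Rule(50, "deny", "198.51.100.0/24", 0, 65535),
]
SG_INGRESS = [Rule(0, "allow", "0.0.0.0/0", 443, 443)]

print(nacl_decision(NACL_INGRESS, "198.51.100.7", 443))
print(sg_decision(SG_INGRESS, "198.51.100.7", 443))
```

The same source address is denied by the NACL (rule 50 is evaluated before rule 100) but allowed by the security group, which is why blocking a CIDR has to happen at the NACL layer.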
Real scenario — E-commerce app with compliance requirements:
Scenario: PCI-DSS compliant payment processing
Requirements:
- Only HTTPS (443) from internet to web tier
- Only port 8080 from web tier to app tier
- Only port 5432 (PostgreSQL) from app tier to DB tier
- BLOCK specific known malicious IPs at subnet boundary
- Log all traffic for compliance
Security Groups (stateful — instance level):
# Web Tier Security Group
resource "aws_security_group" "web" {
name = "prod-web-sg"
description = "Web tier - ALB to EC2"
vpc_id = aws_vpc.main.id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTPS from internet"
}
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTP redirect to HTTPS"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow all outbound"
}
}
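# App Tier Security Group - only from web tier
resource "aws_security_group" "app" {
name = "prod-app-sg"
vpc_id = aws_vpc.main.id
ingress {
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.web.id] # only from web SG!
description = "App traffic from web tier only"
}
# Terraform removes the default allow-all egress rule, so declare egress explicitly;
# without it the app tier cannot open connections to the DB tier or reach NAT.
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow all outbound (tighten to 5432 and 443 where policy requires)"
}
}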
# DB Tier Security Group - only from app tier
resource "aws_security_group" "db" {
name = "prod-db-sg"
vpc_id = aws_vpc.main.id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app.id] # only from app SG!
description = "PostgreSQL from app tier only"
}
}
NACLs (stateless — subnet level):
# NACL for public subnet — additional layer + block bad IPs
resource "aws_network_acl" "public" {
vpc_id = aws_vpc.main.id
subnet_ids = aws_subnet.public[*].id
# DENY known malicious IPs first (low rule numbers = evaluated first)
ingress {
rule_no = 50
protocol = "-1"
action = "deny"
cidr_block = "198.51.100.0/24" # known malicious CIDR
from_port = 0
to_port = 0
}
# ALLOW HTTPS inbound
ingress {
rule_no = 100
protocol = "tcp"
action = "allow"
cidr_block = "0.0.0.0/0"
from_port = 443
to_port = 443
}
# ALLOW HTTP inbound (for redirect)
ingress {
rule_no = 110
protocol = "tcp"
action = "allow"
cidr_block = "0.0.0.0/0"
from_port = 80
to_port = 80
}
# ALLOW return traffic (ephemeral ports) — STATELESS so needed!
ingress {
rule_no = 120
protocol = "tcp"
action = "allow"
cidr_block = "0.0.0.0/0"
from_port = 1024
to_port = 65535
}
# ALLOW HTTPS outbound (to internet for NAT traffic)
egress {
rule_no = 100
protocol = "tcp"
action = "allow"
cidr_block = "0.0.0.0/0"
from_port = 443
to_port = 443
}
# ALLOW return traffic outbound (for responses to clients)
egress {
rule_no = 110
protocol = "tcp"
action = "allow"
cidr_block = "0.0.0.0/0"
from_port = 1024
to_port = 65535
}
}
# DB Subnet NACL — very restrictive
resource "aws_network_acl" "db" {
vpc_id = aws_vpc.main.id
subnet_ids = aws_subnet.private_db[*].id
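# Only allow from the app subnets (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24).
# 10.0.10.0/22 is not a valid CIDR block; 10.0.10.0/23 plus 10.0.12.0/24 covers exactly these three.
ingress {
rule_no = 100
protocol = "tcp"
action = "allow"
cidr_block = "10.0.10.0/23" # app subnets A and B
from_port = 5432
to_port = 5432
}
ingress {
rule_no = 110
protocol = "tcp"
action = "allow"
cidr_block = "10.0.12.0/24" # app subnet C
from_port = 5432
to_port = 5432
}
# Return traffic to app subnets
egress {
rule_no = 100
protocol = "tcp"
action = "allow"
cidr_block = "10.0.10.0/23"
from_port = 1024
to_port = 65535
}
egress {
rule_no = 110
protocol = "tcp"
action = "allow"
cidr_block = "10.0.12.0/24"
from_port = 1024
to_port = 65535
}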
# DENY everything else (implicit, but explicit is better for audits)
ingress {
rule_no = 32766
protocol = "-1"
action = "deny"
cidr_block = "0.0.0.0/0"
from_port = 0
to_port = 0
}
}
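When NACL rules reference a range of subnets, it is worth checking that the CIDR is aligned and covers exactly what is intended. The sketch below uses Python's `ipaddress` module to test a candidate such as 10.0.10.0/22 against the three app subnets from the architecture; `is_aligned` and `summarize` are illustrative helper names.

```python
import ipaddress


def is_aligned(cidr: str) -> bool:
    """True if the address is the real network address for its prefix length."""
    try:
        ipaddress.ip_network(cidr, strict=True)
        return True
    except ValueError:
        return False


def summarize(cidrs: list) -> list:
    """Smallest set of aligned CIDR blocks covering exactly the given subnets."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [str(net) for net in ipaddress.collapse_addresses(nets)]


app_subnets = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]

print(is_aligned("10.0.10.0/22"))
print(ipaddress.ip_network("10.0.10.0/22", strict=False))
print(summarize(app_subnets))
```

The /22 containing 10.0.10.0 actually starts at 10.0.8.0 and ends at 10.0.11.255, so it would include two unrelated /24s and miss app subnet C; the exact cover is a /23 plus a /24.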
When NACL beats Security Group:
1. Block a specific IP range quickly (can't deny with SG)
2. Subnet-level control boundary (denied packets show up as REJECT in VPC Flow Logs)
3. Emergency lockdown of entire subnet
4. Defense in depth for compliance (PCI-DSS, HIPAA)
5. Prevent lateral movement between subnets (if SG misconfigured)
Ans:
VPC Peering Setup:
Production VPC: 10.0.0.0/16 (us-east-1)
Analytics VPC: 172.16.0.0/16 (us-east-1)
VPC Peering Connection
Production VPC ◄────────────────────────► Analytics VPC
10.0.0.0/16 172.16.0.0/16
Only traffic allowed:
Analytics EC2 (172.16.x.x) → Prod DB (10.0.20.x) on port 5432
Return traffic (stateful SG handles this)
Traffic NOT allowed:
Analytics → Prod App Servers (not needed)
Analytics → Internet via Prod NAT (no transitive routing!)
Step-by-step setup:
# Step 1: Create peering connection
aws ec2 create-vpc-peering-connection \
--vpc-id vpc-prod-0abc123 \
--peer-vpc-id vpc-analytics-0def456 \
--peer-region us-east-1
# Output: pcx-0abc123def456
# Step 2: Accept the peering request
aws ec2 accept-vpc-peering-connection \
--vpc-peering-connection-id pcx-0abc123def456
# Step 3: Add routes in BOTH VPCs
# In Production VPC route tables (DB subnet only):
aws ec2 create-route \
--route-table-id rtb-prod-db \
--destination-cidr-block 172.16.0.0/16 \
--vpc-peering-connection-id pcx-0abc123def456
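# In Analytics VPC route tables (only the DB subnet; add 10.0.21.0/24 as well
# if the Multi-AZ standby in AZ B must stay reachable after a failover):
aws ec2 create-route \
--route-table-id rtb-analytics-private \
--destination-cidr-block 10.0.20.0/24 \
--vpc-peering-connection-id pcx-0abc123def456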
# Step 4: Update Security Groups
# Prod DB security group — allow analytics VPC CIDR
aws ec2 authorize-security-group-ingress \
--group-id sg-prod-db \
--protocol tcp \
--port 5432 \
--cidr 172.16.0.0/16
Terraform:
resource "aws_vpc_peering_connection" "prod_to_analytics" {
peer_owner_id = data.aws_caller_identity.current.account_id
peer_vpc_id = aws_vpc.analytics.id
vpc_id = aws_vpc.prod.id
auto_accept = true # only works if same account & region
tags = {
Name = "prod-to-analytics-peering"
Side = "Requester"
}
}
# Route from Analytics to Prod DB subnet only
resource "aws_route" "analytics_to_prod_db" {
route_table_id = aws_route_table.analytics_private.id
destination_cidr_block = "10.0.20.0/24" # Only DB subnet
vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_analytics.id
}
VPC Peering Limitations (critical for interview):
❌ NO transitive routing:
VPC-A ─── peered ─── VPC-B ─── peered ─── VPC-C
VPC-A CANNOT reach VPC-C through VPC-B
→ Solution: Use Transit Gateway for hub-and-spoke
❌ NO overlapping CIDRs:
VPC-A: 10.0.0.0/16 + VPC-B: 10.0.0.0/16 → CANNOT peer
→ Plan CIDRs before creating VPCs
❌ NO edge-to-edge routing:
Cannot use VPC peering to route through:
- Another VPC's VPN connection
- Another VPC's Internet Gateway
- Another VPC's NAT Gateway
- Another VPC's AWS Direct Connect
✅ VPC Peering WORKS for:
Low-latency, high-bandwidth VPC-to-VPC traffic
Same or different accounts
Same or different regions (inter-region peering)
No bandwidth bottleneck (not a gateway)
Ans:
The Problem with VPC Peering at scale:
15 VPCs with full mesh peering = n(n-1)/2 = 15×14/2 = 105 peering connections!
Each connection needs:
- Route table entries in BOTH VPCs
- Security group rules
- Independent monitoring
This is unmanageable. Enter Transit Gateway.
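The growth is quadratic, which a short Python sketch makes concrete. It compares the number of peering connections in a full mesh with the number of Transit Gateway attachments (one per VPC); the function names are illustrative.

```python
def full_mesh_peerings(vpcs: int) -> int:
    """Peering connections needed so every VPC can reach every other VPC."""
    return vpcs * (vpcs - 1) // 2


def tgw_attachments(vpcs: int) -> int:
    """Transit Gateway needs one attachment per VPC."""
    return vpcs


for n in (5, 15, 50):
    print(n, full_mesh_peerings(n), tgw_attachments(n))
```

At 50 VPCs the mesh needs 1,225 connections against 50 attachments, which is why hub-and-spoke becomes the default well before that point.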
Transit Gateway Architecture:
┌──────────────────────────┐
│ TRANSIT GATEWAY │
│ (Central Hub) │
│ │
│ Route Table: │
│ 10.0.0.0/16 → VPC-Prod │
│ 172.16.0.0/16→ VPC-Anl │
│ 192.168.0.0/16→VPC-Dev │
│ 10.200.0.0/16→On-Prem │
└─────────┬────────────────┘
│
┌────────────────────┼────────────────────┐
▼ ▼ ▼
VPC-Production VPC-Analytics VPC-Dev
(10.0.0.0/16) (172.16.0.0/16) (192.168.0.0/16)
Account: Prod Account: Data Account: Dev
│ │ │
▼ ▼ ▼
TGW Attachment TGW Attachment TGW Attachment
(subnet in each AZ) (subnet in each AZ) (subnet in each AZ)
│
▼
VPN/Direct Connect
(On-Premises: 10.200.0.0/16)
Terraform Implementation:
# Create Transit Gateway (in shared network account)
resource "aws_ec2_transit_gateway" "main" {
description = "Central TGW for all VPCs"
amazon_side_asn = 64512
default_route_table_association = "disable" # we manage route tables manually
default_route_table_propagation = "disable"
dns_support = "enable"
vpn_ecmp_support = "enable"
multicast_support = "disable"
tags = {
Name = "org-central-tgw"
Environment = "shared"
ManagedBy = "terraform"
}
}
# Share TGW with other accounts via RAM
resource "aws_ram_resource_share" "tgw" {
name = "tgw-share"
allow_external_principals = false # org accounts only
}
resource "aws_ram_resource_association" "tgw" {
resource_arn = aws_ec2_transit_gateway.main.arn
resource_share_arn = aws_ram_resource_share.tgw.arn
}
# Attach Production VPC to TGW
resource "aws_ec2_transit_gateway_vpc_attachment" "prod" {
subnet_ids = aws_subnet.prod_private[*].id # attach in each AZ
transit_gateway_id = aws_ec2_transit_gateway.main.id
vpc_id = aws_vpc.prod.id
transit_gateway_default_route_table_association = false
transit_gateway_default_route_table_propagation = false
tags = { Name = "prod-vpc-tgw-attachment" }
}
# Custom TGW Route Tables for traffic segmentation
# Prod Route Table — can reach shared services, NOT dev
resource "aws_ec2_transit_gateway_route_table" "prod" {
transit_gateway_id = aws_ec2_transit_gateway.main.id
tags = { Name = "prod-tgw-rt" }
}
# Associate prod VPC with prod route table
resource "aws_ec2_transit_gateway_route_table_association" "prod" {
transit_gateway_attachment_id = aws_ec2_transit_gateway_vpc_attachment.prod.id
transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
# Static routes in TGW route table
resource "aws_ec2_transit_gateway_route" "prod_to_shared" {
destination_cidr_block = "172.16.0.0/16" # Shared services VPC
transit_gateway_attachment_id = aws_ec2_transit_gateway_vpc_attachment.shared.id
transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
# Route in VPC to send traffic through TGW
resource "aws_route" "prod_to_tgw" {
count = length(aws_route_table.prod_private)
route_table_id = aws_route_table.prod_private[count.index].id
destination_cidr_block = "10.0.0.0/8" # all 10.x ranges (other VPCs, on-prem 10.200.0.0/16) go to the TGW; the VPC's own local route still wins
transit_gateway_id = aws_ec2_transit_gateway.main.id
}
Network Segmentation with TGW Route Tables:
TGW Route Table: PROD
→ Can reach: Shared-Services VPC, On-Premises
→ Cannot reach: Dev VPC, Staging VPC
TGW Route Table: DEV
→ Can reach: Shared-Services VPC only
→ Cannot reach: Prod VPC, On-Premises
TGW Route Table: SHARED-SERVICES
→ Can reach: All VPCs (it serves them all)
This enforces: Dev cannot accidentally connect to Prod
Even if someone misconfigures a security group
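The segmentation rules above can be modelled as a reachability check. This is a simplified sketch: each key stands for the TGW route table associated with an attachment, and the value is the set of destinations that table has routes for. The `staging` and `onprem` entries are assumptions added to make the model complete; they are not spelled out in the listing above.

```python
TGW_ROUTES = {
    "prod": {"shared", "onprem"},
    "dev": {"shared"},
    "shared": {"prod", "dev", "staging", "onprem"},
    "staging": {"shared"},           # assumption: mirrors the dev policy
    "onprem": {"prod", "shared"},    # assumption: return routes for on-prem
}


def can_reach(src: str, dst: str, routes: dict = TGW_ROUTES) -> bool:
    """True if the route table associated with `src` has a route to `dst`."""
    return dst in routes.get(src, set())


def can_talk(a: str, b: str, routes: dict = TGW_ROUTES) -> bool:
    """Two-way traffic needs a route in both directions."""
    return can_reach(a, b, routes) and can_reach(b, a, routes)


for pair in (("prod", "shared"), ("dev", "prod"), ("dev", "onprem")):
    print(pair, can_talk(*pair))
```

Because dev has no route to prod at the TGW layer, a permissive security group in either VPC cannot open that path on its own.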
Ans:
Without endpoints (bad):
Pod → Private Subnet → NAT Gateway → Internet → ECR/CloudWatch
Cost: NAT data processing charges + internet latency
Risk: Traffic leaves AWS network
With endpoints (good):
Pod → Private Subnet → VPC Endpoint → ECR/CloudWatch (stays in AWS network)
Cost: Interface endpoint hourly charge ($0.01/hr per endpoint per AZ) plus per-GB data processing
Benefit: No NAT cost, lower latency, more secure
Required endpoints for EKS + ECR + CloudWatch:
locals {
# All endpoints needed for EKS nodes in private subnets
interface_endpoints = [
"ecr.api", # ECR API calls
"ecr.dkr", # ECR Docker registry (image pulls)
"ec2", # EC2 API for node bootstrap
"ec2messages", # SSM Agent message delivery (Run Command)
"ssm", # Systems Manager
"ssmmessages", # SSM Session Manager WebSocket
"logs", # CloudWatch Logs
"monitoring", # CloudWatch Metrics
"sts", # Security Token Service (IRSA)
"elasticloadbalancing", # ALB controller
"autoscaling", # Cluster autoscaler
"xray", # X-Ray tracing
"secretsmanager", # Secrets Manager
"kms", # KMS for envelope encryption
]
}
# Security group for VPC endpoints
resource "aws_security_group" "vpc_endpoints" {
name = "vpc-endpoints-sg"
description = "Allow HTTPS from VPC to endpoints"
vpc_id = aws_vpc.main.id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [aws_vpc.main.cidr_block]
description = "HTTPS from VPC CIDR"
}
}
# Create all interface endpoints
resource "aws_vpc_endpoint" "interface" {
for_each = toset(local.interface_endpoints)
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.${each.key}"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private_app[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true # Critical: overrides DNS so no code changes needed!
tags = {
Name = "prod-endpoint-${each.key}"
}
}
# Gateway endpoints (FREE — always use these!)
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = concat(
aws_route_table.private[*].id,
[aws_route_table.public.id]
)
tags = { Name = "prod-s3-gateway-endpoint" }
}
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.dynamodb"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
tags = { Name = "prod-dynamodb-gateway-endpoint" }
}
Verify endpoints are working:
# From an EC2/pod in private subnet (no internet access):
# Test ECR endpoint
aws ecr get-login-password --region us-east-1
# Should work WITHOUT internet access
# Test S3 endpoint
aws s3 ls s3://my-bucket
# Should work WITHOUT internet access
# Check DNS resolution (should resolve to private IP)
nslookup api.ecr.us-east-1.amazonaws.com
# Should return 10.x.x.x (VPC private IP), not public IP
# If it returns a public IP: private_dns_enabled is false or there is a DNS issue
# Test CloudWatch Logs
aws logs put-log-events \
--log-group-name /eks/my-cluster \
--log-stream-name test-stream \
--log-events timestamp=$(date +%s000),message="test"
Ans:
┌──────────────────────────────────────────────────────────────────┐
│ VPN vs Direct Connect Comparison │
├─────────────────────┬────────────────────────────────────────────┤
│ Site-to-Site VPN │ AWS Direct Connect │
├─────────────────────┼────────────────────────────────────────────┤
│ Over public internet│ Dedicated private fiber connection │
│ Encrypted (IPsec) │ Not encrypted by default (add VPN on top) │
│ Up to 1.25 Gbps │ 1 Gbps to 100 Gbps │
│ ~$36/month │ $200-$2000+/month + port fee │
│ Setup: hours │ Setup: weeks to months │
│ SLA: none │ SLA: 99.99% with redundant connections │
│ Latency: variable │ Latency: consistent, low │
│ Good for: dev/test │ Good for: production, large data transfer │
└─────────────────────┴────────────────────────────────────────────┘
Production Recommended: Direct Connect + VPN Backup:
On-Premises DC
│
├──── AWS Direct Connect (primary) ──── Virtual Private Gateway
│ 1 Gbps dedicated fiber │
│ BGP routing │
│ ▼
└──── Site-to-Site VPN (backup) ──────── AWS VPC
over internet (10.0.0.0/16)
IPsec encrypted
auto-failover via BGP
Site-to-Site VPN Setup:
# Step 1: Create Customer Gateway (represents your on-prem VPN device)
# --public-ip is the public IP of your on-prem device
aws ec2 create-customer-gateway \
--type ipsec.1 \
--public-ip 203.0.113.10 \
--bgp-asn 65000 \
--tag-specifications 'ResourceType=customer-gateway,Tags=[{Key=Name,Value=corp-dc-cgw}]'
# Step 2: Create Virtual Private Gateway
aws ec2 create-vpn-gateway \
--type ipsec.1 \
--amazon-side-asn 64512
# Attach to VPC
aws ec2 attach-vpn-gateway \
--vpn-gateway-id vgw-0abc123 \
--vpc-id vpc-0abc123
# Step 3: Create VPN Connection
aws ec2 create-vpn-connection \
--type ipsec.1 \
--customer-gateway-id cgw-0abc123 \
--vpn-gateway-id vgw-0abc123 \
--options StaticRoutesOnly=false # Use BGP for dynamic routing
# Step 4: Download VPN configuration for your device
aws ec2 describe-vpn-connections \
--vpn-connection-ids vpn-0abc123
# Step 5: Enable route propagation in route tables
aws ec2 enable-vgw-route-propagation \
--route-table-id rtb-0abc123 \
--gateway-id vgw-0abc123
# On-prem routes now automatically appear in VPC route table via BGP!
Terraform:
resource "aws_customer_gateway" "on_prem" {
bgp_asn = 65000
ip_address = "203.0.113.10" # on-prem public IP
type = "ipsec.1"
tags = { Name = "corp-datacenter-cgw" }
}
resource "aws_vpn_gateway" "main" {
vpc_id = aws_vpc.main.id
amazon_side_asn = 64512
tags = { Name = "prod-vgw" }
}
resource "aws_vpn_connection" "main" {
vpn_gateway_id = aws_vpn_gateway.main.id
customer_gateway_id = aws_customer_gateway.on_prem.id
type = "ipsec.1"
static_routes_only = false # BGP
tags = { Name = "corp-to-aws-vpn" }
}
# Enable BGP route propagation
resource "aws_vpn_gateway_route_propagation" "private" {
count = length(aws_route_table.private)
route_table_id = aws_route_table.private[count.index].id
vpn_gateway_id = aws_vpn_gateway.main.id
}
Ans:
VPC DNS fundamentals:
Every VPC has:
DNS Server IP: VPC CIDR base + 2
Example: VPC 10.0.0.0/16 → DNS at 10.0.0.2
Default DNS names for EC2:
Private: ip-10-0-1-100.ec2.internal (us-east-1; other regions use ip-10-0-1-100.<region>.compute.internal)
Public: ec2-54-123-45-67.compute-1.amazonaws.com
Required VPC settings for custom DNS:
enableDnsSupport = true (use AWS DNS server)
enableDnsHostnames = true (assign DNS hostnames to instances)
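The "base + 2" rule and the default private hostname pattern can be computed directly. The sketch below assumes IP-based instance naming; `vpc_resolver_ip` and `private_hostname` are hypothetical helper names.

```python
import ipaddress


def vpc_resolver_ip(vpc_cidr: str) -> str:
    """Address of the Amazon-provided DNS server: VPC network address plus two."""
    return str(ipaddress.ip_network(vpc_cidr).network_address + 2)


def private_hostname(private_ip: str, region: str) -> str:
    """Default IP-based private DNS name for an instance."""
    # us-east-1 uses the legacy ec2.internal suffix; other regions include the region name.
    suffix = "ec2.internal" if region == "us-east-1" else f"{region}.compute.internal"
    return f"ip-{private_ip.replace('.', '-')}.{suffix}"


print(vpc_resolver_ip("10.0.0.0/16"))
print(private_hostname("10.0.1.100", "us-east-1"))
print(private_hostname("10.0.1.100", "us-west-2"))
```

Knowing the resolver address is useful when writing security group or NACL rules for DNS, and when pointing on-prem forwarders at a Route 53 Resolver inbound endpoint instead of the ".2" address, which is not reachable from outside the VPC.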
Private Hosted Zone for microservices:
# Create private hosted zone for your domain
resource "aws_route53_zone" "internal" {
name = "internal.mycompany.com"
comment = "Private hosted zone for VPC service discovery"
vpc {
vpc_id = aws_vpc.main.id
}
# Associate with additional VPCs (peered or TGW-connected)
vpc {
vpc_id = aws_vpc.staging.id
}
tags = { Name = "internal-dns-zone" }
}
# DNS records for services
resource "aws_route53_record" "payment_service" {
zone_id = aws_route53_zone.internal.zone_id
name = "payment.internal.mycompany.com"
type = "A"
alias {
name = aws_lb.payment.dns_name
zone_id = aws_lb.payment.zone_id
evaluate_target_health = true
}
}
resource "aws_route53_record" "database" {
zone_id = aws_route53_zone.internal.zone_id
name = "postgres.internal.mycompany.com"
type = "CNAME"
ttl = 60
records = [aws_db_instance.prod.address]
}
# SRV record for service discovery (gRPC/Consul style)
resource "aws_route53_record" "auth_srv" {
zone_id = aws_route53_zone.internal.zone_id
name = "_grpc._tcp.auth.internal.mycompany.com"
type = "SRV"
ttl = 30
records = ["10 5 50051 auth.internal.mycompany.com"]
}
Route 53 Resolver for hybrid DNS (on-prem ↔ AWS):
# Inbound endpoint: on-prem DNS queries → AWS private zones
resource "aws_route53_resolver_endpoint" "inbound" {
name = "inbound-from-onprem"
direction = "INBOUND"
security_group_ids = [aws_security_group.resolver.id]
# Place resolvers in multiple AZs for HA
ip_address {
subnet_id = aws_subnet.private_app[0].id
}
ip_address {
subnet_id = aws_subnet.private_app[1].id
}
}
# Outbound endpoint: AWS → on-prem DNS queries
resource "aws_route53_resolver_endpoint" "outbound" {
name = "outbound-to-onprem"
direction = "OUTBOUND"
security_group_ids = [aws_security_group.resolver.id]
ip_address {
subnet_id = aws_subnet.private_app[0].id
}
ip_address {
subnet_id = aws_subnet.private_app[1].id
}
}
# Forward on-prem domain queries to on-prem DNS
resource "aws_route53_resolver_rule" "forward_onprem" {
domain_name = "corp.mycompany.com" # on-prem domain
name = "forward-to-onprem-dns"
rule_type = "FORWARD"
resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id
target_ip {
ip = "10.200.1.10" # on-prem DNS server 1
port = 53
}
target_ip {
ip = "10.200.1.11" # on-prem DNS server 2
port = 53
}
}
# The forwarding rule only takes effect once it is associated with a VPC
resource "aws_route53_resolver_rule_association" "forward_onprem" {
resolver_rule_id = aws_route53_resolver_rule.forward_onprem.id
vpc_id = aws_vpc.main.id
}
Ans:
Enable VPC Flow Logs:
# CloudWatch Logs destination for flow logs
resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
name = "/aws/vpc/flow-logs/${aws_vpc.main.id}"
retention_in_days = 90
kms_key_id = aws_kms_key.logs.arn
}
resource "aws_flow_log" "main" {
iam_role_arn = aws_iam_role.flow_logs.arn
log_destination = aws_cloudwatch_log_group.vpc_flow_logs.arn
traffic_type = "ALL" # ACCEPT, REJECT, or ALL
vpc_id = aws_vpc.main.id
# Enhanced format (v3+) includes more fields
log_format = "$${version} $${account-id} $${interface-id} $${srcaddr} $${dstaddr} $${srcport} $${dstport} $${protocol} $${packets} $${bytes} $${start} $${end} $${action} $${log-status} $${vpc-id} $${subnet-id} $${instance-id} $${tcp-flags} $${type} $${pkt-srcaddr} $${pkt-dstaddr}"
tags = { Name = "prod-vpc-flow-logs" }
}
# Also send to S3 for Athena querying (better for large-scale analysis)
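resource "aws_flow_log" "s3" {
log_destination_type = "s3"
log_destination = "${aws_s3_bucket.flow_logs.arn}/flow-logs/"
traffic_type = "ALL"
vpc_id = aws_vpc.main.id
# Reuse the custom format above so the Athena table columns line up
log_format = aws_flow_log.main.log_format
}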
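To make the custom format concrete, here is a minimal sketch that splits one space-delimited record into named fields, in the same order as the `log_format` string above. The sample record is made up for illustration, and `parse_flow_record` is a hypothetical helper, not an AWS API.

```python
# Field order mirrors the custom log_format (hyphens replaced by underscores)
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes", "start", "end",
    "action", "log_status", "vpc_id", "subnet_id", "instance_id",
    "tcp_flags", "type", "pkt_srcaddr", "pkt_dstaddr",
]
INT_FIELDS = {"version", "srcport", "dstport", "protocol", "packets",
              "bytes", "start", "end", "tcp_flags"}


def parse_flow_record(line):
    """Parse one flow log line into a dict; '-' marks a field with no data."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        if record[name] != "-":
            record[name] = int(record[name])
    return record


# Made-up sample record for illustration only
sample = ("3 123456789012 eni-0abc123 10.0.10.50 203.0.113.9 54321 443 6 10 8400 "
          "1700000000 1700000060 ACCEPT OK vpc-0abc123 subnet-0abc123 i-0abc123 "
          "19 IPv4 10.0.10.50 203.0.113.9")
rec = parse_flow_record(sample)
print(rec["srcaddr"], rec["dstaddr"], rec["dstport"], rec["bytes"], rec["action"])
```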
Investigating data exfiltration:
# Scenario: Suspicious EC2 instance i-0abc123 sending large amounts of data
# to an external IP at 2 AM
# Query 1: Find all connections from the suspicious instance (CloudWatch Logs Insights)
# 10.0.10.50 is the suspicious instance's private IP
# start-query returns a queryId; fetch rows with: aws logs get-query-results --query-id <queryId>
aws logs start-query \
--log-group-name "/aws/vpc/flow-logs/vpc-0abc123" \
--start-time $(date -d '24 hours ago' +%s) \
--end-time $(date +%s) \
--query-string '
fields @timestamp, srcaddr, dstaddr, dstport, bytes, action
| filter srcaddr = "10.0.10.50"
| filter action = "ACCEPT"
| sort bytes desc
| limit 50
'
# Query 2: Find large outbound data transfers (potential exfiltration)
# Source in the VPC (10.0.x.x), destination outside the RFC 1918 private ranges
aws logs start-query \
--log-group-name "/aws/vpc/flow-logs/vpc-0abc123" \
--start-time $(date -d '24 hours ago' +%s) \
--end-time $(date +%s) \
--query-string '
fields @timestamp, srcaddr, dstaddr, bytes, packets
| filter srcaddr like /^10\.0\./
| filter dstaddr not like /^10\./
| filter dstaddr not like /^172\.(1[6-9]|2[0-9]|3[01])\./
| filter dstaddr not like /^192\.168\./
| stats sum(bytes) as totalBytes by dstaddr, srcaddr
| sort totalBytes desc
| limit 20
'
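# Query 3: Unusual port activity (anything other than 80, 443, 53, 123)
aws logs start-query \
--log-group-name "/aws/vpc/flow-logs/vpc-0abc123" \
--start-time $(date -d '24 hours ago' +%s) \
--end-time $(date +%s) \
--query-string '
fields @timestamp, srcaddr, dstaddr, dstport, bytes
| filter srcaddr = "10.0.10.50"
| filter dstport not in [80, 443, 53, 123]
| filter action = "ACCEPT"
| sort @timestamp desc
'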
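The "external destination" filter in Query 2 depends on getting the private ranges right: only 172.16.0.0/12 is private, not all of 172.x. A small sketch using Python's standard `ipaddress` module shows the same RFC 1918 test without regular expressions. `is_external` is a hypothetical helper name.

```python
import ipaddress

# The three RFC 1918 private ranges
RFC1918 = [ipaddress.ip_network(cidr)
           for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_external(addr):
    """True when addr falls outside every RFC 1918 private range."""
    ip = ipaddress.ip_address(addr)
    return not any(ip in net for net in RFC1918)


for candidate in ("10.0.20.100", "172.16.5.4", "172.32.0.1", "8.8.8.8"):
    print(candidate, "external" if is_external(candidate) else "private")
```

Note that 172.32.0.1 is a public address, so a filter that excludes everything starting with `172.` would hide traffic to it.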
Athena for large-scale analysis (S3 flow logs):
-- Create Athena table for flow logs (columns must match the flow log format, in order)
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
version int,
account_id string,
interface_id string,
srcaddr string,
dstaddr string,
srcport int,
dstport int,
protocol int,
packets bigint,
bytes bigint,
`start` bigint,
`end` bigint,
action string,
log_status string,
vpc_id string,
subnet_id string,
instance_id string,
tcp_flags int,
type string,
pkt_srcaddr string,
pkt_dstaddr string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://my-flow-logs-bucket/flow-logs/'
TBLPROPERTIES ("skip.header.line.count"="1");
-- Register each day's S3 prefix with ALTER TABLE ... ADD PARTITION (or use partition projection) before querying
-- Top 10 external IPs receiving data from prod
SELECT
dstaddr AS external_ip,
SUM(bytes) / 1073741824.0 AS total_gb,
COUNT(DISTINCT srcaddr) AS source_instances,
MIN(from_unixtime("start")) AS first_seen,
MAX(from_unixtime("end")) AS last_seen
FROM vpc_flow_logs
WHERE year = '2024' AND month = '01'
AND action = 'ACCEPT'
AND srcaddr LIKE '10.%'
-- Exclude every RFC 1918 destination (10/8, 172.16/12, 192.168/16)
AND NOT regexp_like(dstaddr, '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)')
GROUP BY dstaddr
ORDER BY total_gb DESC
LIMIT 10;
-- Detect port scanning (many different ports to same destination)
SELECT
srcaddr,
dstaddr,
COUNT(DISTINCT dstport) AS unique_ports_scanned,
MIN(dstport) AS min_port,
MAX(dstport) AS max_port
FROM vpc_flow_logs
WHERE year = '2024' AND month = '01'
GROUP BY srcaddr, dstaddr
HAVING COUNT(DISTINCT dstport) > 50
ORDER BY unique_ports_scanned DESC;
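The port-scan query above groups flows by source and destination and counts distinct destination ports. The same idea can be sketched over already-parsed records, for example when triaging a small sample locally. The records below are synthetic, and `find_port_scans` is a hypothetical helper.

```python
from collections import defaultdict


def find_port_scans(records, threshold=50):
    """Return (src, dst, distinct_port_count) for pairs above the threshold."""
    ports_by_pair = defaultdict(set)
    for rec in records:
        ports_by_pair[(rec["srcaddr"], rec["dstaddr"])].add(rec["dstport"])
    hits = [(src, dst, len(ports))
            for (src, dst), ports in ports_by_pair.items()
            if len(ports) > threshold]
    # Largest port spread first, like ORDER BY unique_ports_scanned DESC
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Synthetic data: one host probing 60 ports, one host making a normal DB connection
records = [{"srcaddr": "10.0.10.50", "dstaddr": "10.0.20.100", "dstport": port}
           for port in range(1, 61)]
records.append({"srcaddr": "10.0.10.51", "dstaddr": "10.0.20.100", "dstport": 5432})
print(find_port_scans(records))
```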
Ans:
Centralized Inspection Architecture:
Internet
│
▼
Internet Gateway (IGW)
│
▼
┌──────────────────────────────────────┐
│ INSPECTION VPC │
│ 10.100.0.0/16 │
│ │
│ Public Subnet (ALB/NLB) │
│ │ │
│ ▼ │
│ AWS Network Firewall │
│ (stateful + stateless rules) │
│ - Block malicious IPs │
│ - Deep packet inspection │
│ - TLS inspection │
│ - Domain filtering │
│ │ │
│ ▼ │
│ Private Subnet (Firewall Endpoints) │
└──────────────────┬───────────────────┘
│
▼ (via Transit Gateway)
┌─────────┼──────────┐
▼ ▼ ▼
Prod VPC Staging VPC Dev VPC
# Network Firewall Rule Group — stateless (fast path)
resource "aws_networkfirewall_rule_group" "block_bad_ips" {
capacity = 100
name = "block-known-malicious-ips"
type = "STATELESS"
rule_group {
rules_source {
stateless_rules_and_custom_actions {
stateless_rule {
priority = 1
rule_definition {
actions = ["aws:drop"]
match_attributes {
source {
address_definition = "198.51.100.0/24" # known bad
}
source {
address_definition = "203.0.113.0/24"
}
}
}
}
}
}
}
}
# Stateful rule group — domain filtering
resource "aws_networkfirewall_rule_group" "domain_filter" {
capacity = 100
name = "deny-malicious-domains"
type = "STATEFUL"
rule_group {
rules_source {
rules_source_list {
generated_rules_type = "DENYLIST"
target_types = ["HTTP_HOST", "TLS_SNI"]
targets = [
"malware-distribution.com",
"crypto-miner.net",
".onion",
]
}
}
}
}
# Stateful rule: drop SSH (TCP/22) passing through the firewall
# Add a second rule for port 3389 if RDP should be blocked as well
resource "aws_networkfirewall_rule_group" "block_ssh" {
capacity = 50
name = "block-inbound-ssh"
type = "STATEFUL"
rule_group {
rules_source {
stateful_rule {
action = "DROP"
header {
destination = "ANY"
destination_port = "22"
direction = "ANY"
protocol = "TCP"
source = "ANY"
source_port = "ANY"
}
rule_option {
keyword = "sid"
settings = ["1"]
}
}
}
}
}
# Firewall Policy
resource "aws_networkfirewall_firewall_policy" "main" {
name = "central-firewall-policy"
firewall_policy {
stateless_default_actions = ["aws:forward_to_sfe"]
stateless_fragment_default_actions = ["aws:forward_to_sfe"]
stateless_rule_group_reference {
priority = 1
resource_arn = aws_networkfirewall_rule_group.block_bad_ips.arn
}
stateful_rule_group_reference {
resource_arn = aws_networkfirewall_rule_group.domain_filter.arn
}
stateful_rule_group_reference {
resource_arn = aws_networkfirewall_rule_group.block_ssh.arn
}
}
}
# Network Firewall
resource "aws_networkfirewall_firewall" "main" {
name = "central-network-firewall"
firewall_policy_arn = aws_networkfirewall_firewall_policy.main.arn
vpc_id = aws_vpc.inspection.id
subnet_mapping {
subnet_id = aws_subnet.inspection_a.id
}
subnet_mapping {
subnet_id = aws_subnet.inspection_b.id
}
firewall_policy_change_protection = true # prevent accidental changes
subnet_change_protection = true
}
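The domain list in the stateful rule group mixes explicit names with a leading-dot entry (`.onion`). As a rough illustration, and assuming the documented behaviour that a leading dot matches the domain itself plus all of its subdomains while an explicit name matches only that exact host, the matching logic looks like this. `matches_domain_list` is a hypothetical helper and does not call Network Firewall.

```python
def matches_domain_list(hostname, targets):
    """Return True when hostname matches an entry in a domain list.

    Assumption: ".example.com" matches example.com and any subdomain,
    while "example.com" matches only that exact name.
    """
    host = hostname.rstrip(".").lower()
    for target in targets:
        entry = target.lower()
        if entry.startswith("."):
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False


DENYLIST = ["malware-distribution.com", "crypto-miner.net", ".onion"]
for name in ("malware-distribution.com", "cdn.malware-distribution.com",
             "hidden.onion", "example.com"):
    print(name, matches_domain_list(name, DENYLIST))
```

Under that assumption, `cdn.malware-distribution.com` would not be blocked by the list as written; adding `.malware-distribution.com` would cover subdomains.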
Ans:
Multi-Region Active-Active Architecture:
Global
│
▼
AWS Global Accelerator (anycast IPs: 75.2.x.x, 99.83.x.x)
│ Routing: latency-based to nearest healthy region
├──────────────────────────────────────────────────────────┐
▼ ▼
Region: us-east-1 Region: eu-west-1
┌─────────────────────────┐ ┌─────────────────────────┐
│ VPC: 10.0.0.0/16 │ │ VPC: 10.1.0.0/16 │
│ │ │ │
│ ALB (internet-facing) │ │ ALB (internet-facing) │
│ ↓ │ │ ↓ │
│ ECS/EKS (3 AZs) │◄─────────── │ ECS/EKS (3 AZs) │
│ ↓ │ VPC Peering │ ↓ │
│ Aurora Global Primary │─────────────►│ Aurora Global Replica │
│ (read + write) │ < 1s lag │ (can be promoted) │
│ ↓ │ │ ↓ │
│ ElastiCache Primary │ │ ElastiCache Replica │
└─────────────────────────┘ └─────────────────────────┘
Inter-Region VPC Peering setup:
# Peering connection between us-east-1 and eu-west-1
resource "aws_vpc_peering_connection" "us_to_eu" {
provider = aws.us-east-1
vpc_id = aws_vpc.us_east.id
peer_vpc_id = aws_vpc.eu_west.id
peer_region = "eu-west-1" # Cross-region peering
auto_accept = false # Must accept in peer region
tags = { Name = "us-east-1-to-eu-west-1" }
}
# Accept peering in eu-west-1
resource "aws_vpc_peering_connection_accepter" "eu_accept" {
provider = aws.eu-west-1
vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
auto_accept = true
}
# Routes in US for EU traffic (the EU route tables need the mirror-image route to 10.0.0.0/16)
resource "aws_route" "us_to_eu" {
provider = aws.us-east-1
count = length(aws_route_table.us_private)
route_table_id = aws_route_table.us_private[count.index].id
destination_cidr_block = "10.1.0.0/16" # EU VPC CIDR
vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
}
# Global Accelerator for anycast routing
resource "aws_globalaccelerator_accelerator" "main" {
name = "my-app-global-accelerator"
ip_address_type = "IPV4"
enabled = true
}
resource "aws_globalaccelerator_listener" "https" {
accelerator_arn = aws_globalaccelerator_accelerator.main.id
protocol = "TCP"
port_range {
from_port = 443
to_port = 443
}
}
resource "aws_globalaccelerator_endpoint_group" "us_east" {
listener_arn = aws_globalaccelerator_listener.https.id
endpoint_group_region = "us-east-1"
traffic_dial_percentage = 50
health_check_path = "/health"
health_check_protocol = "HTTPS"
health_check_interval_seconds = 10
threshold_count = 3
endpoint_configuration {
endpoint_id = aws_lb.us_alb.arn
weight = 100
client_ip_preservation_enabled = true
}
}
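VPC peering only works when the two VPC CIDRs do not overlap, which is why the regions above use 10.0.0.0/16 and 10.1.0.0/16. A quick pre-flight check with the standard `ipaddress` module can catch a clash before any Terraform is applied. The third CIDR in the second call is a hypothetical addition to show a failing case.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs_by_name):
    """Return the (name, name) pairs whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs_by_name.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


vpcs = {"us-east-1": "10.0.0.0/16", "eu-west-1": "10.1.0.0/16"}
print(overlapping_pairs(vpcs))

# Hypothetical third region whose CIDR sits inside the us-east-1 range
vpcs_with_clash = dict(vpcs, **{"ap-south-1": "10.0.128.0/17"})
print(overlapping_pairs(vpcs_with_clash))
```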
Ans:
Systematic Troubleshooting Framework:
Layer 7 (Application): Can the app connect?
Layer 4 (Transport): Is TCP port open? (telnet/nc)
Layer 3 (Network): Can packets reach the destination? (route tables, NACLs; ping only where ICMP is allowed)
Foundation: Is the instance running and in the expected subnet?
# STEP 1: Verify the app instance is running and in the correct subnet
# (the RDS instance is not an EC2 instance; it is checked in STEP 2)
aws ec2 describe-instances \
--instance-ids i-app-1234 \
--query 'Reservations[*].Instances[*].{ID:InstanceId,State:State.Name,Subnet:SubnetId,IP:PrivateIpAddress}'
# STEP 2: Get RDS endpoint and check status
aws rds describe-db-instances \
--db-instance-identifier prod-postgres \
--query 'DBInstances[0].{Endpoint:Endpoint.Address,Port:Endpoint.Port,Status:DBInstanceStatus,VPC:DBSubnetGroup.VpcId}'
# STEP 3: Test connectivity from EC2 (via SSM)
aws ssm start-session --target i-app-1234
# On EC2 instance:
# Test DNS resolution
nslookup prod-postgres.abcdef.us-east-1.rds.amazonaws.com
# Test TCP connectivity (not just ping — RDS blocks ICMP)
nc -zv prod-postgres.abcdef.us-east-1.rds.amazonaws.com 5432
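# Expected: Connection to ... 5432 port [tcp/postgresql] succeeded!
# Timeout → packets are being dropped (security group, NACL, or routing)
# Connection refused → the host was reached but nothing accepted the connection on that port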
# STEP 4: Check Security Groups
# EC2 security group — does it have outbound to port 5432?
aws ec2 describe-security-groups \
--group-ids sg-app-0abc123 \
--query 'SecurityGroups[0].IpPermissionsEgress'
# Must allow: port 5432 to RDS SG or subnet CIDR
# RDS security group — does it allow from EC2 SG?
aws ec2 describe-security-groups \
--group-ids sg-rds-0def456 \
--query 'SecurityGroups[0].IpPermissions'
# Look for: port 5432 from EC2 SG ID or EC2 CIDR
# STEP 5: Check route tables for both subnets
# App subnet route table
APP_SUBNET=subnet-0abc123
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=${APP_SUBNET}" \
--query 'RouteTables[0].Routes'
# DB subnet route table
DB_SUBNET=subnet-0def456
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=${DB_SUBNET}" \
--query 'RouteTables[0].Routes'
# Both should have local route: 10.0.0.0/16 → local ✅
# (traffic within VPC doesn't need any special route)
# STEP 6: Check NACLs on both subnets
aws ec2 describe-network-acls \
--filters "Name=association.subnet-id,Values=${APP_SUBNET}"
aws ec2 describe-network-acls \
--filters "Name=association.subnet-id,Values=${DB_SUBNET}"
# DB subnet NACL must:
# INBOUND: allow port 5432 from app subnet CIDR
# OUTBOUND: allow ephemeral ports 1024-65535 to app subnet CIDR
# STEP 7: Use VPC Reachability Analyzer (automates all above checks)
aws ec2 create-network-insights-path \
--source i-app-1234 \
--destination <rds-eni-id> \
--protocol tcp \
--destination-port 5432
aws ec2 start-network-insights-analysis \
--network-insights-path-id nip-0abc123
# Wait and get results
aws ec2 describe-network-insights-analyses \
--network-insights-analysis-ids nia-0abc123 \
--query 'NetworkInsightsAnalyses[0].{Status:Status,Reachable:NetworkPathFound,Explanations:Explanations}'
# Returns: exact reason if unreachable (e.g., "SG rule missing")
# STEP 8: Common fixes
# Fix A: Add inbound rule to RDS SG
aws ec2 authorize-security-group-ingress \
--group-id sg-rds-0def456 \
--protocol tcp \
--port 5432 \
--source-group sg-app-0abc123
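# Fix B: Add outbound rule to App SG (destination = RDS SG)
aws ec2 authorize-security-group-egress \
--group-id sg-app-0abc123 \
--ip-permissions 'IpProtocol=tcp,FromPort=5432,ToPort=5432,UserIdGroupPairs=[{GroupId=sg-rds-0def456}]'
# Fix C: Fix the DB subnet NACL (stateless, so both directions need a rule)
# Inbound: PostgreSQL from the app subnet
aws ec2 create-network-acl-entry \
--network-acl-id acl-0abc123 \
--ingress \
--rule-number 200 \
--protocol tcp \
--rule-action allow \
--cidr-block 10.0.10.0/24 \
--port-range From=5432,To=5432
# Outbound: return traffic to the app subnet on ephemeral ports
aws ec2 create-network-acl-entry \
--network-acl-id acl-0abc123 \
--egress \
--rule-number 200 \
--protocol tcp \
--rule-action allow \
--cidr-block 10.0.10.0/24 \
--port-range From=1024,To=65535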
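NACLs are evaluated in rule-number order, the first matching rule wins, and anything unmatched hits the implicit deny. Because they are stateless, the reply to an allowed request is evaluated separately against the outbound rules. The sketch below models that behaviour for the DB subnet. The rule dictionaries and `nacl_allows` are a simplified illustration, not the AWS data model.

```python
import ipaddress


def nacl_allows(rules, peer_ip, port):
    """Evaluate NACL rules in ascending rule-number order; first match wins."""
    ip = ipaddress.ip_address(peer_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if ip in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"] == "allow"
    return False  # implicit final deny


# DB subnet NACL: inbound PostgreSQL from the app subnet is allowed
inbound = [{"number": 200, "cidr": "10.0.10.0/24", "ports": (5432, 5432), "action": "allow"}]
outbound = []  # no outbound rule yet, so replies are dropped

print("request in:", nacl_allows(inbound, "10.0.10.50", 5432))
print("reply out: ", nacl_allows(outbound, "10.0.10.50", 54321))

# Adding the ephemeral-port rule lets the reply leave the DB subnet
outbound.append({"number": 200, "cidr": "10.0.10.0/24", "ports": (1024, 65535), "action": "allow"})
print("reply out after fix:", nacl_allows(outbound, "10.0.10.50", 54321))
```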
Answer:
# Flow log REJECT entries look like:
# 2 123456789012 eni-0abc123 10.0.1.50 10.0.20.100 54321 5432 6 5 320 ... REJECT OK
# Meaning: TCP from 10.0.1.50:54321 to 10.0.20.100:5432 was REJECTED
# Step 1: Query rejected traffic in CloudWatch Logs Insights
# 10.0.20.100 is the RDS instance's private IP
aws logs start-query \
--log-group-name "/aws/vpc/flow-logs/vpc-0abc123" \
--start-time $(date -d '24 hours ago' +%s) \
--end-time $(date +%s) \
--query-string '
fields @timestamp, srcaddr, srcport, dstaddr, dstport, protocol, bytes, action
| filter action = "REJECT"
| filter dstaddr = "10.0.20.100"
| sort @timestamp desc
| limit 50
'
# Step 2: Identify the ENI from rejected traffic
# Flow logs include interface-id field
# Map ENI to resource:
aws ec2 describe-network-interfaces \
--filters "Name=private-ip-address,Values=10.0.20.100" \
--query 'NetworkInterfaces[0].{ENI:NetworkInterfaceId,SG:Groups,Instance:Attachment.InstanceId}'
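# Step 3: Decide whether a security group or a NACL is rejecting
# REJECT in flow logs = an SG or NACL denied the traffic
# SGs are stateful (return traffic is allowed automatically); NACLs are stateless
# So a REJECT whose destination port is ephemeral usually means a NACL is blocking return traffic
# Ephemeral ports: 32768-60999 (Linux default) / 49152-65535 (Windows); NACLs commonly allow 1024-65535 to cover both
# A REJECT on the service port itself (5432) points at SG inbound rules or the NACL inbound rules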
# Step 4: Systematic SG audit
# List all SGs attached to the RDS ENI
aws ec2 describe-network-interfaces \
--filters "Name=private-ip-address,Values=10.0.20.100" \
--query 'NetworkInterfaces[0].Groups'
# For each SG, check inbound rules
for SG_ID in sg-0abc123 sg-0def456; do
echo "=== $SG_ID ==="
aws ec2 describe-security-groups \
--group-ids $SG_ID \
--query 'SecurityGroups[0].IpPermissions[?FromPort<=`5432` && ToPort>=`5432`]'
done
# Step 5: Review AWS Security Hub findings
# Look for controls that flag misconfigured or overly permissive security group rules
aws securityhub get-findings \
--filters '{"Type": [{"Value": "Software and Configuration Checks/Industry and Regulatory Standards", "Comparison": "PREFIX"}]}'
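The reasoning in Step 3 can be captured as a small triage helper: look at which side of the flow carries the service port and which carries an ephemeral port. This is only a heuristic sketch; `likely_cause` and its messages are hypothetical, and the record is assumed to be a dict with `action`, `srcport`, and `dstport` keys.

```python
EPHEMERAL_PORTS = range(1024, 65536)  # the broad range NACLs usually allow


def likely_cause(record, service_port=5432):
    """Suggest where to look first for a rejected flow (heuristic only)."""
    if record["action"] != "REJECT":
        return "not rejected"
    # Reply leaving the service: source is the service port, destination is ephemeral
    if record["srcport"] == service_port and record["dstport"] in EPHEMERAL_PORTS:
        return "return traffic rejected: check NACL rules for ephemeral ports"
    # Request arriving at the service port
    if record["dstport"] == service_port:
        return "request rejected: check SG inbound rules and NACL inbound rules"
    return "other rejected flow: check both SGs and NACLs"


request = {"action": "REJECT", "srcport": 54321, "dstport": 5432}
reply = {"action": "REJECT", "srcport": 5432, "dstport": 54321}
print(likely_cause(request))
print(likely_cause(reply))
```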
Ans:
PrivateLink vs VPC Peering:
VPC Peering:
✅ Full network access between VPCs
❌ Risk: can access unintended resources
❌ Overlapping CIDRs not supported
❌ Transitive routing not supported
PrivateLink (VPC Endpoint Service):
✅ Expose ONLY a specific service/port
✅ Works with overlapping CIDRs (no routing change)
✅ Consumer VPC has no visibility into provider VPC
✅ Scales to thousands of consumers
✅ Cross-account, cross-region
Creating a PrivateLink service (provider side):
# Provider VPC: Service that others want to consume
# Example: Payment Service API that 10 teams need to access
# Step 1: NLB in provider VPC (endpoint services need an NLB or Gateway Load Balancer; an ALB can only sit behind the NLB as a target)
resource "aws_lb" "payment_nlb" {
name = "payment-service-nlb"
internal = true # internal NLB for PrivateLink
load_balancer_type = "network"
subnets = aws_subnet.provider_private[*].id
}
resource "aws_lb_listener" "payment" {
load_balancer_arn = aws_lb.payment_nlb.arn
port = 443
protocol = "TCP" # TCP passthrough; a "TLS" listener would also need certificate_arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.payment.arn
}
}
# Step 2: Create VPC Endpoint Service
resource "aws_vpc_endpoint_service" "payment" {
acceptance_required = true # manually approve each consumer
network_load_balancer_arns = [aws_lb.payment_nlb.arn]
# Allow specific AWS accounts to connect
allowed_principals = [
"arn:aws:iam::111122223333:root", # team-A account
"arn:aws:iam::444455556666:root", # team-B account
]
tags = {
Name = "payment-service-endpoint"
}
}
output "endpoint_service_name" {
value = aws_vpc_endpoint_service.payment.service_name
# e.g., com.amazonaws.vpce.us-east-1.vpce-svc-0abc123
}
Consuming the PrivateLink service (consumer side):
# Consumer VPC: Any team that needs payment service
resource "aws_vpc_endpoint" "payment" {
vpc_id = aws_vpc.consumer.id
service_name = "com.amazonaws.vpce.us-east-1.vpce-svc-0abc123"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.consumer_private[*].id
security_group_ids = [aws_security_group.payment_endpoint.id]
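private_dns_enabled = false # true only works once the provider has configured and verified a private DNS name on the endpoint service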
tags = { Name = "payment-service-endpoint" }
}
# DNS record for the endpoint (optional, for friendly names)
resource "aws_route53_record" "payment" {
zone_id = aws_route53_zone.internal.zone_id
name = "payment-api.internal.mycompany.com"
type = "A"
alias {
name = aws_vpc_endpoint.payment.dns_entry[0].dns_name
zone_id = aws_vpc_endpoint.payment.dns_entry[0].hosted_zone_id
evaluate_target_health = true
}
}
Ans:
# Enable IPv6 on existing VPC
aws ec2 associate-vpc-cidr-block \
--vpc-id vpc-0abc123 \
--amazon-provided-ipv6-cidr-block
# AWS assigns a /56 block like: 2600:1f18:1234:5600::/56
# Assign /64 to each subnet (required for IPv6 subnets)
aws ec2 associate-subnet-cidr-block \
--subnet-id subnet-0abc123 \
--ipv6-cidr-block 2600:1f18:1234:5600::/64
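# Key difference: Amazon-provided IPv6 addresses are globally unique and there is no NAT,
# so a subnet is reachable from the internet whenever its route table sends ::/0 to the IGW
# Use an Egress-Only Internet Gateway for outbound-only IPv6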
resource "aws_egress_only_internet_gateway" "ipv6" {
vpc_id = aws_vpc.main.id
tags = { Name = "prod-eigw" }
}
# Route IPv6 outbound through EIGW (like NAT for IPv6)
resource "aws_route" "ipv6_egress" {
route_table_id = aws_route_table.private.id
destination_ipv6_cidr_block = "::/0"
egress_only_gateway_id = aws_egress_only_internet_gateway.ipv6.id
}
When to enable IPv6:
✅ EKS clusters (Kubernetes assigns pod IPs from VPC — exhausts IPv4 quickly)
✅ IoT applications (billions of devices)
✅ CDN/edge applications (lower latency with native IPv6)
✅ Cost optimization (no NAT Gateway needed for IPv6 outbound)
⚠️ IPv6 considerations:
- Amazon-provided IPv6 addresses are globally routable; there is no NAT layer to hide behind
- Use security groups and NACLs to control access (not obscurity)
- Not all AWS services support IPv6 (check compatibility)
- On-prem must also support IPv6 for hybrid connectivity
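Each subnet takes one /64 out of the VPC's /56, which leaves room for 256 subnets. A short sketch with the standard `ipaddress` module shows how the /64 blocks fall out of the example /56 used above. `carve_subnets` is a hypothetical helper name.

```python
import ipaddress


def carve_subnets(vpc_cidr, count):
    """Return the first `count` /64 blocks inside an IPv6 VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=64)
    return [str(next(subnets)) for _ in range(count)]


vpc_cidr = "2600:1f18:1234:5600::/56"
total = sum(1 for _ in ipaddress.ip_network(vpc_cidr).subnets(new_prefix=64))
print("available /64 subnets:", total)
for cidr in carve_subnets(vpc_cidr, 3):
    print(cidr)
```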
Ans:
The Problem:
VPC CIDR: 10.0.0.0/16 = 65,536 IPs
EKS Node: each pod gets a VPC IP (aws-node CNI)
100 nodes × 30 pods = 3,000 pod IPs
EKS node ENIs and warm IP pools also consume IPs
→ A /24 subnet has only 251 usable IPs, so individual subnets run out long before the /16 does!
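A quick back-of-the-envelope calculation makes the pressure visible. AWS reserves 5 addresses in every subnet, so the sketch below subtracts those and estimates how many subnets of a given size the 3,000 pod IPs from the example would fill. The function names are hypothetical, and node IPs and warm pools are ignored for simplicity.

```python
import math

AWS_RESERVED_PER_SUBNET = 5  # network, router, DNS, future use, broadcast


def usable_ips(prefix_len):
    """Usable IPv4 addresses in a subnet of the given prefix length."""
    return 2 ** (32 - prefix_len) - AWS_RESERVED_PER_SUBNET


def subnets_needed(pod_ips, prefix_len):
    """How many subnets of this size are needed to hold pod_ips addresses."""
    return math.ceil(pod_ips / usable_ips(prefix_len))


pod_ips = 100 * 30  # 100 nodes x 30 pods, as in the example above
print("usable IPs in a /24:", usable_ips(24))
print("/24 subnets needed for pods alone:", subnets_needed(pod_ips, 24))
print("/18 subnets needed for pods alone:", subnets_needed(pod_ips, 18))
```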
Solution 1: Add secondary CIDR to existing VPC
# Add secondary CIDR block (no downtime!)
aws ec2 associate-vpc-cidr-block \
--vpc-id vpc-0abc123 \
--cidr-block 100.64.0.0/16 # RFC 6598 shared address space, great for pods
# Create new subnets from secondary CIDR
aws ec2 create-subnet \
--vpc-id vpc-0abc123 \
--cidr-block 100.64.0.0/18 \
--availability-zone us-east-1a
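# Turn on custom networking so pod ENIs are created in the new subnets
# ENI_CONFIG_LABEL_DEF lets the CNI pick the ENIConfig named after each node's AZ
aws eks update-addon --cluster-name prod-cluster \
--addon-name vpc-cni \
--configuration-values '{"env":{"AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG":"true","ENI_CONFIG_LABEL_DEF":"topology.kubernetes.io/zone"}}'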
Solution 2: VPC CNI Custom Networking (pods use different CIDR)
# ENIConfig for each AZ — pods use secondary CIDR
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
name: us-east-1a
spec:
securityGroups:
- sg-0abc123 # pod security group
subnet: subnet-100-64-0-0 # secondary CIDR subnet in AZ-a
---
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
name: us-east-1b
spec:
securityGroups:
- sg-0abc123
subnet: subnet-100-64-64-0 # secondary CIDR subnet in AZ-b
Solution 3: Use prefix delegation (most IPs per node)
# Each node ENI gets a /28 prefix (16 IPs) instead of single IPs
# Dramatically increases pods per node
aws eks update-addon --cluster-name prod-cluster \
--addon-name vpc-cni \
--configuration-values '{
"env": {
"ENABLE_PREFIX_DELEGATION": "true",
"WARM_PREFIX_TARGET": "1"
}
}'
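# With prefix delegation (Nitro-based instances only):
# m5.xlarge: 4 ENIs × 14 prefix slots × 16 IPs = 896 pod IPs
# vs without: 4 ENIs × 14 secondary IPs = 56 pod IPs
# 16x more addresses, although max-pods per node is usually still capped (110 is the common recommendation)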
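The per-node arithmetic can be reproduced with a few lines of Python. The sketch assumes the published m5.xlarge limits of 4 ENIs and 15 IPv4 addresses per ENI (check the current EC2 documentation for your instance type); one address per ENI is the ENI's own primary IP and is not handed to pods. The function names are hypothetical.

```python
def pod_ips_without_prefixes(enis, ips_per_eni):
    """Secondary IPs available to pods when each slot holds a single IP."""
    return enis * (ips_per_eni - 1)


def pod_ips_with_prefixes(enis, ips_per_eni, prefix_size=16):
    """Pod IPs when each slot holds a /28 prefix (16 addresses)."""
    return enis * (ips_per_eni - 1) * prefix_size


def default_max_pods(enis, ips_per_eni):
    """Classic max-pods formula: secondary IPs plus 2 host-network pods."""
    return enis * (ips_per_eni - 1) + 2


# Assumed limits for m5.xlarge: 4 ENIs, 15 IPv4 addresses per ENI
enis, ips_per_eni = 4, 15
print("without prefix delegation:", pod_ips_without_prefixes(enis, ips_per_eni))
print("with prefix delegation:   ", pod_ips_with_prefixes(enis, ips_per_eni))
print("default max-pods:         ", default_max_pods(enis, ips_per_eni))
```

The address pool grows 16x, but the number of pods a node can actually run is still bounded by its max-pods setting and by CPU and memory.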
💡 Quick Reference: VPC Cheat Sheet
# ─── VPC INSPECTION ──────────────────────────────────────────────────────────
# List all VPCs with CIDRs
aws ec2 describe-vpcs --query 'Vpcs[*].{ID:VpcId,CIDR:CidrBlock,Name:Tags[?Key==`Name`].Value|[0]}'
# List subnets in a VPC
aws ec2 describe-subnets \
--filters "Name=vpc-id,Values=vpc-0abc123" \
--query 'Subnets[*].{ID:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock,Public:MapPublicIpOnLaunch}'
# Show route tables for a VPC
aws ec2 describe-route-tables \
--filters "Name=vpc-id,Values=vpc-0abc123" \
--query 'RouteTables[*].{ID:RouteTableId,Routes:Routes}'
# ─── CONNECTIVITY TESTING ────────────────────────────────────────────────────
# Test TCP port connectivity
nc -zv 10.0.20.100 5432 # netcat
telnet 10.0.20.100 5432 # telnet
# Test with curl (HTTP)
curl -v --connect-timeout 5 http://10.0.10.50:8080/health
# Trace network path
traceroute -T -p 443 google.com # TCP traceroute (bypasses ICMP blocks)
mtr --report 8.8.8.8 # traceroute plus per-hop loss/latency report
# ─── SECURITY GROUP MANAGEMENT ───────────────────────────────────────────────
# List all SGs with their rules
aws ec2 describe-security-groups \
--query 'SecurityGroups[*].{ID:GroupId,Name:GroupName,Inbound:IpPermissions}'
# Find SGs with 0.0.0.0/0 inbound
aws ec2 describe-security-groups \
--query 'SecurityGroups[?IpPermissions[?IpRanges[?CidrIp==`0.0.0.0/0`]]].{ID:GroupId,Name:GroupName}'
# ─── NAT GATEWAY ─────────────────────────────────────────────────────────────
# Check NAT Gateway status and IP
aws ec2 describe-nat-gateways \
--query 'NatGateways[*].{ID:NatGatewayId,State:State,PublicIP:NatGatewayAddresses[0].PublicIp,Subnet:SubnetId}'
# ─── VPC ENDPOINTS ───────────────────────────────────────────────────────────
# List all VPC endpoints
aws ec2 describe-vpc-endpoints \
--query 'VpcEndpoints[*].{ID:VpcEndpointId,Service:ServiceName,Type:VpcEndpointType,State:State}'
# ─── FLOW LOGS ───────────────────────────────────────────────────────────────
# Check flow log status
aws ec2 describe-flow-logs \
--query 'FlowLogs[*].{ID:FlowLogId,Resource:ResourceId,Status:FlowLogStatus,Destination:LogDestination}'
# ─── REACHABILITY ANALYZER ───────────────────────────────────────────────────
# Automated connectivity troubleshooting
aws ec2 create-network-insights-path \
--source i-source \
--destination i-destination \
--protocol tcp \
--destination-port 443
aws ec2 start-network-insights-analysis \
--network-insights-path-id nip-0abc123
🔧 Terraform VPC Module Best Practices
# Complete production VPC module usage
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 5.0"
name = "prod-vpc"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
public_subnets = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
database_subnets = ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"]
# HA NAT Gateway — one per AZ
enable_nat_gateway = true
single_nat_gateway = false # ❌ don't use single NAT in prod
one_nat_gateway_per_az = true # ✅ HA: one per AZ
# DNS
enable_dns_hostnames = true
enable_dns_support = true
# VPN
enable_vpn_gateway = true
# Tags for EKS auto-discovery
public_subnet_tags = {
"kubernetes.io/role/elb" = "1"
}
private_subnet_tags = {
"kubernetes.io/role/internal-elb" = "1"
}
# Flow Logs
enable_flow_log = true
create_flow_log_cloudwatch_log_group = true
create_flow_log_cloudwatch_iam_role = true
flow_log_max_aggregation_interval = 60
tags = {
Environment = "production"
ManagedBy = "terraform"
Owner = "platform-team"
}
}