02-Kubernetes in the Real World - How Companies Actually Use It

See which companies run Kubernetes in production, what they use it for, and which concrete problems it solves for them.

🌍 Who Uses Kubernetes?

Kubernetes is not a startup toy; it powers some of the most critical infrastructure on the planet.

| Company | Scale | Use Case |
| --- | --- | --- |
| Google | Billions of containers/week | Search, Gmail, YouTube |
| Netflix | Millions of streams simultaneously | Global streaming platform |
| Airbnb | 1000s of microservices | Booking & marketplace |
| Spotify | 300+ microservices | Music streaming |
| Uber | Millions of rides/day | Ride-hailing, delivery |
| Pinterest | 250M+ monthly users | Image discovery |
| The New York Times | Breaking news spikes | Digital publishing |
| Zalando | Peak fashion sale traffic | European ecommerce |
| Adidas | World Cup / Black Friday | Global retail |
| Goldman Sachs | Regulated financial workloads | Finance & trading |

According to the CNCF Annual Survey 2021, 96% of organizations were either using or evaluating Kubernetes.


🏒 Real Company Case Studies


🔵 Google - The Birthplace

Context: Google built Kubernetes based on lessons from their internal system Borg, which ran billions of containers per week across global data centers.

What they use it for:

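  • Offering Kubernetes to customers through Google Kubernetes Engine (GKE); flagship products such as Search, Gmail, Maps, and YouTube run on Borg, the internal predecessor whose design Kubernetes inherits
  • Managing millions of jobs across thousands of nodes
  • Scheduling batch workloads alongside long-running services on the same cluster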

Real problems solved:

  • Resource efficiency: Borg/Kubernetes-style bin packing reduced idle server capacity significantly across their data centers
  • Reliability: Workloads automatically reschedule when hardware fails, which is critical at Google's scale, where hardware failure is a daily occurrence, not an exception
  • Developer velocity: Thousands of engineers can deploy independently without stepping on each other
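
Bin packing depends on every container declaring how much CPU and memory it needs, because the scheduler places pods according to their requests. A minimal sketch, assuming a hypothetical `web-frontend` service and a placeholder image; the numbers are illustrative and not Google's configuration:

```yaml
# Illustrative only: name, image, and resource figures are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: registry.example.com/web-frontend:1.0.0
          resources:
            requests:          # capacity the scheduler reserves when picking a node
              cpu: 250m
              memory: 256Mi
            limits:            # ceiling enforced while the container runs
              cpu: 500m
              memory: 512Mi
```

If a node fails, the Deployment's controller recreates the lost replicas and the scheduler places them on nodes that still have enough unreserved capacity.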

Key insight:

"At Google scale, a 1% improvement in resource utilization saves millions of dollars. Kubernetes bin packing makes that possible."


🔴 Netflix - Chaos at Scale

Context: Netflix serves 230+ million subscribers across 190 countries. Their architecture is one of the most studied in the industry: hundreds of microservices, globally distributed, handling massive traffic spikes (Friday night movie releases, hit show drops).

What they use it for:

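  • Running containerized services on AWS through Titus, Netflix's in-house container management platform, which started on Apache Mesos and has moved toward Kubernetes for scheduling
  • Executing thousands of batch jobs (video encoding, recommendation model training)
  • Running Chaos Engineering experiments (their famous Chaos Monkey deliberately terminates running instances to test resilience)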

Real problems solved:

  • Resilience: When a pod dies (or is intentionally killed), Kubernetes replaces it automatically, so fault tolerance becomes a feature, not a risk
  • Multi-region deployments: Kubernetes clusters run in multiple AWS regions, so if us-east-1 has issues, traffic shifts to eu-west-1
  • Canary deployments: New algorithm versions roll out to 1% of users, metrics are checked, then gradually expanded to 100%
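
The canary idea can be sketched with plain Kubernetes objects: two Deployments share one Service, and the replica ratio decides how much traffic reaches the new version. This is a generic illustration, not Netflix's actual setup; the `recommender` name, image tags, and replica counts are assumptions.

```yaml
# Illustrative canary: roughly 1 in 100 pods runs the new version.
apiVersion: v1
kind: Service
metadata:
  name: recommender
spec:
  selector:
    app: recommender            # matches both tracks, so traffic splits by pod count
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recommender-stable
spec:
  replicas: 99
  selector:
    matchLabels: {app: recommender, track: stable}
  template:
    metadata:
      labels: {app: recommender, track: stable}
    spec:
      containers:
        - name: recommender
          image: registry.example.com/recommender:1.4.0
          ports:
            - containerPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recommender-canary
spec:
  replicas: 1                   # raise step by step once metrics look healthy
  selector:
    matchLabels: {app: recommender, track: canary}
  template:
    metadata:
      labels: {app: recommender, track: canary}
    spec:
      containers:
        - name: recommender
          image: registry.example.com/recommender:1.5.0
          ports:
            - containerPort: 8080
```

Splitting by replica count is coarse. A service mesh or a delivery tool can shift traffic by percentage instead, which avoids running 99 stable pods just to reach a 1% split.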

Scale numbers:

  • Thousands of microservices
  • Millions of container instances per day
  • Peak: ~15% of all internet traffic in North America

Key tooling: Netflix built and open-sourced Spinnaker, a multi-cloud continuous delivery platform that can deploy to Kubernetes among other targets.


🟠 Airbnb β€” Migrating a Monolith

Context: Airbnb started as a classic Rails monolith. By 2018, it had grown into an unmaintainable beast that took 12+ minutes to boot. They embarked on one of the most documented monolith-to-microservices migrations in the industry.

What they use it for:

  • Hosting 1000+ microservices on Kubernetes (up from 1 monolith)
  • Running their data infrastructure (Spark jobs, Airflow pipelines) on Kubernetes
  • Developer tooling and internal platforms

Real problems solved:

  • Monolith decomposition: Each extracted service runs as an independent Kubernetes deployment, so teams own their services without interfering with others
  • Deployment independence: The 12-minute monolith boot became sub-second per-service deployments
  • Scaling individual bottlenecks: Search could scale independently from booking, which could scale independently from payments, instead of scaling the entire monolith

Airbnb's internal platform, a "Kubernetes-native developer experience":

They built internal tooling on top of Kubernetes (described in their public talks as kube-gen) so that engineers don't hand-write full Kubernetes YAML: they maintain a short per-service definition and the Kubernetes resources are generated automatically.

Key lesson:

"Microservices without Kubernetes is chaos. Kubernetes gave us the foundation to actually make the migration work."


🟒 Spotify β€” Developer Productivity at Scale

Context: Spotify runs 300+ microservices with 2000+ engineers. Their core challenge isn't scale; it's developer productivity. How do 2000 engineers deploy independently without chaos?

What they use it for:

  • Running all backend services on Google GKE
  • Their internal developer portal Backstage (open-sourced, now a CNCF project) sits in front of Kubernetes as the place where engineers find, own, and operate services
  • ML model training and serving for music recommendations

Real problems solved:

  • Service ownership at scale: Every microservice in Kubernetes has a clear owner registered in Backstage, so there are no orphaned services
  • Standardized deployments: All 300+ services deploy via the same pipeline, so engineers don't reinvent deployment logic per team
  • Onboarding: New engineers can deploy to production on day one because the platform abstracts Kubernetes complexity
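
Ownership in Backstage is recorded in a small descriptor file kept next to the service's code. A minimal sketch of such a `catalog-info.yaml`, assuming a hypothetical `playlist-service` owned by a hypothetical `team-playlists`; the values are illustrative and not taken from Spotify:

```yaml
# catalog-info.yaml: illustrative values; the component and team names are hypothetical.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: playlist-service
  description: Serves playlist reads and writes
  annotations:
    backstage.io/kubernetes-id: playlist-service   # lets the Kubernetes plugin find matching workloads
spec:
  type: service
  lifecycle: production
  owner: team-playlists                            # the team accountable for this service
```

Because every service carries an `owner` field, the catalog can answer "who runs this?" for any workload without anyone having to search chat history.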

The Backstage effect:

Backstage is now used by thousands of companies worldwide (Expedia, American Airlines, Netflix), all building on the idea of an internal developer portal that integrates with Kubernetes.

Scale numbers:

  • 2000+ engineers
  • 300+ microservices
  • Millions of daily active users globally

🟑 Uber β€” Multi-Tenant Microservices

Context: Uber operates in 70+ countries, handling millions of rides and deliveries daily. Their architecture is one of the most complex in the world: real-time location tracking, surge pricing, matching algorithms, and payments, all running simultaneously.

What they use it for:

  • Running 4000+ microservices across multiple Kubernetes clusters
  • Multi-tenant clusters where dozens of teams share infrastructure
  • Real-time data processing (matching riders to drivers in milliseconds)

Real problems solved:

  • Multi-tenancy: Uber runs massive shared clusters with Namespace-based isolation and resource quotas per team, preventing any single team from starving others of resources
  • Geographic distribution: Kubernetes clusters run in every major region to minimize latency for drivers and riders
  • Stateful workloads: Running databases (Cassandra, MySQL) on Kubernetes using StatefulSets for persistence

Uber's unique challenge, the "noisy neighbor" problem:

When one team's service misbehaves and consumes all CPU on a shared node, other teams suffer. Uber solved this with aggressive LimitRange (per-container defaults and caps) and ResourceQuota (per-namespace totals) enforcement at the Namespace level.

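```yaml
# Caps the total resources the team-payments namespace may request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "50"       # sum of CPU requests across all pods, in cores
    requests.memory: 100Gi   # sum of memory requests across all pods
    pods: "200"              # maximum number of pods in the namespace
```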
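
A quota on `requests.cpu` and `requests.memory` means pods that omit those requests can be rejected at creation time, so a ResourceQuota is usually paired with a LimitRange that fills in defaults and caps individual containers. A minimal sketch for the same namespace; the object name and all values are assumptions chosen for illustration:

```yaml
# Illustrative companion to the quota above: defaults and per-container caps.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-payments-defaults
  namespace: team-payments
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container declares no requests
        cpu: 250m
        memory: 256Mi
      default:               # applied when a container declares no limits
        cpu: "1"
        memory: 1Gi
      max:                   # no single container may ask for more than this
        cpu: "4"
        memory: 8Gi
```

The quota bounds the team as a whole, while the LimitRange bounds each container, which is what keeps one runaway service from monopolizing a shared node.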

🟣 Pinterest β€” Cost Savings at Scale

Context: Pinterest serves 250 million+ monthly active users with a complex image-heavy architecture requiring significant compute for image processing, ML recommendations, and search indexing.

What they use it for:

  • All production backend workloads on Kubernetes
  • Large-scale batch processing (image indexing, ML training)
  • Autoscaling during viral content events (a popular pin can cause sudden massive traffic)

Real problems solved:

  • Cost reduction: After migrating to Kubernetes, Pinterest reported ~$1.8 million in annual savings from improved resource utilization
  • Batch + online workloads on same cluster: ML training jobs run alongside live serving, with Kubernetes scheduling them to use spare capacity
  • Viral traffic spikes: A single viral image can cause a 10x traffic spike in seconds, and the Horizontal Pod Autoscaler (HPA) adds pods automatically in response

Key technique: Cluster Autoscaler + Spot Instances

Pinterest uses AWS Spot Instances (up to 90% cheaper than on-demand) for batch workloads. The Cluster Autoscaler provisions Spot nodes when pods cannot be scheduled, and Kubernetes reschedules pods onto other nodes when Spot instances are reclaimed.
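
One common way to keep batch work on Spot capacity is to label and taint the Spot nodes, then let only tolerant Jobs land there. The sketch below assumes a hypothetical `node-lifecycle=spot` label and taint applied by the cluster operator; the label key, job name, and image are illustrative and not Pinterest's configuration.

```yaml
# Illustrative batch Job pinned to Spot nodes via an assumed label and taint.
apiVersion: batch/v1
kind: Job
metadata:
  name: image-index-batch
spec:
  backoffLimit: 6                  # retry budget if pods fail
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        node-lifecycle: spot       # hypothetical label on Spot nodes
      tolerations:
        - key: node-lifecycle      # hypothetical taint that keeps other pods off Spot nodes
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: indexer
          image: registry.example.com/image-indexer:1.0.0
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```

When a Spot node is reclaimed, the Job controller creates a replacement pod, so the work resumes on whatever Spot capacity is available next. Jobs run this way should be safe to restart from the beginning or from a checkpoint.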

Savings breakdown:

  • Infrastructure cost reduced significantly through bin packing
  • Engineering time saved through fewer manual operations
  • Spot instance usage for batch jobs

📰 The New York Times - Media at Scale

Context: The NYT serves millions of readers with wildly unpredictable traffic: a major breaking news event can cause a 10–50x traffic spike in minutes. Their old infrastructure couldn't handle this elastically.

What they use it for:

  • Serving nytimes.com and all digital properties
  • Content delivery, article rendering, personalization
  • Running Google GKE in production

Real problems solved:

  • Breaking news spikes: When a major event breaks, HPA spins up pods in under a minute to handle the surge; previously they'd scramble manually or the site would slow down
  • CI/CD pipeline: Kubernetes enables dozens of deployments per day across their engineering teams
  • Cost efficiency: Auto-scaling down during low-traffic hours (overnight) saves significant cloud spend
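
The spike handling and overnight scale-down described above map onto a HorizontalPodAutoscaler. A minimal sketch, assuming a hypothetical `article-renderer` Deployment; the replica bounds and CPU target are illustrative, not the NYT's settings:

```yaml
# Illustrative HPA: scale out immediately on load, scale in cautiously.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: article-renderer
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: article-renderer
  minReplicas: 5
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60        # target average CPU, as a percentage of requests
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react to a spike without waiting
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before removing pods
```

The controller computes the desired replica count as ceil(currentReplicas × currentMetric / targetMetric). With 10 replicas running at 90% CPU against the 60% target, that gives ceil(10 × 90 / 60) = 15 replicas.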

Real quote from NYT Engineering:

"We went from manual scaling that lagged behind traffic spikes by 30+ minutes to Kubernetes autoscaling that responds in under 60 seconds."

Traffic pattern visualization:

Traffic
  │
  │          ████  Breaking News Spike
  │         █████
  │        ██████
  │   ██  ███████  ████
  │████████████████████████  → Kubernetes scales out in under a minute
  └─────────────────────────── Time

🔵 Zalando - European Ecommerce

Context: Zalando is Europe's largest online fashion platform, operating across 25 countries. They process millions of orders and have extreme traffic peaks during sales events.

What they use it for:

  • Running 200+ teams on shared Kubernetes clusters
  • Building and running Kubernetes Operators for their internal platform; they have contributed heavily to the Kubernetes ecosystem
  • Peak sale traffic handling (Black Friday, Cyber Monday)

Real problems solved:

  • Multi-team platform: 200 engineering teams deploy to Kubernetes through self-service without involving the platform team, so the platform itself scales with the organization
  • Open-source tooling: Zalando built and open-sourced Kubernetes components (the Postgres Operator, the Skipper ingress router) used by the broader community
  • Compliance: Running workloads in EU regions with strict GDPR data residency using Kubernetes node affinity rules
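
Region pinning with node affinity can be sketched as follows. The example uses the well-known `topology.kubernetes.io/region` node label; the service name, image, and region list are assumptions for illustration rather than Zalando's configuration.

```yaml
# Illustrative: pods may only be scheduled onto nodes in the listed EU regions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-profile
spec:
  replicas: 3
  selector:
    matchLabels:
      app: customer-profile
  template:
    metadata:
      labels:
        app: customer-profile
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement, not a preference
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/region
                    operator: In
                    values:
                      - eu-central-1
                      - eu-west-1
      containers:
        - name: api
          image: registry.example.com/customer-profile:1.0.0
```

Node affinity only controls where pods run. Data residency also depends on where volumes, backups, and downstream services live, so this rule is one piece of a compliance setup rather than the whole of it.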

Open-source contributions from Zalando:

  • Postgres Operator: manages PostgreSQL clusters on Kubernetes
  • Skipper: HTTP router and reverse proxy that can act as a Kubernetes Ingress controller
  • Nakadi: distributed event broker that exposes a REST API on top of Kafka-like queues
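
With the Postgres Operator installed, a database cluster is requested declaratively through a custom resource. The sketch below follows the shape of the project's minimal example manifest; the team id, user, database name, and Postgres version are assumptions, and the supported versions depend on the operator release in use.

```yaml
# Illustrative custom resource for the Zalando Postgres Operator.
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-orders-db        # by convention the name is prefixed with the teamId
spec:
  teamId: "acid"
  numberOfInstances: 2        # one primary plus one replica
  volume:
    size: 1Gi
  users:
    orders_owner:             # hypothetical role created by the operator
      - createdb
  databases:
    orders: orders_owner      # database name mapped to its owning role
  postgresql:
    version: "15"             # assumed; check which versions your operator release supports
```

The operator watches for these resources and creates the StatefulSet, Services, and credentials needed to run the cluster.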

👟 Adidas - Black Friday Survival

Context: Adidas runs global ecommerce with massive traffic spikes during product drops (limited edition sneakers sell out in seconds) and Black Friday. Before Kubernetes, their infrastructure couldn't handle the load.

What they use it for:

  • Global ecommerce platform on AWS EKS
  • Product launch pages that need to scale from zero to millions in seconds
  • CI/CD across dozens of teams

Real problems solved:

The Yeezy Drop Problem: When Kanye West's Yeezy sneakers dropped, adidas.com needed to handle millions of concurrent users in seconds. Their old infrastructure would crash. With Kubernetes:

  1. Pre-scale clusters before the drop (known traffic event)
  2. HPA handles the overflow automatically
  3. Circuit breakers (via Istio service mesh) prevent cascade failures
  4. After the drop, scale back down, paying only for what was used
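
Step 3 can be sketched with an Istio DestinationRule, which is where Istio configures circuit breaking. The host name and every threshold below are assumptions chosen for illustration, not Adidas's values.

```yaml
# Illustrative circuit breaker for a hypothetical checkout service.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-circuit-breaker
spec:
  host: checkout.shop.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100            # cap on concurrent connections to the service
      http:
        http1MaxPendingRequests: 50    # requests allowed to queue before being rejected
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5          # eject a pod after 5 consecutive 5xx responses
      interval: 10s                    # how often hosts are evaluated
      baseEjectionTime: 30s            # minimum time an ejected pod stays out
      maxEjectionPercent: 50           # never eject more than half of the pods
```

Rejecting excess requests quickly keeps a saturated service from dragging down its callers, which is the cascade the list above refers to.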

Results:

  • Release cadence went from every 4–6 weeks to multiple times a day, with individual deployments taking minutes
  • From 10 deployments/year to thousands/year
  • Infrastructure costs reduced while handling more traffic

Key quote from Adidas engineering:

"We went from infrastructure being a bottleneck for business to infrastructure being completely invisible to the business."


🏦 Goldman Sachs β€” Finance Meets Cloud Native

Context: Goldman Sachs runs some of the most latency-sensitive, compliance-heavy workloads in the world: trading systems, risk calculations, client data. Adopting Kubernetes in finance is harder because of regulatory requirements.

What they use it for:

  • Running internal developer platforms on Kubernetes
  • Marquee (their external developer platform) built on Kubernetes
  • Risk calculation batch jobs (running millions of Monte Carlo simulations)

Real problems solved:

  • Compliance & auditability: Kubernetes RBAC + audit logging provides the access control and audit trails regulators require
  • Batch compute: Running massive risk calculation jobs that spin up thousands of pods, crunch numbers, and terminate, paying only for the compute time used
  • Developer platform standardization: Giving quants and engineers the same self-service deployment experience with guardrails
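
Self-service with guardrails usually comes down to narrowly scoped RBAC. A minimal sketch, assuming a hypothetical `risk-batch` namespace and a `quant-engineers` group supplied by the identity provider; none of these names come from Goldman Sachs:

```yaml
# Illustrative: members of one group may manage Deployments and Jobs in one namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: risk-batch
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: quants-deployer
  namespace: risk-batch
subjects:
  - kind: Group
    name: quant-engineers            # hypothetical group name from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io
```

Because the Role is namespaced, the binding grants nothing outside `risk-batch`, and every request made under it is attributable to a user in the API server's audit log when auditing is enabled.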

Finance-specific Kubernetes patterns:

  • Pod security controls: enforce security standards (no root containers, read-only filesystems). PodSecurityPolicy filled this role until it was removed in Kubernetes 1.25; Pod Security Admission and policy engines replace it
  • Network Policies: strict inter-service communication rules
  • OPA Gatekeeper: policy-as-code to enforce compliance rules across all deployments
  • Dedicated nodes: sensitive trading workloads run on dedicated node pools, isolated from general workloads
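
Strict inter-service rules typically start from a default-deny policy with explicit allowances layered on top. A minimal sketch, assuming a hypothetical `trading` namespace with `pricing` and `order-gateway` workloads; the names and port are illustrative:

```yaml
# Illustrative: block all traffic in the namespace, then allow one specific path.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: trading
spec:
  podSelector: {}                # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-pricing
  namespace: trading
spec:
  podSelector:
    matchLabels:
      app: pricing
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: order-gateway   # only the gateway may call the pricing service
      ports:
        - protocol: TCP
          port: 8443
```

These policies only take effect when the cluster's network plugin enforces NetworkPolicy. Since the deny rule also covers egress, the gateway needs a matching egress allowance (and usually one for DNS) before the path works end to end.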


🏭 Industry-Wise Adoption

| Industry | Primary Use Cases | Notable Users |
| --- | --- | --- |
| Streaming / Media | Content delivery, video encoding, recommendation ML | Netflix, Spotify, NYT, BBC |
| E-Commerce / Retail | Traffic spike handling, order processing, inventory | Adidas, Zalando, Shopify, Target |
| Finance | Risk calculation, trading platforms, compliance | Goldman Sachs, Capital One, Fidelity |
| Ride-sharing / Logistics | Real-time matching, tracking, dispatch | Uber, Lyft, DoorDash |
| Travel / Hospitality | Booking, pricing, availability | Airbnb, Booking.com, Expedia |
| Healthcare | Patient data processing, imaging AI, compliance | Many with strict HIPAA configurations |
| Telecommunications | Network function virtualization (NFV), 5G | Verizon, AT&T, Deutsche Telekom |
| Gaming | Game server orchestration, matchmaking | Activision, EA, Riot Games |

Built with ❤️. Contributions and corrections welcome.