
What-If Scenarios

What breaks, why it cascades, and how to design against it. Every scenario includes a peak traffic variant — failures at 10× normal load are fundamentally different from failures at idle.


Scenario 1: Redis Cluster Fully Crashes

When the cache layer disappears, all traffic falls through to the database simultaneously — triggering a cascade that can take down the entire system.

System: App Server → Redis Cache → Database

What breaks

Every request is now a cache miss. All traffic that was being absorbed by Redis hits the DB directly.

Traffic comparison:

Normal:      1,000 req/sec → Redis (90% hit) → 100 req/sec to DB
After crash: 1,000 req/sec → DB directly    → DB overwhelmed

Cascade

  1. DB receives 10× normal read traffic instantly
  2. Queries slow down, DB connection pool fills up
  3. App server threads all blocked waiting on DB connections
  4. App servers timeout or crash — load balancer health checks fail
  5. LB removes app servers from rotation — remaining servers absorb more load — they fail too
  6. User sees a 500 Internal Server Error or 503 Service Unavailable after 30 seconds of waiting

At peak (10× traffic)

The cascade takes seconds, not minutes. At 10× traffic the DB jumps from its usual 100 req/sec to 10,000 req/sec the instant Redis dies, a 100× increase. Almost certainly fatal without layered protection.

Prevention

No single component saves you. Three layers working together:

Failure flow (Redis down):

Request → [L1 in-process cache] → [Redis] → [DB with circuit breaker]
                ↓ (Redis down)
         L1 hit?  → serve stale (80% of traffic, Pareto rule)
         L1 miss? → circuit breaker decides:
                      CLOSED → go to DB (controlled rate via connection pool)
                      OPEN   → return degraded response immediately

Layer 1 — L1 in-process cache (always-on, not just a fallback):

L1 lives inside the app server's own memory (Caffeine in Java, functools.lru_cache in Python). It is always on, not just a fallback for when Redis is down, and it saves a network hop for the hottest keys before requests even reach Redis.

The Pareto rule applies: ~1% of keys get ~80% of traffic. L1 only needs to hold ~10K items to cover the overwhelming majority of requests. When Redis dies, L1 keeps serving those hot keys with no DB pressure at all.
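
A minimal sketch of such a bounded L1 (hypothetical `L1Cache` class; Caffeine and `functools.lru_cache` are the production equivalents):

```python
from collections import OrderedDict

class L1Cache:
    """Bounded in-process LRU: holds ~10K hot keys, evicts the least recently used."""

    def __init__(self, max_items: int = 10_000):
        self.max_items = max_items
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                  # L1 miss: fall through to Redis
        self._data.move_to_end(key)      # mark as recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)   # evict the coldest key
```

Because the cap is small and eviction is LRU, the cache converges on the Pareto hot set by itself, with no explicit "hot key" bookkeeping.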

Layer 2 — Circuit breaker on DB (for L1 misses):

  • Circuit CLOSED: DB is handling load. L1 misses go to DB normally (still protected by connection pool limit).
  • Circuit OPEN: DB is overwhelmed. Skip the DB call entirely. Return fallback in < 1ms. DB gets breathing room to recover.
  • Circuit HALF-OPEN: After 30s, probe with one request. Success → close. Failure → stay open.
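
The three states above can be sketched as follows (a simplified illustration with an injectable clock; the thresholds and the `fallback` hook are assumptions, not any specific library's API):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after N consecutive failures; OPEN -> HALF_OPEN after a
    reset timeout; HALF_OPEN probes with one request and closes on success."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()        # fail fast: DB gets breathing room
            self.state = "HALF_OPEN"     # timeout elapsed: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"      # probe failed or threshold hit: trip open
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"            # success closes the circuit
        return result
```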

What the user sees:

State                           User experience
L1 hit                          Normal response, maybe seconds stale. Nobody notices.
L1 miss, circuit closed, DB OK  Normal response, slightly slower (~10ms DB read)
L1 miss, circuit open           Fast 503 in < 1ms, or a degraded response (empty feed, "try again")
No protection at all            500 after a 30-second timeout. DB never recovers.

Layer 3 — Graceful degradation (for circuit-open responses):

  • Feed: return empty or show a "content temporarily unavailable" message
  • User profile: return last known data from L1 even if stale
  • Search: return empty results with a retry suggestion
  • Payments/writes: queue the request, return 202 Accepted, process when recovered

Circuit breaker: application code vs infrastructure:

  • Service mesh (Istio + Envoy): Sidecar proxy intercepts all traffic, circuit breaks at network layer. No app code changes. Coarser — can't return business-level fallback values.
  • RDS Proxy (AWS): Sits in front of DB, pools connections, queues during failover. Prevents connection exhaustion specifically.

Best practice: service mesh for the network layer + application-level for fallback logic.

Redis failure prevention:

  • Redis Cluster with replicas: Automatically promotes a replica when a node dies. ~1–10 second gap, not total loss.
  • TTL jitter: TTL = base + random(0, 30s). Prevents all keys expiring simultaneously → thundering herd.
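
TTL jitter is a one-line change; sketched here as a hypothetical helper (the Redis call in the comment is illustrative):

```python
import random

def ttl_with_jitter(base_seconds: int, jitter_seconds: int = 30) -> int:
    """TTL = base + random(0, 30s). Keys written together expire spread out,
    so a batch of hot keys never expires at the same instant (no thundering herd)."""
    return base_seconds + random.randint(0, jitter_seconds)

# usage sketch: redis.set(key, value, ex=ttl_with_jitter(3600))
```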

Recovery

  1. Redis restarts — cache is cold (empty). Opening full traffic immediately = thundering herd repeat.
  2. Cache warming first: Pre-populate Redis with the hottest keys (top 1% that L1 was covering). Run a warming script or replay recent read logs.
  3. Gradually re-enable traffic (10% → 25% → 100%) while watching DB load.
  4. L1 stays warm throughout — users on L1 hits never noticed the outage.
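
Step 2 can be sketched as a warming loop; the `fetch_from_db` and `set_in_redis` callables are hypothetical stand-ins for real clients:

```python
def warm_cache(hot_keys, fetch_from_db, set_in_redis, batch_size=100):
    """Pre-populate Redis with the hottest keys (top ~1%) before reopening
    traffic. Batching keeps warming load on the DB at a controlled rate
    (a real script would also sleep between batches)."""
    warmed = 0
    for i in range(0, len(hot_keys), batch_size):
        for key in hot_keys[i:i + batch_size]:
            value = fetch_from_db(key)
            if value is not None:        # key may have been deleted upstream
                set_in_redis(key, value)
                warmed += 1
    return warmed
```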

Interview takeaway

  • L1 in-process cache (Caffeine) should be always-on, not a Redis fallback — it absorbs the Pareto hot keys before any network hop
  • Circuit breaker on the DB prevents connection exhaustion; fast fallback beats a 30-second timeout that kills the whole pool
  • Cache warming is mandatory after Redis restart — cold traffic spike = thundering herd repeat
  • Redis Cluster + TTL jitter are the prevention controls; L1 + circuit breaker are the resilience controls

Scenario 2: Database Primary Goes Down

All writes fail for a 30–60 second window while the replica is promoted — how you handle that window determines whether users see errors or barely notice.

System: App Server → DB Primary (+ standby replica)

What breaks

  • All writes fail immediately
  • Reads from primary fail
  • If replica is async: data written in the last few seconds may be lost (RPO = seconds)
  • Replica promotion takes 30–60 seconds (RDS Multi-AZ) or longer if manual

Cascade

  1. Primary dies — write requests start failing immediately
  2. Replica exists but hasn't been promoted yet; DNS still points at the dead primary
  3. App retries against dead host — threads pile up, connection pool exhausts
  4. App servers cascade-crash if retry logic is aggressive
  5. After 30–60 seconds, replica is promoted and DNS updated
  6. Traffic resumes — but any writes during the window are lost or need replay

At peak (10× traffic)

Write failures spike. Aggressive retries against the dead host pile up threads faster — app servers cascade-crash before failover completes, compounding the outage.

Prevention

The failover window (30–60 seconds) is the core problem — the replica is healthy but promotion takes time. Pick the right option:

Option 1 — Database proxy (simplest, no app changes): A database proxy sits between app and DB. When the primary dies, it queues incoming write requests for a few seconds and then reconnects to the promoted replica. No app code changes are needed.

Flow (with DB proxy):

App → Proxy → [primary dies] → Proxy queues → new primary up → flush queue

Pick the proxy for your database: PgBouncer or Pgpool-II for PostgreSQL, ProxySQL or MySQL Router for MySQL, RDS Proxy for any DB on AWS.

Option 2 — Retry with exponential backoff + idempotency key: Write fails → app catches the error → waits 1s, retries → waits 2s, retries → failover completes in ~30–60s → retry succeeds. Idempotency key on the request ensures no double-writes if a write actually landed before the crash.
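
A sketch of Option 2, assuming a generic `execute_write` callable that raises `ConnectionError` while the primary is down (not any particular driver's API):

```python
import time
import uuid

def write_with_retry(execute_write, payload, max_attempts=6, base_delay=1.0,
                     sleep=time.sleep):
    """Retry a write through the 30-60s failover window with exponential backoff.
    One idempotency key is reused across all attempts, so there is no double-write
    if an earlier attempt actually landed just before the crash."""
    idempotency_key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return execute_write(payload, idempotency_key)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                                  # failover took too long
            sleep(base_delay * 2 ** attempt)           # 1s, 2s, 4s, 8s, 16s
```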

Option 3 — Return 503 + Retry-After header: Accept that writes are unavailable for 30–60 seconds. Return a proper HTTP 503 with Retry-After: 60. Correct choice when brief write unavailability is acceptable.

Option 4 — Kafka write buffer (only if Kafka is already in your stack): App servers publish writes to Kafka during the failover window. Once the new primary is up, consumers replay the buffered writes. Don't introduce Kafka just for this — Options 1–3 are simpler.

Data loss prevention:

  • Synchronous standby: Primary waits for the standby to confirm before acking the write. RPO = 0. Costs +5–20ms write latency. Worth it for financial systems, not for social apps.
  • Async standby (default): Primary acks immediately, replication happens in background. RPO = seconds. Faster writes but small data loss window on crash.
  • Read from replica during downtime: Route all reads to the replica immediately on primary failure. Replica stays alive and serving reads throughout.

Automatic failover setup:

  • AWS RDS Multi-AZ: Automatic promotion, ~60 second RTO. DNS updated automatically — if using RDS Proxy or PgBouncer, app reconnects transparently.
  • Patroni (self-managed Postgres): Open-source HA manager using etcd/Consul/ZooKeeper for leader election. Auto-promotes replica.

Recovery

  1. Replica promoted to new primary automatically.
  2. Database proxy (PgBouncer, ProxySQL, RDS Proxy) reconnects to new primary — no app restart needed.
  3. Old primary comes back online as a new replica and catches up via WAL replay.
  4. Monitor replication lag on the new replica — it starts at zero and needs time to sync.

Interview takeaway

  • The 30–60 second promotion window is the design problem — not the failover itself; solve it with a proxy or retry logic
  • DB proxy (PgBouncer/RDS Proxy) queues writes transparently during the window — app code sees no connection errors
  • Synchronous standby = RPO 0 at +20ms write latency; async standby = faster writes but seconds of potential data loss — match to business requirement
  • Read traffic should be routed to replica immediately on primary failure — replica is healthy throughout the window

Scenario 3: Kafka Consumer Lag Grows Uncontrollably

Consumers fall behind producers and the backlog grows — left unchecked, messages are deleted by retention policy and data is permanently lost.

System: Producer → Kafka → Consumers → DB

What breaks

Consumers can't keep up with producer throughput. Messages pile up in Kafka. Lag grows from seconds → minutes → hours.

  • Real-time features become stale (notifications delayed, feeds outdated)
  • If lag hits retention limit, old messages are deleted — permanent data loss
  • Downstream systems waiting on events appear frozen

Cascade

  1. Produce rate exceeds consumer throughput — offset gap widens
  2. Real-time features (notifications, feeds) become increasingly stale
  3. Lag approaches retention boundary — delete-before-consume becomes imminent
  4. Messages deleted by retention — permanent data loss, no recovery possible

At peak (10× traffic)

A traffic spike (flash sale, live event) can push produce rate 10× above consumer capacity in minutes. Lag goes from 0 to hours quickly. At this point data loss is a matter of when, not if.

Prevention

  • Monitor consumer lag: Alert when lag exceeds N minutes. Don't wait for it to be hours.
  • Scale consumers horizontally: Add more consumer instances — max parallelism is capped at partition count.
  • Pre-scale partitions: Partition count sets the ceiling for consumer parallelism. Set partitions higher than you need today — adding partitions later remaps keys to different partitions and breaks per-key ordering.
  • Separate fast and slow consumers: Put slow consumers (ML processing, analytics) on a separate consumer group from fast consumers (real-time notifications). Slow consumers lagging doesn't affect fast ones.
  • Retention buffer: Set Kafka retention long enough to survive your worst-case lag scenario. If lag could hit 6 hours, retain messages for at least 24 hours.
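
A minimal lag check along the lines of the first bullet. Offset numbers would come from the Kafka admin API in practice; this sketch takes them as plain dicts:

```python
def lag_estimate_seconds(latest_offsets, committed_offsets, produce_rate_per_sec):
    """Per-partition lag, converted from messages-behind into an approximate
    time-behind using the current produce rate."""
    lags = {}
    for partition, latest in latest_offsets.items():
        behind = latest - committed_offsets.get(partition, 0)
        lags[partition] = behind / produce_rate_per_sec
    return lags

def should_alert(lags, threshold_seconds=300):
    """Alert on minutes of lag (5 min here), long before retention is at risk."""
    return any(lag > threshold_seconds for lag in lags.values())
```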

Recovery

  1. Add more consumer instances (up to partition count).
  2. If consumers are slow due to a bad query, fix the query first — more consumers won't help.
  3. Temporarily throttle producers if lag is critical.
  4. Monitor lag going down — recovery can take hours if the backlog is large.

Interview takeaway

  • Partition count is the hard ceiling on consumer parallelism — set it higher than you need today; you can't easily add partitions later
  • Separate slow consumers (ML, analytics) from fast consumers (notifications) into different consumer groups — one slow group doesn't drag down real-time
  • Alert on lag in minutes, not hours — lag compounds; waiting for it to be obvious means data is already at risk of deletion
  • Retention buffer must exceed your worst-case lag scenario with headroom — otherwise recovery becomes impossible

Scenario 4: Sudden 10× Traffic Spike

Black Friday, a viral moment, or a live event ends — each layer fails in a predictable order unless you've designed for headroom.

System: CDN → LB → App Servers → Redis → DB

What breaks

  1. App server CPU pegs at 100% first — can't process requests fast enough
  2. Request queue on LB grows — users see slow responses
  3. Redis connections saturate — Redis has a connection limit (~10K default)
  4. DB connection pool exhausts — new DB connections rejected
  5. App servers OOM-killed, health checks fail, LB removes them — makes things worse

Cascade

  1. Traffic spike arrives — CDN absorbs static content, dynamic requests hit app servers
  2. App servers saturate — CPU at 100%, threads queued
  3. Auto-scaling kicks in — but takes 2–5 minutes; spike hits before new instances are healthy
  4. Redis connection limit hit — new requests can't open connections
  5. DB connection pool exhausted — queries rejected, app threads hang
  6. App servers OOM-killed — LB removes them — remaining servers get more load — they fail too

At peak (10× traffic)

Layer                           Survives 10× spike?                      Why
CDN                             Yes                                      Infinitely scalable edge
Load Balancer                   Yes                                      Managed, auto-scales
App Servers                     Only with auto-scaling + warm capacity   Takes 2–5 min to scale
Redis                           Yes (if connections not exhausted)       In-memory, very fast
DB (with connection pooler)     Probably                                 PgBouncer absorbs the connection storm
DB (without connection pooler)  No                                       Connection exhaustion

Prevention

  • CDN for everything possible: Static content, API responses with cache headers. A CDN cache hit never hits your servers. Best protection against traffic spikes.
  • Auto-scaling groups: App servers scale out automatically when CPU > 60%. Takes 2–5 minutes for new instances to be healthy — not instant; need to absorb the initial spike with existing capacity.
  • Connection pooler for DB: PgBouncer sits between app servers and DB. App servers open thousands of connections to PgBouncer; PgBouncer maintains a small pool (~100) to DB. Prevents DB connection exhaustion.
  • Queue as buffer: For write-heavy paths, put Kafka/SQS in front of processing. App servers just enqueue — processing happens at a controlled rate.
  • Rate limiting at API Gateway: Cap requests per user/IP. Prevents one client from consuming all capacity. Sheds load gracefully before it reaches your servers.
  • Load shed: When the system is above capacity, reject low-priority requests with 503. Better to fail fast for some users than to slow down for all.
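
The rate-limiting and load-shedding bullets can be combined in a token-bucket sketch (per-client parameters are illustrative; the clock is injectable for testing):

```python
class TokenBucket:
    """Per-client token bucket: `capacity` allows a burst, `rate_per_sec` is the
    sustained limit. A denied request is shed (429/503) instead of queued."""

    def __init__(self, rate_per_sec: float, capacity: float, clock):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock               # e.g. time.monotonic in production
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # refill proportionally to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                     # over the limit: fail fast, protect layers below
```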

Recovery

  1. Auto-scaling adds instances — takes 2–5 minutes to become healthy.
  2. Rate limiting and load shedding reduce pressure on overloaded layers while scaling completes.
  3. CDN cache warms up — repeated requests stop hitting origin.
  4. Monitor DB connection count and Redis connection count returning to normal as load distributes.

Interview takeaway

  • CDN is the most effective spike protection — cache hits never touch your servers
  • Auto-scaling takes 2–5 min; absorb the initial spike with pre-warmed capacity, rate limiting, or load shedding
  • Connection pooler (PgBouncer) is non-negotiable — DB connection exhaustion is the most common cascade trigger at peak
  • Load shedding (fail fast for some) beats uniform degradation for all — preserve core functionality for most users

Scenario 5: Network Partition Between Two Datacenters

The two regions can no longer communicate — CAP theorem forces a choice between staying available with diverging data, or pausing writes to maintain consistency.

System: Active-Active multi-region (Region A + Region B)

What breaks

Regions can't communicate. Both regions are still up internally but can't sync with each other. This is the CAP theorem in practice.

Two choices (you must pick one):

Option A — Prioritize Availability: Both regions keep serving traffic independently. Writes in Region A and Region B diverge. When partition heals, you have conflicting data that must be merged or one side wins.

Option B — Prioritize Consistency: One region becomes read-only (or fully rejects writes) during the partition. No conflicting writes. Data is consistent when partition heals. But one region's users can't write.

Cascade

  1. Network partition occurs — inter-region replication stops
  2. Both regions continue accepting writes independently
  3. Data diverges — the same record may be updated differently in each region
  4. Partition heals — conflicting versions must be resolved; one or both writes may be wrong

At peak (10× traffic)

More writes during the partition window = more conflicts to resolve when the partition heals. High-write-rate systems accumulate conflicts faster and face a more expensive reconciliation.

Prevention

  • Redundant network paths between regions (multiple ISP connections)
  • Health checks with short timeout to detect partition fast
  • Design for partition-tolerance at the data level (vector clocks, conflict-free data types)

Recovery

  • Last-write-wins (LWW): Use timestamp to pick winner. Simple. Risk: clock skew causes the wrong write to win.
  • Application-level merge: App understands the data and knows how to merge (e.g. a shopping cart — union of items from both sides).
  • User-visible conflict: Show the user both versions and let them pick (Google Docs history, Dropbox conflicted copy).

Real examples:

  • DynamoDB: eventual consistency by default. Partition heals, last write wins.
  • Zookeeper: stops accepting writes during partition to maintain consistency.

Interview takeaway

  • CAP theorem forces a binary choice at partition time: availability (diverging writes) or consistency (one region goes read-only)
  • Match conflict resolution to business semantics — LWW works for user preferences, but not for inventory or payments where both writes may be valid
  • Short DNS TTL (60s) is a prerequisite for fast region failover — long TTL makes DNS-based failover meaningless
  • Redundant inter-region network paths reduce partition probability; vector clocks or CRDTs reduce the damage when it happens

Scenario 6: Slow DB Query Cascades Across the System

One unindexed query holds a DB connection for seconds — at scale, the connection pool exhausts in under a second and takes the whole app down with it.

System: App Server → DB

What breaks

One slow query (missing index, N+1, full table scan) holds a DB connection for seconds instead of milliseconds.

Cascade

  1. Slow query holds 1 connection for 10 seconds
  2. At 100 QPS, 10 seconds = 1,000 queries all waiting
  3. DB connection pool (size 100) exhausts in < 1 second
  4. New requests can't get a DB connection → throw errors
  5. App server threads all blocked waiting for connections → OOM or timeout cascade
  6. Load balancer health checks fail → removes app server → remaining servers get more load → they fail too
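
The arithmetic in steps 1–3 is Little's law: average concurrency equals arrival rate times time in system.

```python
def in_flight(qps: float, seconds_per_query: float) -> float:
    """Little's law: concurrent queries = arrival rate x time each query holds
    a connection."""
    return qps * seconds_per_query

# 100 QPS of a 10-second query -> 1,000 concurrent queries against a pool of
# 100 connections: exhausted almost immediately. The same 100 QPS at a healthy
# 10ms per query needs only ~1 connection.
```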

At peak (10× traffic)

What takes 30 seconds at normal load takes 3 seconds at 10× load. Connection pool exhaustion is near-instant.

Prevention

  • Query timeout: Set a max execution time on every query (e.g. 2 seconds). Slow queries are killed, connection freed. Better to fail one request than deadlock everything.
  • Connection pool limit: PgBouncer caps connections to DB. App servers queue at PgBouncer rather than overwhelming DB.
  • Read replica for heavy queries: Route analytics/reporting queries to a replica so they don't compete with transactional queries on primary.
  • Query monitoring: Slow query log + alerting. Know before users tell you.
  • Index reviews: EXPLAIN ANALYZE before deploying any new query to production.

Recovery

  1. Identify the slow query (pg_stat_activity, slow query log).
  2. Kill it: SELECT pg_cancel_backend(pid) or pg_terminate_backend(pid).
  3. Once killed, connection pool drains, backlog clears, system recovers within seconds.
  4. Add the missing index. Deploy without downtime using CREATE INDEX CONCURRENTLY.

Interview takeaway

  • One slow query at 100 QPS can exhaust a 100-connection pool in under a second — the blast radius is system-wide, not query-scoped
  • Query timeout is the most important single control — it kills the query, frees the connection, and stops propagation
  • CREATE INDEX CONCURRENTLY adds indexes without locking the table — safe for production deployments
  • EXPLAIN ANALYZE every new query before production, not just in dev — slow queries are invisible until they aren't

Scenario 7: App Server Memory Exhaustion (OOM)

The OS kills the process when memory runs out — all in-flight requests are dropped, and the server may crash-loop if the root cause isn't fixed first.

System: LB → App Servers

What breaks

App server process consumes all available RAM. OS OOM killer terminates the process. All in-flight requests on that server are dropped. Load balancer health check detects the crash and routes away — but the instance may keep restarting and crashing immediately if the root cause isn't fixed.

Common causes at peak traffic:

  • Large response objects held in memory for too many concurrent requests
  • Memory leak (objects not garbage collected under load)
  • In-process cache growing without bound

Cascade

  1. Memory usage climbs under load — garbage collector can't keep up
  2. OS OOM killer terminates the process — all in-flight requests dropped
  3. Instance restarts — if root cause persists, it crashes again immediately (crash-loop)
  4. LB health check detects crash, routes traffic to remaining servers
  5. Remaining servers absorb extra load — their memory usage climbs — they crash too

At peak (10× traffic)

A slow memory leak that takes hours at idle hits the ceiling in minutes under load. Crash-looping spreads to all instances faster than auto-scaling can add new ones.

Prevention

  • Set memory limits on containers/processes. OOM kill is contained, doesn't affect other processes on the same host.
  • Cap in-process cache size. Bounded caches (LRU with max size) prevent unbounded growth.
  • Horizontal scaling before hitting limits. Scale out at 70% memory utilization, not 100%.
  • Circuit breaker for large payloads. Reject requests that would require allocating more than N MB to process.

Recovery

  1. LB removes OOM-killed instance from rotation automatically.
  2. Instance restarts (auto-restart policy in ECS/K8s).
  3. Root cause: find the memory hog (heap dump, profiler) before it crashes again.

Interview takeaway

  • OOM kill is a clean failure — LB detects it and routes away; the danger is crash-looping if the root cause isn't addressed before restart
  • Container memory limits (K8s/ECS) isolate the blast radius to one container — without limits, the OOM killer may target other processes on the same host
  • Scale-out trigger at 70% memory utilization, not 100% — OOM kills happen suddenly at the ceiling, not gradually
  • Bounded in-process caches (LRU with a size cap) are the most common fix — unbounded caches are the most common cause

Scenario 8: CDN Goes Down

The CDN absorbs 90–95% of traffic for video platforms — when it fails, all of that traffic hits your origin simultaneously and the origin simply cannot cope.

Most relevant for: Video streaming (Netflix, YouTube), any app serving static assets globally.

System: Users → CDN → Origin Servers

What breaks

CDN absorbs 90–95% of traffic for video platforms. When it fails, all of that traffic hits your origin servers simultaneously.

Traffic comparison:

Normal:   1M req/sec → CDN (95% hit) → 50K req/sec to origin
CDN down: 1M req/sec → origin directly → origin designed for 50K, now gets 1M

Cascade

  1. CDN goes down — all traffic bypasses cache and hits origin
  2. Origin servers overwhelmed instantly — CPU pegs at 100%
  3. Video chunks can't be served fast enough — users see buffering
  4. Origin DB and storage layer get hammered — slow reads, timeouts
  5. Users see 503s or video players show error screens

At peak (10× traffic)

A CDN outage during a major live event (World Cup, product launch) is catastrophic. The origin simply cannot absorb the full load. Even a partial CDN degradation during peak is effectively fatal without multi-CDN.

Prevention

  • Multi-CDN: Run two CDN providers simultaneously (Cloudflare + CloudFront). If one fails, DNS fails over to the other in seconds. Netflix does this.
  • Keep origin auto-scalable: Origin should handle 2–3× normal load even though CDN absorbs most traffic. When CDN degrades partially, origin traffic rises gradually.
  • Browser cache headers: Cache-Control: max-age=3600 means users have a local browser cache. Short CDN outages are invisible to users who already fetched the content.
  • Stale-while-revalidate: Serve stale CDN content while fetching fresh from origin in the background. CDN degraded ≠ CDN dead.

Recovery

  1. Switch DNS to backup CDN provider — TTL must be short (60s) for fast failover.
  2. Scale up origin auto-scaling group to absorb extra traffic during switchover.
  3. CDN cache starts cold on the new provider — expect a cache miss storm for the first few minutes until it warms up.

Interview takeaway

  • Multi-CDN (Cloudflare + CloudFront) is the only real solution — failover in seconds via DNS; single CDN is a SPOF for traffic-heavy systems
  • Short DNS TTL (60s) is a hard prerequisite for fast CDN failover — default TTLs of 3600s make it useless
  • Origin must handle 2–3× its normal load to survive partial CDN degradation — design capacity accordingly
  • Cold cache on new CDN = miss storm for first few minutes; pre-warm critical content or accept degraded performance during switchover

Scenario 9: WebSocket Server Crashes

Unlike stateless servers, a WebSocket server crash disconnects every pinned client simultaneously — the thundering herd reconnect and message loss make this uniquely painful.

Most relevant for: Real-time chat (WhatsApp, Slack), multiplayer gaming, live collaboration.

System: Clients → LB → WebSocket Servers → Redis + DB

What breaks

WebSocket connections are stateful — each client is pinned to a specific server. When that server crashes, all its connected clients disconnect simultaneously.

Failure flow:

WebSocket server (10K clients connected) crashes
→ 10K clients all try to reconnect at the same moment
→ thundering herd on auth service + remaining WebSocket servers
→ in-flight messages at time of crash are lost

This is different from a stateless app server crash. You can't just route to another server — the connection state (who is connected, what channel they're in) is gone.

Cascade

  1. WebSocket server crashes — all pinned connections drop simultaneously
  2. 10K clients attempt reconnect at the same moment — thundering herd on auth service
  3. Auth service saturates — slow token validation delays reconnects
  4. Remaining WebSocket servers absorb reconnecting clients — memory spikes
  5. Messages sent during the crash window are lost (if not persisted)

At peak (10× traffic)

More clients per server means a single crash disconnects more clients simultaneously. The reconnect storm is proportionally larger, and remaining servers are already near capacity.

Prevention

  • Client reconnect with exponential backoff + jitter: On disconnect, client waits random(0, 1s), retries. If that fails, waits random(0, 2s), etc. Jitter spreads the reconnect storm across seconds rather than hitting all at once.
  • Store session state in Redis, not server memory: Connected user list, room memberships, undelivered messages — all in Redis. When a client reconnects to any server, it can restore its session from Redis.
  • Message acknowledgment protocol: Client acks every received message. Server marks unacked messages in Redis. On reconnect, server replays unacked messages to the client. No silent message loss.
  • Persist messages to DB: Every sent message written to DB before delivery attempt. If server crashes mid-delivery, messages survive and can be replayed on reconnect.
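
The first bullet is the classic exponential-backoff-with-full-jitter schedule; a sketch (the cap and attempt count are illustrative):

```python
import random

def reconnect_delays(max_attempts=6, cap_seconds=30.0, rng=random.random):
    """Attempt n waits random(0, min(cap, 2**n)) seconds. With 10K clients each
    drawing independent delays, the reconnect storm spreads over seconds instead
    of hitting the auth service and remaining servers at the same instant."""
    return [rng() * min(cap_seconds, 2.0 ** n) for n in range(max_attempts)]
```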

Recovery

  1. Load balancer detects crashed server, stops routing new connections to it.
  2. Clients reconnect with backoff — spread over 10–30 seconds due to jitter.
  3. Each client reconnects to any available server, restores session from Redis.
  4. Server queries DB/Redis for undelivered messages, pushes them to reconnected client.

Interview takeaway

  • Session state (connections, room memberships, unacked messages) must live in Redis, not server memory — otherwise a crash loses all state permanently
  • Client exponential backoff + jitter is non-negotiable — without it, 10K simultaneous reconnects are as damaging as the crash
  • Message ack protocol + DB persistence ensures no silent message loss; without it, crashes silently drop in-flight messages
  • Sticky session awareness in the LB is required for WebSocket — plain round-robin routes reconnects to wrong servers without session restore

Scenario 10: Third-Party Payment Processor Goes Down

Your system is healthy but the payment network isn't — how you queue, retry, and reconcile determines whether you lose revenue or just delay it.

Most relevant for: Any system accepting payments (e-commerce, SaaS billing, marketplaces).

System: App Server → Payment Processor (Stripe / Braintree)

What breaks

Your system is healthy. Stripe, Visa network, or your payment gateway is down. Payment requests reach your server fine but fail at the processor call.

Failure flow:

User submits payment → your server OK → call Stripe API → 503 / timeout
→ charge fails → user sees error → you've potentially lost the sale

Cascade

  1. Processor returns 503 / timeout on payment calls
  2. Without a circuit breaker, app threads pile up waiting on dead processor
  3. Thread pool exhausts — payment endpoint becomes unresponsive
  4. Cascades to other endpoints sharing the same app server thread pool

At peak (10× traffic)

Processor outage at exactly the worst moment (Black Friday, flash sale). High volume of failed transactions, frustrated users, lost revenue. Thread exhaustion cascades faster under load.

Prevention

  • Queue payment requests, don't fail immediately: Instead of returning an error to the user, store the payment request in a durable queue (Kafka, SQS) and return "processing". Retry against the processor when it recovers. Works for async flows (subscription billing, B2B invoicing).
  • Circuit breaker on processor calls: Trip open after N failures. Return a friendly "payment processing delayed" message instead of a technical error. Protects your system from piling up threads waiting on a dead processor.
  • Multiple processor support (primary/fallback): Integrate two processors (Stripe + Braintree). If primary fails, route to fallback. Adds integration complexity but eliminates single point of failure.
  • Idempotency keys on every retry: When processor recovers and you replay queued payments, include the original idempotency key so duplicate submissions don't cause double charges.
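
Replay with idempotency keys might look like this sketch; the `charge` callable is a hypothetical stand-in for the processor SDK:

```python
def replay_queued_payments(queued_requests, charge, confirmed_keys):
    """Replay buffered payments after the processor recovers. `confirmed_keys`
    holds idempotency keys the processor already acknowledged (from
    reconciliation); skipping them prevents double charges on replay."""
    outcomes = {}
    for request in queued_requests:
        key = request["idempotency_key"]
        if key in confirmed_keys:
            outcomes[key] = "already_charged"
            continue
        outcomes[key] = charge(request)
        confirmed_keys.add(key)          # dedupe within the replay itself too
    return outcomes
```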

Recovery

  1. Processor comes back up — circuit breaker half-opens, probes with one request, closes on success.
  2. Replay queued payment requests in order. Monitor for duplicate charges.
  3. Run reconciliation: compare your queued requests against processor's transaction log to confirm which succeeded and which still failed.
  4. Notify affected users of outcome (charged successfully / please retry).

Interview takeaway

  • Durable queue (Kafka/SQS) + idempotency keys = no lost payment attempts and no double charges on replay — these two controls together
  • Circuit breaker prevents thread pile-up against a dead processor — return a user-friendly message, not a thread-exhausting timeout
  • Multiple processor support eliminates the SPOF — worth the integration complexity for any high-revenue flow
  • Reconciliation is mandatory after any processor outage — don't assume queue replay is complete or that the processor's state matches yours

Scenario 11: Hot Spot / Hot Key

One data point — a viral video, celebrity account, or surge zone — receives a wildly disproportionate share of traffic and overwhelms a single node while the rest of the cluster sits idle.

Most relevant for: All four systems — viral video (YouTube), downtown Uber surge, celebrity tweet, flash sale product.

System: App Server → Redis / DB Shards / Kafka Partitions

What breaks

One Redis key, one DB shard, or one Kafka partition receives dramatically more traffic than all others. That single node becomes the bottleneck while the rest of the cluster sits idle.

Traffic comparison:

Normal:    traffic spread across 10 shards evenly → each handles 10%
Hot spot:  one event/celebrity/location → 80% of traffic hits shard 3
           → shard 3 overwhelmed, shards 1–2 and 4–10 at ~2% utilization

Examples:

  • Uber: every driver in downtown Manhattan → all location writes hit the same geo shard
  • YouTube: one video goes viral → millions of requests for the same CDN cache key
  • Redis: celebrity user's follower list → one Redis key read millions of times per second
  • Kafka: all events for one user partition key → one partition gets all the load

Cascade

  1. Hot key/shard receives disproportionate traffic — node CPU or memory saturates
  2. Reads/writes to that node slow down — latency spikes for all requests touching it
  3. Retries pile up against the slow node — worsening the overload
  4. Rest of cluster is healthy — the bottleneck is structural, not capacity-wide

At peak (10× traffic)

A hot spot that causes latency degradation at normal traffic causes full node failure at 10× load. The asymmetry means capacity planning for the cluster is irrelevant — only the hot node matters.

Prevention

For Redis hot keys:

  • Local in-process cache for the hot key: Read from L1 first. Reduces Redis reads from millions to near zero for that key.
  • Key replication with random reads: Store the same value under key:1, key:2, ... key:N. Reads randomly pick one. Spreads load across N Redis nodes.
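
Key replication can be sketched like this (a dict stands in for the Redis keyspace; in a real cluster each `key:i` hashes to a different node, which is what spreads the load):

```python
import random

N_REPLICAS = 5
store = {}  # in-memory stand-in for a Redis cluster keyspace

def write_replicated(key, value, n=N_REPLICAS):
    """Store the same value under key:1 .. key:N.
    In a real cluster each suffixed key lands on a different node."""
    for i in range(1, n + 1):
        store[f"{key}:{i}"] = value

def read_replicated(key, n=N_REPLICAS):
    """Pick one replica at random — read load spreads across N nodes."""
    return store[f"{key}:{random.randint(1, n)}"]
```

Writes cost N× (one per replica), which is the tradeoff: this pattern suits read-heavy hot keys like a celebrity follower list, not write-heavy ones.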

For DB hot shards:

  • Composite shard key: Instead of sharding by city, shard by city + random_suffix. More even distribution at the cost of fan-out reads.
  • Read replicas for hot data: Route reads for popular content to dedicated replicas.

For Kafka hot partitions:

  • Increase partition count and use a more granular partition key (e.g. city_id + driver_id instead of just city_id).
  • Random salt on partition key for write-heavy paths where ordering within the key isn't required.
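
Salting the partition key can be sketched as follows (illustrative names; `hash()` stands in for Kafka's murmur2 partitioner, and — as the bullet notes — this is only safe when ordering within the base key is not required):

```python
import random

def salted_partition_key(base_key, salt_buckets=8):
    """Append a random salt so one hot base key spreads across up to
    salt_buckets distinct partition keys."""
    return f"{base_key}-{random.randint(0, salt_buckets - 1)}"

def partition_for(key, num_partitions=32):
    """Stand-in for Kafka's default partitioner: hash of the key
    modulo the partition count (Kafka actually uses murmur2)."""
    return hash(key) % num_partitions
```

With 8 salt buckets, the hot key's writes land on up to 8 partitions instead of one; consumers that need all events for the base key must now read from all of them.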

Recovery

Hot spots are design issues, not runtime failures — the system degrades slowly rather than crashing suddenly. Mitigation is applied while live: add key replication to Redis, increase partition count in Kafka (requires rebalance), add read replica for hot DB shard.

Interview takeaway

  • Hot spots are design failures, not runtime failures — the system degrades slowly and is hard to detect without per-node monitoring
  • Key replication in Redis (key:1 through key:N with random reads) spreads read load across N nodes — no code change on the write path
  • Composite shard keys (city + suffix) distribute DB writes evenly at the cost of fan-out reads — the right tradeoff for write-heavy paths
  • Partition key granularity determines Kafka hot partition risk — city_id + driver_id beats city_id alone for location events

Scenario 12: Message Ordering Broken

Network retries and different routing paths mean messages can arrive out of order — without server-side sequencing, recipients see incoherent conversations.

Most relevant for: Real-time chat, collaborative editing.

System: Client → WebSocket/API Server → Message Store + Kafka

What breaks

Client sends message A then message B. Due to different network paths or retries, message B arrives at the server first. Recipient sees B before A — conversation is incoherent.

Also happens with retries: message A times out, client resends, both copies arrive → duplicate messages out of order.

Cascade

  1. Client sends A, then B — different paths/retries cause B to arrive first
  2. Server stores B before A — sequence in store is wrong
  3. Recipient's client renders in arrival order — conversation is incoherent
  4. Client retries timed-out message A — duplicate arrives after both have been stored
  5. Recipient sees duplicate B and out-of-order A

At peak (10× traffic)

Higher server load means more retry timeouts and more multi-path routing — out-of-order delivery becomes the norm, not the exception.

Prevention

  • Server-assigned sequence numbers: Server assigns a monotonically increasing sequence number per conversation on receipt. Clients display messages in sequence order, not arrival order. Simple and effective.
  • Client-side ordering buffer: Client holds received messages for ~100ms, sorts by sequence number, then renders. Handles slight reordering without visible flicker.
  • Exactly-once delivery with idempotency key: Client includes a UUID with each message. Server deduplicates on UUID before assigning sequence number. Prevents duplicates from retries.
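
The first and third preventions combine naturally: deduplicate on the client-supplied UUID first, then assign the sequence. A minimal sketch (in-memory state; a real server would persist these counters alongside the message store):

```python
class ConversationSequencer:
    """Assigns a monotonically increasing sequence number per conversation
    on receipt, deduplicating on the client-supplied message UUID first so
    a retry can never receive a second sequence number."""

    def __init__(self):
        self._counters = {}   # conversation_id -> last assigned sequence
        self._seen = {}       # conversation_id -> {message_uuid: assigned seq}

    def accept(self, conversation_id, message_uuid, body):
        seen = self._seen.setdefault(conversation_id, {})
        if message_uuid in seen:                      # retry of a stored message
            return {"seq": seen[message_uuid], "duplicate": True}
        seq = self._counters.get(conversation_id, 0) + 1
        self._counters[conversation_id] = seq
        seen[message_uuid] = seq
        return {"seq": seq, "duplicate": False, "body": body}
```

Clients then render by `seq`, not arrival order — a late-arriving message A simply slots in before B when its lower sequence number arrives.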

Recovery

Out-of-order delivery is a design gap, not a runtime failure — there's no recovery from a conversation the recipient has already seen in the wrong order. Prevention is the only fix.

Interview takeaway

  • Server-assigned sequence numbers decouple delivery order from display order — never trust arrival order for sequencing
  • Client-side ordering buffer (~100ms) handles slight reordering without visible flicker — cheap and effective
  • Idempotency key (UUID per message) deduplicates retries before sequence assignment — prevents the same message appearing twice with two different sequence numbers
  • At scale, network retries and multi-path routing make out-of-order delivery guaranteed, not rare — design for it from the start

Scenario 13: Video Encoding Pipeline Backs Up

Upload succeeds but encoding workers fall behind — the video exists in storage but isn't watchable, and users don't know why.

Most relevant for: Video streaming platforms (YouTube, TikTok).

System: Upload Service → Object Storage → Encoding Queue (SQS/Kafka) → Encoding Workers

What breaks

Upload succeeds — the video file is in object storage — but encoding workers can't keep up. Video is uploaded but not yet watchable. Backlog grows.

Causes: Viral upload event, encoding worker crash, expensive source format requiring heavy transcoding.

Cascade

  1. Uploads spike — encoding queue depth grows faster than workers drain it
  2. Videos in "processing" state accumulate — users see no feedback and assume uploads failed
  3. Encoding workers may crash on malformed files — requeue loops block healthy messages
  4. SLA for HD availability stretches from seconds to hours

At peak (10× traffic)

A viral upload event or new feature launch can flood the queue instantly. Encoding is CPU-bound — workers can't process faster under load, only more workers help.

Prevention

  • Encode lowest quality first: Process 360p immediately, then 720p, then 1080p, then 4K. Video is playable within seconds of upload even if HD isn't ready.
  • Scale workers horizontally: Encoding is CPU-bound and embarrassingly parallel — each video encodes independently. Auto-scale encoding workers based on queue depth.
  • Show "processing" state to user: Don't leave users confused. Show a progress indicator. Set expectation that HD will be available shortly.
  • Alert on queue depth: If encoding queue exceeds N hours of backlog, page the on-call team before users complain.
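
The "lowest quality first" ordering can be sketched with a priority queue (an in-memory sketch; the rendition ladder and names are illustrative — a real pipeline would more likely use separate queues or priority tiers per rendition):

```python
import heapq

# Rendition ladder: lower number = encoded first (illustrative priorities).
RENDITION_PRIORITY = {"360p": 0, "720p": 1, "1080p": 2, "4k": 3}

def enqueue_renditions(video_id, queue):
    """Push one encoding job per rendition. The heap drains lowest quality
    first, so the video is playable as soon as the 360p job completes."""
    for rendition, priority in RENDITION_PRIORITY.items():
        heapq.heappush(queue, (priority, video_id, rendition))

def drain(queue):
    """Pop jobs in priority order — the order workers would process them."""
    order = []
    while queue:
        _, _video_id, rendition = heapq.heappop(queue)
        order.append(rendition)
    return order
```
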

Recovery

Spin up more encoding worker instances. Queue drains as workers catch up. No data loss — source video is safely in object storage.

Interview takeaway

  • Encode lowest quality first (360p → 720p → 1080p → 4K) — video is playable within seconds; HD is a subsequent improvement, not a prerequisite
  • Encoding is embarrassingly parallel — auto-scale workers on queue depth, not CPU; each video is independent
  • Source video durably stored in object storage means no data loss — a backed-up queue is a latency problem, not a durability problem
  • Show "processing" state proactively — silent failures ("my upload disappeared") generate more support load than an honest progress indicator

Scenario 14: CDC Connector Goes Down

The application keeps running normally, but the event stream silently stops — audit records go missing and the damage isn't discovered until reconciliation runs hours later.

Most relevant for: Payment systems, any system using CDC for audit trails or event-driven architecture.

System: App → DB → CDC Connector (Debezium) → Kafka → Downstream Consumers

What breaks

The application DB continues accepting writes normally. But the CDC connector — which reads the WAL/binlog and publishes to Kafka — stops. Events stop flowing downstream. DB and event stream silently diverge.

Failure flow:

App → DB (writes continue, no errors)
CDC connector ✗ stops
→ Kafka receives no new events
→ Audit service: missing records
→ Reconciliation service: blind to new transactions
→ Webhook service: merchants stop receiving notifications
→ Analytics: data goes stale

No errors surface to merchants or the application. The system appears healthy. The damage is invisible until someone queries the audit log or reconciliation runs.

Cascade

  1. CDC connector stops — WAL/binlog is still being written, but no events published
  2. Kafka topics see no new messages — downstream consumers appear idle
  3. Audit trail gaps grow silently — each new DB write is an unlogged event
  4. Reconciliation runs hours later — discovers a gap spanning thousands of transactions
  5. Merchant webhooks were never fired — merchants' systems are out of sync

Why this is dangerous: Unlike most failures, CDC failure is silent and delayed. The damage accumulates in the background and is discovered later — sometimes hours or days later during compliance checks.

At peak (10× traffic)

If CDC stops during a Black Friday spike, thousands of transactions go unlogged in the audit trail. Reconciliation runs the next morning and finds a gap of hours. The replay volume itself may overwhelm downstream consumers.

Prevention

  • Redundant CDC instances: Run multiple independent CDC readers against the same DB logs. Each writes to separate Kafka pipelines. One failure doesn't stop event flow.
  • Lag monitoring — alert within seconds: CDC lag = difference between latest DB commit and latest event published to Kafka. Should be < 1 second normally. Alert immediately if lag exceeds 10 seconds.
  • Connector health checks: Monitor connector status independently from lag. A connector can appear healthy but produce zero events.
  • Application-level fallback (critical events): For the most critical payment state transitions, app also publishes a minimal event directly to Kafka if CDC hasn't confirmed within an SLO window. Belt-and-suspenders for the highest-stakes changes.
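
Lag monitoring reduces to comparing two timestamps. A sketch using the thresholds from the bullets above (warn past 1 second, page past 10; function names are illustrative):

```python
def cdc_lag_seconds(latest_db_commit_ts, latest_kafka_event_ts):
    """Lag = time of the latest DB commit minus the time of the newest
    event published to Kafka. Clamped at zero for clock skew."""
    return max(0.0, latest_db_commit_ts - latest_kafka_event_ts)

def check_cdc_health(latest_db_commit_ts, latest_kafka_event_ts,
                     warn_after=1.0, page_after=10.0):
    """Return ('ok'|'warn'|'page', lag). Should run independently of the
    connector's own health endpoint — a connector can report healthy
    while publishing zero events."""
    lag = cdc_lag_seconds(latest_db_commit_ts, latest_kafka_event_ts)
    if lag >= page_after:
        return ("page", lag)
    if lag >= warn_after:
        return ("warn", lag)
    return ("ok", lag)
```

The key design point: the check compares DB state against Kafka state directly, so it catches the silent failure mode where the connector process is alive but producing nothing.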

Recovery

  1. Identify the gap: last event timestamp in Kafka vs current DB state.
  2. Replay from WAL/binlog offset: CDC connectors (Debezium) can be pointed at a specific log offset to replay missed events.
  3. Backfill consumers: replay events to audit service, reconciliation, webhook delivery.
  4. Run reconciliation immediately to detect any transactions that need merchant notification.

Interview takeaway

  • CDC failure is uniquely dangerous because it's silent — DB and event stream diverge while the application appears completely healthy
  • Lag monitoring (< 1s normal, alert at 10s) is the primary detection mechanism — connector health checks alone are insufficient
  • Redundant CDC instances reading the same WAL independently provide HA for the event stream without DB changes
  • For critical payment state transitions, add an application-level fallback publish if CDC confirmation breaches an SLO — belt-and-suspenders

Scenario 15: Distributed Lock Fails or Expires Mid-Operation

A worker holds a lock, pauses mid-operation (GC, OS scheduling), the lock TTL expires, a second worker acquires the same lock — both now believe they own the resource and execute simultaneously.

Most relevant for: Any system using Redis SETNX or Redlock for critical sections — ticketing, inventory decrement, payment deduplication, driver assignment.

System: Workers → Redis (SETNX / Redlock) or etcd (leases)

What breaks

Scenario A — GC pause / clock drift (Redlock race):

Worker A acquires lock on 3/5 Redis nodes (Redlock)
Worker A pauses (JVM GC, 8 second pause)
Lock TTL expires on all Redis nodes
Worker B acquires same lock on 3/5 nodes
Worker A resumes, believes it still holds the lock
Both A and B execute the critical section simultaneously
→ double charge, double ticket booking, two drivers assigned to same rider

Scenario B — etcd cluster loses quorum:

etcd 3-node cluster: 2 nodes die → cluster loses quorum
Workers holding existing leases: leases expire (TTL not renewable)
New lock requests: fail — etcd cannot process them
→ distributed cron job doesn't run (safe — missed, not duplicated)
→ leader election stalls — no new primary is promoted
→ Patroni cannot fail over your Postgres primary

Scenario C — Deadlock:

Worker A holds lock-order, tries to acquire lock-inventory
Worker B holds lock-inventory, tries to acquire lock-order
Both wait forever — no progress, no errors surfaced

Scenario D — TTL expires mid-payment:

User A acquires Redis lock on seat-123 (TTL = 10 min)
User A fills in card details — takes 9 min 50 sec
TTL expires at minute 10
User B acquires same lock, starts checkout
User A's payment completes at minute 11
Both try to write "BOOKED" to DB
→ OCC version check (or SELECT FOR UPDATE) fires: only one write succeeds
→ loser gets a conflict error → automatic refund issued

Cascade

  1. Lock expires while worker is mid-operation (GC, slow network, high load)
  2. Second worker acquires the same lock — both believe they own it
  3. Both execute the critical section — double-write, double-charge, double-assignment
  4. DB is the only thing that can catch this — if it doesn't, corruption is silent

At peak (10× traffic)

A GC pause that lasts 2 seconds at idle may last 15 seconds under memory pressure at peak — far exceeding typical 30-second lock TTLs. TTL-mid-payment races also become more frequent when checkout servers are slow under load.

Prevention

For Redis SETNX / Redlock races:

  • DB OCC as the non-negotiable safety net: Even if the distributed lock fails entirely, a @Version check (Hibernate/JPA) or UPDATE ... WHERE version = X ensures only one write wins at the DB. The distributed lock is a UX optimisation (seat appears reserved to others); OCC is what actually prevents double booking.
  • Use fencing tokens when the resource supports it: Pass etcd's revision number to the protected resource. Resource rejects stale tokens. Eliminates the race entirely.
  • Set TTL generously — extend on payment initiation: Set TTL to 10+ minutes for checkout. When the user clicks "Pay Now", extend the lock TTL by another 2–3 minutes so payment processing has headroom.
  • Lease renewal (heartbeat): Worker renews the lock TTL every N seconds while alive. Lock only expires if the worker genuinely dies. etcd leases support this natively; Redis requires periodic PEXPIRE calls.
  • Use etcd for correctness-critical locks: Financial transactions, inventory decrement, leader election — anywhere double-execution is not survivable.
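
The OCC safety net from the first bullet can be sketched with no distributed-lock machinery at all (an in-memory row stands in for the DB; `book_seat` mirrors `UPDATE seats SET status = 'BOOKED', version = version + 1 WHERE id = ? AND version = ?`):

```python
class SeatRow:
    """In-memory stand-in for a DB row with a version column."""
    def __init__(self):
        self.status = "AVAILABLE"
        self.version = 1

def book_seat(row, expected_version):
    """Optimistic concurrency check: the write only commits if the version
    the caller read is still current — only one concurrent writer can win."""
    if row.version != expected_version:
        return False              # conflict: another worker already committed
    row.status = "BOOKED"
    row.version += 1
    return True
```

Even if Workers A and B both believe they hold the lock, only the first `book_seat` call commits; the loser gets a conflict and takes the refund path — which is exactly why the distributed lock can remain a UX optimisation.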

For deadlocks:

  • Consistent lock ordering: Always acquire locks in a fixed global order (e.g. alphabetical by resource ID). If all workers acquire lock-inventory before lock-order, the circular wait is impossible.
  • Lock acquisition timeout: Set a timeout when trying to acquire a lock (e.g. 5 seconds). If you can't acquire within the timeout, release what you hold, back off with jitter, retry. Breaks the deadlock instead of waiting forever.
  • Centralize lock acquisition: Scattered lock calls across the codebase make ordering impossible to enforce. One lock manager in one place makes the ordering easy to audit.
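
Centralized, ordered acquisition is small enough to sketch in one helper (`acquire_fn` stands in for a real Redis SETNX or etcd lease call; names are illustrative):

```python
def acquire_in_order(lock_names, acquire_fn):
    """Acquire locks in a fixed global order (sorted by name) so a circular
    wait cannot form. On any failure, stop and report what is held so the
    caller can release it, back off with jitter, and retry."""
    acquired = []
    for name in sorted(lock_names):
        if not acquire_fn(name):
            return acquired, False   # caller releases `acquired` and retries
        acquired.append(name)
    return acquired, True
```

Because every caller passes through this one helper, the ordering invariant is enforced in a single auditable place rather than at every call site.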

Recovery

Race condition / TTL-mid-payment (double booking attempt):

  1. OCC version check at the DB is the catch — only one transaction commits, the other gets a conflict error.
  2. For the loser: issue an automatic refund via the payment processor using the original payment intent ID.
  3. Notify the user: "Sorry, that seat was just taken — your refund is on its way." Show alternative seats.
  4. Audit log records both attempts for reconciliation.

etcd cluster loses quorum:

  1. At least one failed node must recover to restore majority quorum.
  2. Once quorum restores, lock acquisitions and lease renewals resume normally.
  3. Review operations blocked during the outage — missed cron jobs may need a manual trigger.

Deadlock detected (hung workers):

  1. Kill the stuck workers — locks are released on TTL expiry or immediately on crash.
  2. New workers start fresh and acquire locks in the correct order.
  3. Fix the ordering bug in code before restarting — otherwise the deadlock recurs.

Interview takeaway

  • DB OCC (version column + UPDATE ... WHERE version = X) is the non-negotiable safety net — the distributed lock is a UX optimisation, not a correctness guarantee
  • Fencing tokens (etcd revision numbers passed to the resource) make lock safety provable — Redlock alone cannot guarantee correctness under GC pauses
  • Consistent lock ordering (alphabetical by resource ID) eliminates deadlock without needing detection or timeouts
  • Use etcd over Redis for correctness-critical locks — etcd leases + fencing tokens eliminate the Redlock race; Redis SETNX does not

Scenario 16: Cache Stampede on TTL Expiry

A popular cache entry expires. All concurrent requests see a cache miss simultaneously and race to rebuild it from the DB — a self-inflicted DDoS from your own application.

Most relevant for: Any system with high-traffic cached data — social feeds, product pages, event listings, URL shorteners.

System: App Server → Redis → DB

What breaks

Failure flow:

Homepage cache entry: 100,000 reads/sec, 1-hour TTL
Minute 60: TTL expires
All 100,000 in-flight requests see cache miss in the same instant
All 100,000 try to fetch from DB simultaneously
DB sized for ~1,000 cache-miss queries/sec now gets 100,000
→ DB CPU pegs at 100%, queries timeout
→ App servers all blocking on DB connection waits
→ Site goes down from read traffic alone

Key risk: Unlike Scenario 1 (Redis crashes), the cache is running fine. All other keys are served normally. Only this one expired key triggers the cascade. The blast radius is proportional to how popular the key is and how expensive the rebuild is.

Cascade

  1. Popular cache key expires — all concurrent readers see cache miss simultaneously
  2. All readers race to rebuild — each queries DB independently
  3. DB receives orders-of-magnitude more load than its steady-state cache-miss rate
  4. DB CPU pegs, queries timeout, connection pool exhausts
  5. App servers hang waiting for DB connections — site degrades

At peak (10× traffic)

The problem is worst exactly when you can least afford it — viral content at peak hours has the highest concurrent readers when its TTL expires.

Prevention

  • Probabilistic early refresh: As a cache entry ages, each request has a small increasing probability of triggering a background refresh. At 50 min into a 60-min TTL: ~1% chance. At 59 min: ~20%. By expiry, the entry has already been quietly refreshed. No stampede ever occurs. This is the cleanest general solution.
  • Background refresh process: A dedicated job refreshes critical cache entries before they expire (e.g. refresh every 50 min on a 60-min TTL). Guarantees entries never go cold. Best for your highest-traffic keys where even probabilistic refresh feels risky.
  • Request coalescing: When multiple requests see the same cache miss, only one fetches from the DB; the rest wait for that result. DB receives at most N requests (one per app server), not millions. Doesn't prevent the miss but limits the rebuild blast radius.
  • TTL jitter: Add random(0, 30s) to TTLs so entries for different objects don't expire simultaneously. Prevents coordinated stampedes across many keys at once. Simple and effective as a baseline.
  • Versioned cache keys with background refresh: Don't expire the old key — write a new versioned key before the old one goes stale. Old readers continue serving the previous version; new readers get the fresh version. Zero gap.
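
Probabilistic early refresh fits in a few lines. This sketch uses a simple linear ramp over the last stretch of the TTL (an illustrative assumption — production implementations often use the exponential "XFetch" formulation, which weights by rebuild cost):

```python
import random

def should_refresh_early(age_seconds, ttl_seconds, ramp_start=0.8,
                         rng=random.random):
    """Decide whether this request should trigger a background rebuild.
    No refreshes until the entry has lived `ramp_start` of its TTL, then
    the probability ramps linearly to 1 at expiry — so the entry is almost
    always refreshed quietly before any reader sees a hard miss."""
    if age_seconds >= ttl_seconds:
        return True                          # expired — must rebuild
    fraction = age_seconds / ttl_seconds
    if fraction < ramp_start:
        return False
    probability = (fraction - ramp_start) / (1.0 - ramp_start)
    return rng() < probability
```

Each request that draws "yes" kicks off one background refresh while still serving the current (slightly stale) value, so the expiry cliff never materializes.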

Recovery

Unlike most failures, a cache stampede recovers automatically once the DB survives the initial burst and the cache is repopulated:

  1. First request to rebuild successfully populates the cache — all subsequent requests hit cache normally.
  2. If the DB is overwhelmed and queries are timing out: circuit breaker trips, returns degraded response (empty feed, cached-but-stale data from L1 in-process cache if available).
  3. Once circuit opens and DB gets breathing room, it recovers. Circuit half-opens, probes, closes on success.
  4. Root fix: add probabilistic refresh or background refresh for the affected key. Don't just restart and hope — the next TTL expiry will cause the same stampede.

Interview takeaway

  • Stampede is self-inflicted — the cache is healthy; one TTL expiry causes a DB DDoS proportional to key popularity
  • Probabilistic early refresh is the cleanest general solution — keys are quietly refreshed before expiry, stampede never occurs
  • Request coalescing limits rebuild blast radius to N requests (one per app server), not millions — doesn't prevent the miss but controls the damage
  • TTL jitter is the cheapest baseline protection — prevents coordinated mass expiry across many keys simultaneously

Scenario 17: Virtual Queue System Fails During High-Demand Event

The Redis-backed virtual queue crashes or becomes unavailable mid-event. Users lose their queue position, SSE connections drop, and the Booking Service loses its admission gate — potentially allowing uncontrolled direct access to the booking page.

Most relevant for: Ticketmaster-style high-demand events, flash sales, limited-inventory releases.

System: Users → Queue Service (Redis ZSET) → SSE → Booking Service

What breaks

Failure flow (Taylor Swift on-sale):

Taylor Swift on-sale: 500,000 users in queue, Redis queue cluster goes down

Immediate:
→ SSE connections drop — all users lose queue position updates
→ ZRANK queries fail — queue positions unknown
→ SADD admitted:{eventId} fails — no new users can be admitted
→ SISMEMBER admitted:{eventId} fails — Booking Service can't verify admission

Two bad outcomes depending on how Booking Service handles the Redis failure:
  A) Fail-closed: reject all bookings (no one can buy tickets)
  B) Fail-open: admit everyone (500K users hit booking page simultaneously → system overload)

Key risk — fail-open: If Booking Service defaults to allowing all requests when Redis is unavailable, the entire load that the queue was protecting against hits the backend at once. Queue failure at peak = instant full load.

Cascade

  1. Redis queue cluster goes down — SSE connections drop
  2. 500K users refresh or reconnect simultaneously — thundering herd on Queue Service
  3. Booking Service can't verify admission — must choose fail-closed or fail-open
  4. Fail-open: 500K users hit Booking Service at once — inventory DB overwhelmed
  5. Fail-closed: all bookings blocked for duration of outage — revenue loss

At peak (10× traffic)

Queue failures are most likely exactly when the queue matters most — Redis under maximum connection and memory pressure during a massive on-sale event.

Prevention

  • Redis Cluster for the queue: Run the queue on a Redis Cluster (3+ nodes). Single Redis node for a system this critical is a SPOF. Cluster tolerates node failures without losing queue state.
  • Fail-closed by default, not fail-open: If the Booking Service cannot reach Redis to verify admission, reject the request with a 503 and a "queue is temporarily unavailable, please wait" message. Never default to admitting everyone on Redis failure.
  • Persist queue position to a secondary store: Periodically snapshot queue state (user positions) to DB. On Redis recovery, restore from snapshot. Users lose at most N seconds of queue advancement, not their entire position.
  • SSE client auto-reconnect: Browsers reconnect SSE automatically. On reconnect, the server should restore the user's position from their session/cookie + Redis ZRANK query. From the user's perspective: brief disconnection, then position restored.
  • Graceful degradation mode: If the queue cluster is fully down, switch to a simpler rate-limiting mode: cap the booking page at N concurrent sessions using a token bucket or semaphore, returning 429 for excess traffic.
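
The fail-closed decision is a small amount of code, which is exactly why it is worth writing deliberately (a sketch; `redis_sismember` stands in for a real client call against `admitted:{eventId}`, and the status texts are illustrative):

```python
class RedisUnavailable(Exception):
    """Raised by the Redis client stand-in when the cluster is unreachable."""

def check_admission(user_id, event_id, redis_sismember):
    """Fail-closed admission gate. If Redis cannot be reached, reject with
    503 — never default to admitting everyone on a Redis failure."""
    try:
        admitted = redis_sismember(f"admitted:{event_id}", user_id)
    except RedisUnavailable:
        return (503, "queue is temporarily unavailable, please wait")
    if not admitted:
        return (403, "not yet admitted — please wait in the queue")
    return (200, "admitted")
```

The structural point: the `except` branch is where fail-open bugs live — a bare `except: admitted = True` here is what turns a queue outage into 500K simultaneous bookings.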

Recovery

  1. Redis Cluster promotes a replica automatically — queue state survives if using Cluster with replicas.
  2. SSE clients reconnect and receive their restored position. Users see a brief disruption, not a lost place.
  3. If queue state was lost entirely: restore from DB snapshot. Announce the disruption ("we're restoring your place in line") — transparency reduces frustration.
  4. Resume admitting users at the normal dequeue rate. Don't burst-admit a backlog — that defeats the purpose of the queue.

Interview takeaway

  • Fail-closed on Redis failure is mandatory — fail-open admits 500K users simultaneously, which is worse than having no queue at all
  • Redis Cluster (not single node) for queue storage — a virtual queue is a SPOF if its backing store is a SPOF
  • Queue position must be snapshotted to DB periodically — otherwise Redis failure means every user loses their place
  • Graceful degradation to token-bucket rate limiting provides a simpler admission gate when the full queue cluster is down

Scenario 18: Stale Location Data — Ghost Drivers

The stale driver cleanup job fails or runs too slowly. Offline drivers remain in the Redis geo set. Riders get matched to drivers who are no longer active — they accept a ride, wait, and the driver never arrives.

Most relevant for: Ride-sharing (Uber, Lyft), food delivery, any system matching users to nearby moving entities.

System: Driver Apps → Location Service → Redis Geo Set → Matching Service

What breaks

Failure flow:

Driver goes offline (app killed, phone dies, shift ends)
Cleanup job fails (Redis timeout, deployment issue, backlog)
→ driver:42 remains in drivers:nyc geo set with last known position
→ rider requests match → GEOSEARCH returns driver:42 as nearest
→ rider accepts → match request sent → no response → timeout
→ rider re-requests → potentially matched to same ghost again
→ user experience: waiting 2+ minutes for a driver who never comes

Cascade

  1. Cleanup job fails — offline drivers accumulate in geo set
  2. Match quality degrades city-wide — GEOSEARCH returns a mix of active and ghost drivers
  3. Match success rate drops — riders waiting, retry traffic surges
  4. Matching service starts thrashing on failed matches
  5. Each failed match takes the full timeout (10+ seconds) — overall match latency spikes

At peak (10× traffic)

Ghost drivers are hardest to detect during high-volume periods when the Location Service is under load and update timestamps are delayed. Rush hour failure compounds with ghost accumulation.

Prevention

  • Companion timestamp set: ZADD driver_timestamps <unix_timestamp> "driver:42" on every location update. Cleanup job uses ZRANGEBYSCORE driver_timestamps 0 <now-30s> to identify stale drivers. Without this, you have no efficient way to find stale entries.
  • Short cleanup interval: Run cleanup every 10–15 seconds, not every minute. Stale data window stays small.
  • Driver heartbeat separate from GPS update: Even if a driver is stationary, send a heartbeat every 5 seconds to refresh the timestamp. Prevents stationary drivers being incorrectly evicted.
  • Monitor geo set size: Alert if drivers:nyc member count grows faster than expected. A stuck cleanup job shows up as unbounded growth.
  • Match confirmation timeout: If a matched driver doesn't respond within 10 seconds, mark them inactive and re-match. Last line of defence against ghost matches reaching the rider.
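
The companion-timestamp pattern from the first bullet, sketched with in-memory dicts standing in for the `drivers:nyc` geo set and the `driver_timestamps` ZSET:

```python
geo_set = {}     # driver_id -> (lon, lat): stand-in for GEOADD drivers:nyc
timestamps = {}  # driver_id -> last update time: stand-in for the ZSET

def record_update(driver_id, lon, lat, now):
    """On every location update or heartbeat, refresh both structures —
    the equivalent of GEOADD + ZADD in one pipeline."""
    geo_set[driver_id] = (lon, lat)
    timestamps[driver_id] = now

def cleanup_stale(now, max_age=30):
    """Sweep the equivalent of ZRANGEBYSCORE driver_timestamps 0 <now-30s>
    and remove stale drivers from both structures."""
    cutoff = now - max_age
    stale = [d for d, ts in timestamps.items() if ts <= cutoff]
    for d in stale:
        geo_set.pop(d, None)
        timestamps.pop(d, None)
    return stale
```

The invariant to preserve: every write path that touches the geo set must also touch the timestamp set, or the cleanup job goes blind to those drivers.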

Recovery

  1. Fix and restart the cleanup job. Run an immediate full sweep: ZRANGEBYSCORE driver_timestamps 0 <now-30s> → remove all stale entries from both sets.
  2. Monitor geo set size returning to expected levels.
  3. Audit recent matches for ghost driver patterns — affected rides may need compensation.

Interview takeaway

  • Companion timestamp set (ZADD with unix timestamp on every update) is what makes stale detection efficient — without it, there's no way to efficiently find old entries in a geo set
  • Match confirmation timeout (10s no-response → re-match) is the last line of defence — prevents ghost matches from reaching the rider even when cleanup lags
  • Monitor geo set size — a stuck cleanup job shows up as unbounded growth before riders start complaining
  • Stationary driver heartbeat (every 5s) prevents incorrectly evicting parked/waiting drivers from the geo set

Scenario 19: Location Write Storm

Rush hour begins, a major event ends, or a city-wide incident causes thousands of drivers to start moving simultaneously. Location update volume spikes 5–10× baseline in seconds. The Location Service is overwhelmed — updates queue up, Redis writes lag, and driver positions shown to riders become stale.

Most relevant for: Ride-sharing, food delivery, any system tracking millions of moving entities.

System: Driver Apps → Location Service → Kafka → Redis Geo Set

What breaks

Traffic comparison:

Normal:      500K location updates/sec → Location Service → Redis
Rush hour:   2.5M location updates/sec → Location Service overwhelmed

Failure flow:

Location Service:
→ threads all busy, request queue grows
→ Redis connection pool exhausts
→ writes start timing out
→ GEOADD failures → driver positions not updated
→ proximity search returns positions that are 30–60 seconds stale
→ riders dispatched to drivers who have moved significantly
→ mismatch between app-shown driver location and actual position

Cascade

  1. Write volume spikes — Location Service threads saturate
  2. Redis connection pool exhausts — GEOADD writes start timing out
  3. Driver positions freeze — geo set reflects where drivers were, not where they are
  4. Matching service dispatches riders to wrong locations — match quality collapses
  5. Failed or slow matches generate retry traffic — compounds the overload

At peak (10× traffic)

The spike is sharpest at the start of rush hour or immediately after a large event ends — a near-instant step function, not a gradual ramp. Auto-scaling can't react fast enough.

Prevention

  • Kafka as write buffer: Driver apps → Location Service → Kafka topic. Consumer batches and coalesces by driver ID before writing to Redis. Kafka absorbs the spike; Redis write rate stays controlled.
  • Write coalescing: Deduplicate by driver ID in the consumer, keeping only the latest position per flush interval (100ms). A 5× traffic spike with 80% coalescing lands Redis back at roughly its baseline write rate (5 × 0.2 = 1×).
  • Multi-member GEOADD + pipelining: Batch multiple driver updates into a single Redis command. Reduces round trips proportionally.
  • Region sharding: drivers:nyc, drivers:la, etc. Rush hour in NYC doesn't spike drivers:la. Each shard has its own write capacity.
  • Graceful degradation: If Redis write lag exceeds threshold, increase the client-side batch interval from 1 second to 3 seconds. Reduces incoming update volume at the cost of slightly less precise positioning.
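
Write coalescing is simple to implement: within one flush interval, a dict keyed by driver ID keeps only the newest position (a sketch; `geoadd_batch` stands in for a pipelined, multi-member GEOADD call):

```python
def coalesce(updates):
    """Keep only the latest position per driver within a flush interval —
    each dict insert overwrites earlier updates for the same driver ID.
    `updates` is assumed to arrive in time order."""
    latest = {}
    for driver_id, lon, lat in updates:
        latest[driver_id] = (lon, lat)
    return latest

def flush_to_redis(latest, geoadd_batch):
    """Write the coalesced batch in one multi-member call; returns the
    number of Redis writes actually issued."""
    geoadd_batch([(d, lon, lat) for d, (lon, lat) in latest.items()])
    return len(latest)
```

Intermediate positions are dropped by design — for proximity search, only the latest position per driver matters, so nothing of value is lost.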

Recovery

  1. Kafka consumer lag tells you how far behind writes are — monitor this as the primary indicator of write storm severity.
  2. Scale Location Service consumers horizontally — add more consumer instances to drain the Kafka backlog faster.
  3. Once backlog clears, Redis geo set reflects current positions again.
  4. Post-mortem: pre-provision extra capacity before known high-demand periods (New Year's Eve, major sporting events).

Interview takeaway

  • Kafka as write buffer decouples driver app throughput from Redis write capacity — the spike hits Kafka, Redis sees controlled load
  • Write coalescing by driver ID (keep only latest per flush interval) converts a 5× traffic spike into ~1× Redis write increase
  • Region sharding (drivers:nyc, drivers:la) isolates rush-hour spikes geographically — one city's storm doesn't affect others
  • Kafka consumer lag is the primary indicator of write storm severity — monitor before Redis starts missing writes

Scenario 20: Upload Processing Pipeline Backs Up

Files are uploaded successfully but workers can't keep up — objects pile up in the raw bucket, never visible to users, and there's no immediate error signal.

Most relevant for: Photo sharing, document storage, any system with a post-upload processing pipeline (Dropbox, Instagram, Google Drive).

System: Upload Service → Raw Object Storage → SQS Queue → Processing Workers → Processed Bucket → CDN

What breaks

Upload succeeds — the object lands in the raw bucket — but processing workers fall behind. The file is stuck between upload and serving: not visible to users, not in the processed bucket, not behind CDN. From the user's perspective, their upload appears to have silently failed. The SQS queue depth grows. If workers crash or back off on errors, the backlog compounds.

Causes:

  • Sudden spike in uploads (viral share, batch import, new feature launch)
  • Worker crash or deployment that takes workers offline
  • One or more files that are expensive to process (unusually large, a corrupt format that sends the parser into a loop, heavy ML moderation load)
  • DLQ not configured — failed messages requeue indefinitely and block healthy ones

Cascade

  1. Upload spike arrives — queue depth grows faster than workers drain it
  2. Workers hit a corrupt or expensive file — crash or spin on retries
  3. Without a DLQ, failed messages block healthy ones — healthy uploads also delayed
  4. Users see no feedback — assume upload failed — re-upload — compounds the queue
  5. Queue depth grows unbounded — backlog may take hours to drain

At peak (10× traffic)

Workers are CPU-bound — they can't process faster, only more workers help. Auto-scaling on CPU reacts too late: workers only peg CPU once they're already saturated, while queue depth starts growing the moment arrivals outpace the drain rate.

Prevention

  • Alert on queue depth: Set a CloudWatch alarm when the SQS queue depth exceeds a threshold (e.g. 1,000 messages or 15 minutes of backlog at normal throughput). Page on-call before users notice.
  • Auto-scale workers on queue depth: Processing workers are stateless and embarrassingly parallel — each file processes independently. Use queue depth as the auto-scaling signal, not CPU.
  • Configure a Dead Letter Queue (DLQ): Messages that fail after N retries (e.g. 3) move to the DLQ instead of looping. Prevents one corrupt file from blocking the queue and lets healthy files process normally.
  • Make every processing step idempotent: Virus scan → EXIF strip → resize → moderation. If a worker crashes mid-pipeline and the message requeues, re-running from the start must be safe. Workers should check if a step is already done before re-executing.
  • Show "processing" state to users: Update FileMetadata status to processing after upload and show a progress indicator. Set the expectation that the file will be available shortly.
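The idempotency requirement can be sketched as a worker loop that records a completion marker per step — hypothetical names (`STEPS`, `process`, `run_step`); in practice the markers would live on the FileMetadata record, persisted before the message is acked, and the DLQ itself is SQS redrive-policy configuration (`maxReceiveCount`), not worker code:

```python
STEPS = ["virus_scan", "exif_strip", "resize", "moderate"]

def process(file_id, completed, run_step):
    """Run the pipeline, skipping steps already marked complete.

    `completed` is the set of finished step names for this file;
    `run_step(file_id, step)` does the actual work. Safe to call again
    after a mid-pipeline crash and requeue: finished steps are skipped,
    so no step's side effects are applied twice.
    """
    for step in STEPS:
        if step in completed:
            continue  # done on a previous attempt — skip, don't redo
        run_step(file_id, step)
        completed.add(step)  # persist this marker before ack in real code
    return completed
```

The per-step markers also make partial failure diagnosable: a file stuck in the DLQ shows exactly which step it died on.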

Recovery

Scale up worker instances — processing is CPU-bound and scales horizontally. The queue drains as workers catch up. No data loss — objects are durably stored in the raw bucket. Files in the DLQ require manual inspection: fix the root cause (corrupt file, worker bug), then replay from the DLQ.

Interview takeaway

  • Auto-scale workers on queue depth, not CPU — processing is embarrassingly parallel; queue depth is the leading indicator, CPU is lagging
  • DLQ is non-negotiable — one corrupt file can block all healthy uploads indefinitely without it
  • All processing steps must be idempotent — crashed mid-pipeline workers requeue and restart from the beginning; re-running must be safe
  • No data loss risk — source object in raw bucket is the ground truth; a backed-up queue is a latency problem, not a durability problem

Scenario 21: Load Shedding Triggers a Retry Storm

The system starts shedding load to protect itself — but clients aren't told to back off, so they retry immediately. The 503s generate more traffic than the original spike.

Most relevant for: Any high-traffic service with load shedding or rate limiting (e-commerce checkout, payment APIs, booking systems).

System: Clients → API Gateway → App Servers (load shedding active)

What breaks

Load shedding fires correctly — the system is at capacity and starts returning 503s to P1/P2 requests. But the 503 responses carry no Retry-After header, so every client retries immediately: exponential backoff with a zero initial delay, or aggressive library defaults. The result: each shed request generates 2–5 retry attempts within the next few seconds. The system was at 100% capacity; it is now at 200–500%.

Causes:

  • 503 responses missing Retry-After header
  • Client retry logic that doesn't respect Retry-After even when present
  • No jitter in client backoff — all clients retry at the same moment
  • Multiple layers of clients (mobile app → BFF → internal service) each retrying independently, multiplying the effect

Cascade

  1. System hits capacity — load shedding activates, 503s go out
  2. Clients see 503, no Retry-After — retry immediately
  3. Retry wave arrives — system is still at 100%, now seeing 300% traffic
  4. More 503s go out — more retries — load spirals upward
  5. System cannot recover because load never drops — circuit breakers may trip
  6. Eventually OOM or thread exhaustion — what load shedding was meant to prevent

Prevention

  • Always include Retry-After on every 503: The single most important control. Tells clients exactly when to retry. Staggers retries across the window instead of concentrating them.
    HTTP 503 Service Unavailable
    Retry-After: 10
  • Add jitter to client backoff: Even with Retry-After, if all clients received the 503 at the same time, they'll all retry at now + 10s. Add ±2–3s of random jitter to spread the retry window.
  • Shed at the outermost layer first: API Gateway or load balancer sheds before requests reach app servers. Clients get 503 early — before internal retries between services can amplify the load.
  • Define client retry budgets: Mobile clients should retry at most 2–3 times with backoff. Internal services should respect Retry-After and not retry on 503 at all — bubble the error up instead.
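A client-side sketch tying the last three controls together — hypothetical names (`retry_delay`, `should_retry`, `MAX_RETRIES`); the shape (honor Retry-After, else capped exponential backoff, always with jitter, always within a budget) is the point, not the specific constants:

```python
import random

MAX_RETRIES = 3  # client retry budget

def retry_delay(attempt, retry_after=None, base=1.0, cap=30.0, jitter=3.0):
    """Seconds to wait before the next retry.

    Honors the server's Retry-After when present, otherwise uses capped
    exponential backoff. Either way, +/- `jitter` seconds of randomness
    spreads synchronized clients across the retry window instead of
    concentrating them at the same instant.
    """
    if retry_after is not None:
        delay = float(retry_after)
    else:
        delay = min(cap, base * (2 ** attempt))
    return max(0.0, delay + random.uniform(-jitter, jitter))

def should_retry(attempt, status):
    """Only 503s are retryable, and only within the budget."""
    return status == 503 and attempt < MAX_RETRIES
```

An internal service calling this code would instead set `MAX_RETRIES = 0` for 503s and bubble the error up, per the retry-budget rule above.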

Recovery

Once Retry-After is in place and clients respect it, the system gets a quiet window of N seconds. During that window: load drops below capacity, load shedding deactivates, queue drains. When retries resume they hit a healthy system. Without that window, recovery is impossible — the system never gets below capacity.

Interview takeaway

  • Retry-After is not optional on any 503 — it's the mechanism that makes load shedding self-healing instead of self-defeating
  • Retry amplification compounds at every layer — a 3× retry multiplier at mobile, 2× at BFF, 2× at internal service = 12× the original load
  • Shed at the outermost layer to minimize internal amplification
  • Jitter on client backoff is as important as the backoff itself — synchronized retries are a thundering herd