Redis Study Guide
Covers: Data Types, Caching Patterns & Invalidation, Cache Stampede, Sorted Sets, Pub/Sub, Geo, Distributed Locks, Persistence, Replication & Scaling
Redis is a single-threaded in-memory data store. Commands execute one at a time, in order — no lock contention, and operations like INCR are atomic by default. This is what makes Redis safe for distributed coordination without explicit synchronization. Redis 6+ added I/O threads for network reads/writes, but command execution remains sequential. The bottleneck is almost never the thread — it's network round trips, which pipelining solves.
1. Data Types — The Swiss Army Knife
| Type | Commands | Use For |
|---|---|---|
| String | SET, GET, INCR, SETNX | Counters, session tokens, feature flags, distributed locks |
| Hash | HSET, HGET, HMGET | User profiles, structured objects — one key, many fields |
| List | LPUSH, RPUSH, LRANGE, BLPOP | Message queues, activity feeds, task queues |
| Set | SADD, SMEMBERS, SINTERSTORE | Unique visitors, tag sets, friend relationships |
| Sorted Set | ZADD, ZRANGE, ZRANK, ZREVRANGE | Leaderboards, ranked feeds, priority queues |
| HyperLogLog | PFADD, PFCOUNT | Approximate unique count (~0.81% error) — DAU, unique page views |
| Geo | GEOADD, GEOSEARCH, GEODIST | Proximity search, "find nearest driver" |
| Streams | XADD, XREAD, XGROUP | Persistent log with consumer groups — Kafka-lite |
| Pub/Sub | PUBLISH, SUBSCRIBE | Fire-and-forget fan-out to all current subscribers |
2. Caching Patterns
Pick One Per Use Case
| Pattern | How It Works | Use When |
|---|---|---|
| Cache-aside (lazy) | App checks cache first. On miss, reads DB, writes to cache. | General reads — cache only what's actually requested |
| Write-through | On every write, app writes to cache AND DB together. | Read-heavy data that must stay fresh (user profiles) |
| Write-behind (write-back) | App writes to cache only. Cache flushes to DB asynchronously. | Write-heavy. Risk: data loss if cache crashes before flush |
| Read-through | Cache sits in front of DB. Cache fetches from DB on miss automatically. | Simplifies app code — cache layer handles DB reads |
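Cache-aside is the pattern most worth being able to write from memory. A minimal sketch, with a plain dict standing in for Redis and a hypothetical `load_user_from_db` loader:

```python
import time

# Stand-ins for Redis and the database (hypothetical names, not a real client).
cache = {}  # key -> (value, expires_at)
db = {"user:42": {"name": "Ada"}}

TTL_SECONDS = 300  # ~5 min, per the TTL tiering table below

def load_user_from_db(key):
    return db.get(key)

def get_user(key):
    """Cache-aside: check cache first; on miss, read DB and populate cache."""
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.time() < expires_at:
            return value          # cache hit
        del cache[key]            # expired entry, fall through to DB
    value = load_user_from_db(key)  # cache miss: read DB
    if value is not None:
        cache[key] = (value, time.time() + TTL_SECONDS)
    return value

print(get_user("user:42"))  # miss: loads from DB, populates cache
print(get_user("user:42"))  # hit: served from cache
```

In real Redis the same flow is GET, then on miss a DB read followed by SET key value EX 300.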
Eviction Policies
- LRU (Least Recently Used) — evict the item not accessed for the longest time. Default for most use cases.
- LFU (Least Frequently Used) — evict the item accessed least often. Better for skewed access patterns.
- TTL (Time To Live) — expire entries after a fixed time. Good for data that goes stale.
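LRU is worth sketching once to internalize. A toy version using Python's OrderedDict (illustrative only; Redis itself implements an approximate LRU, configured via maxmemory-policy allkeys-lru):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry when capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

c = LRUCache(2)
c.put("a", 1)
c.put("b", 2)
c.get("a")         # touch "a" so "b" becomes the LRU entry
c.put("c", 3)      # over capacity: evicts "b"
print(c.get("b"))  # None
```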
TTL Tiering — Driven by Your NFRs
Don't pick arbitrary TTLs — derive them from consistency requirements:
| Data Type | TTL | Rationale |
|---|---|---|
| Static data (venue info, product descriptions) | Hours | Rarely changes |
| User profiles | ~5 min | Changes infrequently but must stay reasonably fresh |
| Volatile data (seat availability, inventory counts) | Seconds | Stale data causes real harm |
| Search results | Match your stated SLA | If NFR says "30s stale OK," that's your TTL |
When NOT to Cache
- Real-time collaborative systems — Google Docs, multiplayer games. Every change must be immediately visible. Caching actively hurts.
- Strongly consistent financial data — account balances, inventory when overselling is catastrophic.
- Write-heavy systems — if read/write ratio is ~1:1 or 2:1, hit rates are too low to justify the complexity.
- User-specific private data — private messages, personal preferences. Only one user ever requests them — no cache hit rate benefit.
3. Cache Invalidation
Most production systems combine approaches — short TTL as a safety net plus active invalidation for critical data.
| Strategy | How It Works | Best For |
|---|---|---|
| TTL expiry | Entry expires automatically after a fixed time. No coordination needed. | Data that can tolerate some staleness |
| Write-through | App deletes or updates cache immediately when it writes to DB, in the same code path. | User profiles, inventory counts — must stay fresh |
| Event-driven | CDC (Debezium) reads DB WAL, publishes change events to a queue, cache invalidation consumer acts on them. Better than DB triggers — no hidden coupling. | Cross-service invalidation, complex dependency graphs |
| Tagged invalidation | Cache entries tagged (user:123:posts). On update, invalidate all entries sharing that tag. | Post update that should clear feed, search, and profile caches |
| Versioned keys | Include version in key (event:123:v42). On update, increment version → new key (event:123:v43). Old entries unreachable, expire via TTL. No race conditions. | Entity-level data needing immediate consistency without invalidation broadcasts |
| Deleted items cache | Small cache of recently deleted/modified IDs. Filter against it when serving lists or feeds. | Content moderation, soft deletes, privacy changes |
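The versioned-keys row is the least obvious, so here is a minimal sketch. Dicts stand in for Redis; in real Redis the version counter would be an INCR on its own key. Key names follow the event:{id}:v{n} pattern from the table:

```python
cache = {}     # versioned key -> cached value
versions = {}  # entity id -> current version (INCR on a counter key in real Redis)

def cache_key(entity_id):
    v = versions.get(entity_id, 1)
    return f"event:{entity_id}:v{v}"

def write_entity(entity_id, value):
    """On update, bump the version: readers immediately resolve a new key."""
    versions[entity_id] = versions.get(entity_id, 1) + 1
    cache[cache_key(entity_id)] = value

def read_entity(entity_id):
    return cache.get(cache_key(entity_id))

cache[cache_key(123)] = "old listing"   # v1 cached
write_entity(123, "updated listing")    # bumps to v2, caches new value
print(read_entity(123))                 # "updated listing"; v1 is now unreachable
```

The stale v1 entry is never deleted, just orphaned; its TTL reclaims the memory. That is why this pattern needs no invalidation broadcast.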
4. Cache Stampede (Thundering Herd on Expiry)
When a popular cache entry expires, all concurrent requests see a cache miss simultaneously and race to rebuild it from the DB. At 100K reads/sec, that's 100K identical DB queries in the same instant — the DB collapses under a self-inflicted load spike.
Three mitigations:
1. Probabilistic early refresh — as the entry ages, each request has a small increasing probability of triggering a background refresh. At 50 min into a 60-min TTL: 1% chance. At 59 min: 20%. By expiry, the entry is already refreshed. No stampede ever occurs. Cleanest solution for most cases.
2. Background refresh process — a dedicated job refreshes critical entries before TTL expires. Guarantees the entry never goes cold. Cost: wasted refreshes for unrequested keys.
3. Request coalescing — when multiple requests see the same miss, only one fetches from DB; the rest wait for that result.
Cache miss for "celebrity:123"
→ Server 1: acquires in-flight lock, fetches from DB, populates cache
→ Server 2: sees in-flight lock, waits for Server 1's result
→ DB hit exactly once per app server — not once per user
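Mitigation 1 is usually implemented as "XFetch" (probabilistic early expiration). A sketch with illustrative numbers; `delta` is how long a recompute takes and `beta` tunes aggressiveness (these parameter names follow the common description of the algorithm, not a Redis API):

```python
import math, random

def should_refresh_early(now, expiry, delta, beta=1.0):
    """XFetch-style check: refresh probability rises sharply near expiry.

    Returns True when this request should rebuild the entry before it expires.
    """
    r = max(random.random(), 1e-12)          # avoid log(0)
    return now - delta * beta * math.log(r) >= expiry

random.seed(0)
expiry = 3600           # entry expires at t = 3600 (60-min TTL)
recompute_time = 30     # rebuilding the entry takes ~30s
early = sum(should_refresh_early(600, expiry, recompute_time) for _ in range(10000))
late = sum(should_refresh_early(3550, expiry, recompute_time) for _ in range(10000))
print(early, late)      # refreshes are vanishingly rare early, common near expiry
```

Because some request almost certainly triggers the rebuild in the final minute, the entry is replaced before the hard expiry and the herd never forms.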
Hot key fanout — for extreme loads: store copies under multiple keys (feed:taylor-swift:1 … feed:taylor-swift:10). Readers randomly pick one. Spreads 500K req/sec across 10 keys. Tradeoff: memory × N copies, invalidation must clear all.
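The fanout bookkeeping is trivial but easy to get wrong on the invalidation side. A sketch (key names follow the feed:taylor-swift example above):

```python
import random

N_COPIES = 10  # replica keys per hot key (tradeoff: memory x 10)

def replica_keys(base):
    """All replica keys for a hot key. Writes and invalidation touch every one."""
    return [f"{base}:{i}" for i in range(1, N_COPIES + 1)]

def read_key(base):
    """Readers pick one copy at random, spreading load across the keys."""
    return f"{base}:{random.randint(1, N_COPIES)}"

print(replica_keys("feed:taylor-swift")[:2])  # ['feed:taylor-swift:1', 'feed:taylor-swift:2']
print(read_key("feed:taylor-swift"))          # e.g. 'feed:taylor-swift:7'
```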
5. Sorted Sets — Leaderboards and Rankings
A set where every member has a numeric score, kept ordered by score. Inserts and rank lookups are O(log n); range queries are O(log n + m) for m returned members.
Commands:
- ZADD leaderboard 9500 "user:42" — add or update member with score
- ZREVRANGE leaderboard 0 9 WITHSCORES — top 10 by score, descending
- ZRANK leaderboard "user:42" — rank of a member (0-indexed from lowest)
- ZREVRANK leaderboard "user:42" — rank from highest (0 = top)
- ZRANGEBYSCORE leaderboard 8000 +inf — all members with score ≥ 8000
Limits:
- Max members per key: ~4.3 billion
- Memory per member: ~50–100 bytes overhead + key + value size
- At 10M members: ~500MB–1GB RAM — comfortable on a typical instance
- Core operations O(log n); range reads add the count of returned members
Other uses beyond leaderboards:
- Priority queues — score is priority level; pop with ZPOPMIN
- Expiring job queues — score is execute-at timestamp; worker polls with ZRANGEBYSCORE jobs 0 <now>
- Rate limiting — score is request timestamp; count members in sliding window with ZCOUNT
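The sliding-window rate limiter is a common interview follow-up. A sketch where a per-user list of timestamps stands in for the sorted set (real Redis: ZREMRANGEBYSCORE to drop old entries, ZCARD/ZCOUNT to count, ZADD to record):

```python
import time

requests = {}  # user -> list of request timestamps (scores in the sorted set)

def allow(user, limit=5, window=60, now=None):
    """Allow the request if fewer than `limit` requests fall in the window."""
    now = now if now is not None else time.time()
    ts = [t for t in requests.get(user, []) if t > now - window]  # drop old entries
    if len(ts) >= limit:
        requests[user] = ts
        return False
    ts.append(now)
    requests[user] = ts
    return True

for i in range(5):
    allow("user:42", now=100 + i)      # 5 requests at t=100..104
print(allow("user:42", now=105))       # False: 5 already in the 60s window
print(allow("user:42", now=170))       # True: the window has slid past them
```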
Real example: Gaming leaderboard — score is points, member is user ID. Top 10 is a single ZREVRANGE call. Rank lookup for any user is ZREVRANK — O(log n) regardless of how many players.
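The leaderboard flow can be sketched in a few lines; a dict stands in for the sorted set, and the functions mirror ZADD / ZREVRANGE / ZREVRANK (sorting on read here is purely for clarity; Redis keeps a skiplist so these stay O(log n)):

```python
scores = {}  # member -> score

def zadd(member, score):
    scores[member] = score            # ZADD: add or update

def top_n(n):
    """ZREVRANGE 0 n-1 WITHSCORES: highest scores first."""
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]

def zrevrank(member):
    """ZREVRANK: 0 = top-ranked member."""
    ordered = sorted(scores, key=lambda m: -scores[m])
    return ordered.index(member)

zadd("user:42", 9500)
zadd("user:7", 12000)
zadd("user:99", 3100)
print(top_n(2))             # [('user:7', 12000), ('user:42', 9500)]
print(zrevrank("user:42"))  # 1
```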
6. Pub/Sub — Fan-out Notifications
Publisher sends to a channel. All current subscribers receive it instantly. Fire-and-forget — no persistence, no queue.
PUBLISH notifications:user:42 '{"type":"seat_reserved","seat":"A12"}'
SUBSCRIBE notifications:user:42 → receives the message instantly
Limitation: Messages are lost if no subscriber is listening at that moment. A subscriber joining late gets nothing from before they subscribed. Redis crash = all in-flight messages lost.
Limits:
- Throughput: millions of messages/sec — no disk I/O
- Channels: unlimited (memory bound — each active channel with subscribers uses RAM)
- Not suitable for > 10K active channels with high-volume subscribers without careful memory planning
Pub/Sub vs Kafka
| | Redis Pub/Sub | Kafka |
|---|---|---|
| Delivery guarantee | None — fire and forget | At-least-once, exactly-once |
| Message persistence | No | Yes (days/weeks) |
| Consumer lag | N/A — real-time only | Consumers can fall behind and catch up |
| Use when | Live presence, WebSocket fan-out across servers | Payment events, audit logs, anything you can't afford to lose |
Real example: Chat service — WebSocket servers subscribe to room:{roomId}. When a user sends a message, publish once → all servers with subscribers receive it and push to their connected clients. No need to know which server holds which connection.
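The fire-and-forget semantics can be sketched in-process; a callback registry stands in for channels (hypothetical names, not a Redis client):

```python
from collections import defaultdict

subscribers = defaultdict(list)  # channel -> list of callbacks

def subscribe(channel, callback):
    subscribers[channel].append(callback)

def publish(channel, message):
    """Deliver to all *current* subscribers; like PUBLISH, returns receiver count."""
    for cb in subscribers[channel]:
        cb(message)
    return len(subscribers[channel])

received = []
subscribe("room:42", received.append)
n = publish("room:42", "hello")      # delivered to 1 subscriber
missed = publish("room:99", "lost")  # no subscribers: message is simply gone
print(n, missed, received)           # 1 0 ['hello']
```

Note the second publish: the "lost" message is not queued anywhere, which is exactly the limitation called out above.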
7. Geo — Proximity Queries
Redis Geo is a sorted set where the score is a 52-bit geohash. Enables radius and bounding box queries in O(n + log m). Best solution for high-frequency location updates at scale.
Commands:
- GEOADD drivers -74.006 40.713 "driver:42" — add or update location (overwrites previous — always have latest)
- GEOSEARCH drivers FROMLONLAT -73.99 40.72 BYRADIUS 5 km ASC COUNT 20 — nearest 20 within 5km
- GEOSEARCH drivers FROMLONLAT -73.99 40.72 BYBOX 10 10 km ASC — within bounding box
- GEODIST drivers driver:42 driver:99 km — distance between two members
- GEOSEARCH drivers FROMMEMBER driver:42 BYRADIUS 5 km ASC — search from an existing member's position
GEOSEARCH (Redis 6.2+) replaces the older deprecated GEORADIUS and GEORADIUSBYMEMBER.
Precision levels:
| Geohash length | Cell size | Use for |
|---|---|---|
| 4 | ~20 km | City-level |
| 5 | ~2.4 km | Neighborhood |
| 6 | ~610 m | Street-level |
| 8 | ~19 m | Building-level |
Full ride-sharing pattern:
Driver update every 5 seconds:
→ GEOADD drivers <lng> <lat> "driver:42"
→ ZADD driver_timestamps <unix_ts> "driver:42" ← companion set to track last-seen
Rider requests match:
→ GEOSEARCH drivers FROMLONLAT <lng> <lat> BYRADIUS 5 km ASC COUNT 20
Stale driver cleanup (runs every 30s):
→ ZRANGEBYSCORE driver_timestamps 0 <now - 30s> ← drivers not updated in 30s
→ ZREM driver_timestamps <stale_ids>
→ ZREM drivers <stale_ids>
The companion driver_timestamps sorted set (score = last-seen timestamp) is essential — without it, offline drivers stay in the geo set forever and appear available to riders.
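The heartbeat-plus-cleanup pattern can be sketched with two dicts standing in for the geo set and the driver_timestamps sorted set (hypothetical function names; timestamps passed explicitly for clarity):

```python
import time

geo = {}        # driver id -> (lng, lat)           (the geo set)
last_seen = {}  # driver id -> unix timestamp       (the companion sorted set)

def report_location(driver_id, lng, lat, now=None):
    """Driver heartbeat: GEOADD + ZADD in the same write path."""
    geo[driver_id] = (lng, lat)
    last_seen[driver_id] = now if now is not None else time.time()

def evict_stale(now, max_age=30):
    """ZRANGEBYSCORE driver_timestamps 0 <now - 30s>, then ZREM from both sets."""
    stale = [d for d, ts in last_seen.items() if ts <= now - max_age]
    for d in stale:
        del last_seen[d]
        del geo[d]
    return stale

report_location("driver:42", -74.006, 40.713, now=100)
report_location("driver:99", -73.990, 40.720, now=125)
print(evict_stale(now=140))  # ['driver:42']: no heartbeat for 40s
print(sorted(geo))           # ['driver:99']
```

The key invariant: every write touches both structures, and every cleanup removes from both, or the two sets drift apart.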
Durability note: Driver locations are ephemeral — every driver re-reports within 5 seconds. A Redis crash means ~5s of stale data before the set rebuilds. State this tradeoff explicitly in an interview. Mitigate with RDB snapshots or Redis Sentinel if needed.
Write optimization patterns for high-frequency updates:
| Technique | How | When |
|---|---|---|
| Multi-member GEOADD | Batch multiple drivers in one command — one round trip for N updates | Always — zero tradeoff |
| Write coalescing | Deduplicate by driver ID before writing — if 3 updates arrive in 200ms, only the latest matters | Always — reduces Redis writes without staleness |
| Kafka as write buffer | Location Service → Kafka → consumer batches and flushes to Redis every 100ms | Rush hour spikes |
| Region sharding | Separate keys per city (drivers:nyc, drivers:la) on different nodes | Multi-city scale-out |
Limits:
- ~4.3 billion members per key (same as sorted sets — it's built on them)
- Coordinate precision: the 52-bit geohash resolves cells of roughly 0.6 m at the equator
- Radius query: O(n + log m) — fast up to millions of members
8. Distributed Locks
Prevents two workers from processing the same job simultaneously. In-memory locks only exist within one instance — SETNX moves the lock into a shared, atomic, external store.
Why not a DB transaction lock? DB transaction locks hold for the life of a transaction — designed for millisecond durations. Distributed locks hold for longer (e.g. seat reserved for 10 minutes while user pays). Using a DB transaction for that would block the row and kill throughput.
Option 1 — Redis SETNX (simple, fast)
SET lock:job_123 worker_A NX PX 30000
NX = only set if not exists. PX 30000 = expire in 30 seconds (auto-release if worker crashes).
- Fast (~1ms)
- Single Redis node = single point of failure
- Clock drift can theoretically cause two nodes to hold the lock simultaneously
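A sketch of acquire/release with an owner token, using a dict as the stand-in for Redis. The token check on release matters: without it, a worker whose lock already expired can delete a lock now held by someone else. In real Redis, acquire is SET key token NX PX ttl, and the compare-then-delete release must run as a Lua script so it is atomic:

```python
import time, uuid

locks = {}  # key -> (owner_token, expires_at)

def acquire(key, ttl_ms=30000):
    """SET key token NX PX ttl: succeed only if absent or expired."""
    now = time.time()
    entry = locks.get(key)
    if entry and entry[1] > now:
        return None                        # held by someone else
    token = str(uuid.uuid4())              # unique owner token
    locks[key] = (token, now + ttl_ms / 1000)
    return token

def release(key, token):
    """Delete only if we still own the lock (Lua script in real Redis)."""
    entry = locks.get(key)
    if entry and entry[0] == token:
        del locks[key]
        return True
    return False

t1 = acquire("lock:job_123")
print(acquire("lock:job_123"))                 # None: already held
print(release("lock:job_123", "wrong-token"))  # False: not the owner
print(release("lock:job_123", t1))             # True
```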
Option 2 — Redlock (multiple Redis nodes)
Acquire lock on a majority (3 of 5) Redis nodes within a time window. More robust than single-node but controversial — a process can pause (JVM GC) after acquiring the lock, TTL expires across all nodes, and another process acquires it. Both now believe they hold it.
The Redlock debate: Martin Kleppmann argued it's unsafe due to process pauses. Antirez (Redis creator) rebutted that extreme GC pauses are unlikely in practice. Never fully resolved. Practical takeaway: Redlock is fine when occasional double execution is survivable.
Option 3 — etcd (strongest guarantee)
Uses Raft — lock acquisition is linearizable. No clock drift issues. Slower (~5–10ms). Use when correctness is mandatory.
Fencing Tokens — the Correct Solution for Strong Safety
Every etcd write returns a monotonically increasing revision number. Pass this as a token to the protected resource. The resource rejects any request with a lower token than the last seen.
Process A acquires lock → gets token 42
Process A pauses (GC)
Lock expires → Process B acquires → token 43 → writes with token 43
Process A wakes up → tries to write with token 42 → resource rejects (42 < 43) ✓
Redis has no equivalent. If you need this level of correctness, use etcd.
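The resource-side check is tiny, which is part of the appeal. A sketch of the scenario above (class and method names are illustrative):

```python
class FencedResource:
    """Tracks the highest fencing token seen; rejects writes with lower tokens."""
    def __init__(self):
        self.highest_token = -1
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False            # stale lock holder: reject the write
        self.highest_token = token
        self.value = value
        return True

r = FencedResource()
print(r.write(42, "from A"))     # True: A holds the lock (token 42)
print(r.write(43, "from B"))     # True: lock expired, B acquired (token 43)
print(r.write(42, "A woke up"))  # False: A's token is stale, write rejected
```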
When to Use Each
| Option | Use When |
|---|---|
| Redis SETNX | Job deduplication, idempotency, rate limiting — OK if occasional double execution is survivable |
| Redlock | Higher confidence without etcd — workloads tolerant of rare edge-case failures |
| etcd lock | Leader election, financial writes, inventory — correctness is mandatory |
Real examples: Seat reservation (lock seat for 10 min while user pays), ride-share driver assignment (prevent matching one driver to two riders), distributed cron jobs (run on exactly one server).
9. Persistence
| Mode | How | Data Loss on Crash | Use When |
|---|---|---|---|
| No persistence | DB is source of truth. Redis crash = cache gone, app rebuilds from DB. | All in-flight data | Pure cache — simplest, fastest |
| RDB snapshots | Dumps memory to disk every N seconds. | Writes since last snapshot | Fast recovery on restart ("warm" cache) |
| AOF | Every write logged to disk. appendfsync everysec = ~1s loss. appendfsync always = near-zero. | ~1 second (default config) | Need durability without full sync overhead |
| Replicas | Primary handles writes, replicas handle reads. Promote replica on primary failure. | Milliseconds of replication lag | High availability + read scale-out |
AOF and RDB can be combined — when AOF is enabled, Redis restores from it on restart, since it is typically the more complete record.
10. Replication and Scaling
Redis Sentinel — High Availability
Monitors a primary + replica setup. Auto-promotes a replica when the primary fails (~seconds). All data lives on one primary (replicated to replicas). Read traffic distributes across replicas.
Use when: dataset fits on one node, you need HA without sharding complexity.
Redis Cluster — Horizontal Scale
Shards data across multiple primaries using 16,384 hash slots. Each shard has its own primary + replicas with built-in per-shard failover.
Use when: dataset too large for one node OR write throughput exceeds single-node capacity.
Sentinel vs Cluster
| | Redis Sentinel | Redis Cluster |
|---|---|---|
| Purpose | High availability — failover | Horizontal scale — sharding |
| Data distribution | All data on one primary | Split across multiple primaries |
| Failover | Sentinel promotes replica | Built-in, per shard |
| Complexity | Low | Higher — client must be cluster-aware |
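The hash-slot mapping mentioned above is easy to demonstrate: Redis Cluster computes CRC16 (XMODEM) of the key mod 16,384, and Python's stdlib binascii.crc_hqx computes the same CRC, so a sketch is possible without a cluster. The hash-tag rule (hash only the {...} part when present and non-empty) is what pins related keys to one shard so multi-key operations still work:

```python
import binascii

def key_slot(key: str) -> int:
    """Map a key to one of Redis Cluster's 16,384 hash slots."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:   # non-empty hash tag: hash only it
            key = key[start + 1:end]
    # CRC16/XMODEM mod 16384, as in the cluster spec
    return binascii.crc_hqx(key.encode(), 0) % 16384

# Keys sharing a hash tag always land in the same slot (and thus same shard)
print(key_slot("{user:1000}.following") == key_slot("{user:1000}.followers"))  # True
print(0 <= key_slot("drivers:nyc") < 16384)  # True
```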
Capacity Reference
| Metric | Value |
|---|---|
| Single instance RAM | ~10–64GB practical sweet spot (max ~100GB) |
| Throughput | ~100K–1M ops/sec |
| Max key or value size | 512MB |
| Max keys per instance | ~4 billion |
| Sorted set operations | O(log n) |
| Pub/Sub throughput | Millions of messages/sec |