
Redis Study Guide

Covers: Data Types, Caching Patterns & Invalidation, Cache Stampede, Sorted Sets, Pub/Sub, Geo, Distributed Locks, Persistence, Replication & Scaling

Redis is a single-threaded in-memory data store. Commands execute one at a time, in order — no lock contention, and operations like INCR are atomic by default. This is what makes Redis safe for distributed coordination without explicit synchronization. Redis 6+ added I/O threads for network reads/writes, but command execution remains sequential. The bottleneck is almost never the thread — it's network round trips, which pipelining solves.


1. Data Types — The Swiss Army Knife

Type Commands Use For
String SET, GET, INCR, SETNX Counters, session tokens, feature flags, distributed locks
Hash HSET, HGET, HMGET User profiles, structured objects — one key, many fields
List LPUSH, RPUSH, LRANGE, BLPOP Message queues, activity feeds, task queues
Set SADD, SMEMBERS, SINTERSTORE Unique visitors, tag sets, friend relationships
Sorted Set ZADD, ZRANGE, ZRANK, ZREVRANGE Leaderboards, ranked feeds, priority queues
HyperLogLog PFADD, PFCOUNT Approximate unique count (~0.81% error) — DAU, unique page views
Geo GEOADD, GEOSEARCH, GEODIST Proximity search, "find nearest driver"
Streams XADD, XREAD, XGROUP Persistent log with consumer groups — Kafka-lite
Pub/Sub PUBLISH, SUBSCRIBE Fire-and-forget fan-out to all current subscribers

2. Caching Patterns

Pick One Per Use Case

Pattern How It Works Use When
Cache-aside (lazy) App checks cache first. On miss, reads DB, writes to cache. General reads — cache only what's actually requested
Write-through On every write, app writes to cache AND DB together. Read-heavy data that must stay fresh (user profiles)
Write-behind (write-back) App writes to cache only. Cache flushes to DB asynchronously. Write-heavy. Risk: data loss if cache crashes before flush
Read-through Cache sits in front of DB. Cache fetches from DB on miss automatically. Simplifies app code — cache layer handles DB reads
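The cache-aside row can be sketched as follows. This is a minimal sketch: a plain dict stands in for Redis, `db_fetch_user` is a hypothetical loader, and the 5-minute TTL is an assumed value.

```python
import time

cache = {}          # stands in for Redis: key -> (value, expires_at)
TTL_SECONDS = 300   # assumed TTL for this sketch

def db_fetch_user(user_id):
    # Hypothetical DB read; replace with a real query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside: check the cache first; on a miss, read the DB and populate."""
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                                   # cache hit
    value = db_fetch_user(user_id)                        # cache miss: read DB
    cache[key] = (value, time.time() + TTL_SECONDS)       # write result to cache
    return value
```

Note that the app owns both steps of the miss path, which is exactly what distinguishes cache-aside from read-through, where the cache layer does the DB fetch itself.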

Eviction Policies

  • LRU (Least Recently Used) — evict the item not accessed for the longest time. The usual choice (note that Redis itself defaults to noeviction; set maxmemory-policy to allkeys-lru).
  • LFU (Least Frequently Used) — evict the item accessed least often. Better for skewed access patterns.
  • TTL (Time To Live) — expire entries after a fixed time. Good for data that goes stale.

TTL Tiering — Driven by Your NFRs

Don't pick arbitrary TTLs — derive them from consistency requirements:

Data Type TTL Rationale
Static data (venue info, product descriptions) Hours Rarely changes
User profiles ~5 min Changes infrequently but must stay reasonably fresh
Volatile data (seat availability, inventory counts) Seconds Stale data causes real harm
Search results Match your stated SLA If NFR says "30s stale OK," that's your TTL

When NOT to Cache

  • Real-time collaborative systems — Google Docs, multiplayer games. Every change must be immediately visible. Caching actively hurts.
  • Strongly consistent financial data — account balances, inventory when overselling is catastrophic.
  • Write-heavy systems — if read/write ratio is ~1:1 or 2:1, hit rates are too low to justify the complexity.
  • User-specific private data — private messages, personal preferences. Only one user ever requests them — no cache hit rate benefit.

3. Cache Invalidation

Most production systems combine approaches — short TTL as a safety net plus active invalidation for critical data.

Strategy How It Works Best For
TTL expiry Entry expires automatically after a fixed time. No coordination needed. Data that can tolerate some staleness
Write-through App deletes or updates cache immediately when it writes to DB, in the same code path. User profiles, inventory counts — must stay fresh
Event-driven CDC (Debezium) reads DB WAL, publishes change events to a queue, cache invalidation consumer acts on them. Better than DB triggers — no hidden coupling. Cross-service invalidation, complex dependency graphs
Tagged invalidation Cache entries tagged (user:123:posts). On update, invalidate all entries sharing that tag. Post update that should clear feed, search, and profile caches
Versioned keys Include version in key (event:123:v42). On update, increment version → new key (event:123:v43). Old entries unreachable, expire via TTL. No race conditions. Entity-level data needing immediate consistency without invalidation broadcasts
Deleted items cache Small cache of recently deleted/modified IDs. Filter against it when serving lists or feeds. Content moderation, soft deletes, privacy changes
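The versioned-keys row can be sketched as below. A dict stands in for Redis; the per-entity version counter would be an INCR on its own key in Redis, and old entries would age out via TTL rather than sitting in memory.

```python
cache = {}      # stands in for Redis; old versions would expire via TTL there
versions = {}   # entity id -> current version (an INCR'd version key in Redis)

def cache_key(entity_id):
    return f"event:{entity_id}:v{versions.get(entity_id, 0)}"

def write_entity(entity_id, value):
    """Bump the version on every update: readers build the key from the
    current version, so old entries become unreachable immediately."""
    versions[entity_id] = versions.get(entity_id, 0) + 1
    cache[cache_key(entity_id)] = value

def read_entity(entity_id):
    return cache.get(cache_key(entity_id))
```

Because readers never see a half-invalidated state (the new key simply does not exist until the write lands), there is no window for the races that delete-then-repopulate invalidation can hit.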

4. Cache Stampede (Thundering Herd on Expiry)

When a popular cache entry expires, all concurrent requests see a cache miss simultaneously and race to rebuild it from the DB. At 100K reads/sec, that's 100K identical DB queries in the same instant — the DB collapses under a self-inflicted load spike.

Three mitigations:

1. Probabilistic early refresh — as the entry ages, each request has a small increasing probability of triggering a background refresh. At 50 min into a 60-min TTL: 1% chance. At 59 min: 20%. By expiry, the entry is already refreshed. No stampede ever occurs. Cleanest solution for most cases.

2. Background refresh process — a dedicated job refreshes critical entries before TTL expires. Guarantees the entry never goes cold. Cost: wasted refreshes for unrequested keys.

3. Request coalescing — when multiple requests see the same miss, only one fetches from DB; the rest wait for that result.

Cache miss for "celebrity:123"
→ Server 1: acquires in-flight lock, fetches from DB, populates cache
→ Server 2: sees in-flight lock, waits for Server 1's result
→ DB hit exactly once per app server — not once per user

Hot key fanout — for extreme loads: store copies under multiple keys (feed:taylor-swift:1 through feed:taylor-swift:10). Readers randomly pick one. Spreads 500K req/sec across 10 keys. Tradeoff: memory × N copies, invalidation must clear all.
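The fanout pattern can be sketched as follows, with a dict standing in for Redis and an assumed fanout factor of 10. In a real cluster the suffixed keys hash to different slots, so reads also spread across shards.

```python
import random

cache = {}
N_COPIES = 10   # assumed fanout factor

def write_hot(base_key, value):
    # Write the same value under N suffixed keys.
    for i in range(1, N_COPIES + 1):
        cache[f"{base_key}:{i}"] = value

def read_hot(base_key):
    # Each reader picks a random copy, spreading load across the keys.
    return cache.get(f"{base_key}:{random.randint(1, N_COPIES)}")

def invalidate_hot(base_key):
    # The tradeoff from above: invalidation must clear every copy.
    for i in range(1, N_COPIES + 1):
        cache.pop(f"{base_key}:{i}", None)
```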


5. Sorted Sets — Leaderboards and Rankings

A set where every member has a numeric score. Inserts and rank lookups are O(log n); range queries by score cost O(log n) plus the number of elements returned.

Commands:

  • ZADD leaderboard 9500 "user:42" — add or update member with score
  • ZREVRANGE leaderboard 0 9 WITHSCORES — top 10 by score, descending
  • ZRANK leaderboard "user:42" — rank of a member (0-indexed from lowest)
  • ZREVRANK leaderboard "user:42" — rank from highest (0 = top)
  • ZRANGEBYSCORE leaderboard 8000 +inf — all members with score ≥ 8000

Limits:

  • Max members per key: ~4.3 billion
  • Memory per member: ~50–100 bytes overhead + key + value size
  • At 10M members: ~500MB–1GB RAM — comfortable on a typical instance
  • All operations O(log n)

Other uses beyond leaderboards:

  • Priority queues — score is priority level; pop with ZPOPMIN
  • Expiring job queues — score is execute-at timestamp; worker polls with ZRANGEBYSCORE jobs 0 <now>
  • Rate limiting — score is request timestamp; count members in sliding window with ZCOUNT
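The rate-limiting bullet can be sketched as below. In Redis the per-user window is a sorted set (ZADD the timestamp, ZREMRANGEBYSCORE to trim, ZCARD to count); here a plain list stands in, and the window and limit values are assumed.

```python
import time

WINDOW_SECONDS = 60   # assumed sliding-window size
LIMIT = 5             # assumed max requests per window
windows = {}          # user -> list of request timestamps (a sorted set in Redis)

def allow_request(user_id, now=None):
    """Sliding window: keep one timestamp per request, trim old ones, count."""
    now = time.time() if now is None else now
    ts = windows.setdefault(user_id, [])
    cutoff = now - WINDOW_SECONDS
    ts[:] = [t for t in ts if t > cutoff]   # ZREMRANGEBYSCORE equivalent
    if len(ts) >= LIMIT:                    # ZCARD equivalent
        return False
    ts.append(now)                          # ZADD equivalent
    return True
```

Unlike a fixed-window counter, this never allows a burst of 2× the limit straddling a window boundary, at the cost of one stored timestamp per request.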

Real example: Gaming leaderboard — score is points, member is user ID. Top 10 is a single ZREVRANGE call. Rank lookup for any user is ZREVRANK — O(log n) regardless of how many players.
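The leaderboard example can be sketched with plain Python. A dict stands in for the sorted set; note that the sorts below are O(n log n) per call, whereas Redis keeps members ordered in a skip list and answers ZREVRANGE and ZREVRANK in O(log n).

```python
scores = {}   # member -> score; Redis keeps this ordered in a skip list

def zadd(member, score):
    # ZADD leaderboard <score> <member>: add or update.
    scores[member] = score

def top_n(n):
    """ZREVRANGE 0 n-1 WITHSCORES equivalent: highest scores first."""
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]

def zrevrank(member):
    """ZREVRANK equivalent: 0 = top player."""
    ranked = sorted(scores, key=lambda m: -scores[m])
    return ranked.index(member)
```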


6. Pub/Sub — Fan-out Notifications

Publisher sends to a channel. All current subscribers receive it instantly. Fire-and-forget — no persistence, no queue.

PUBLISH  notifications:user:42  '{"type":"seat_reserved","seat":"A12"}'
SUBSCRIBE notifications:user:42   → receives the message instantly

Limitation: Messages are lost if no subscriber is listening at that moment. A subscriber joining late gets nothing from before they subscribed. Redis crash = all in-flight messages lost.

Limits:

  • Throughput: millions of messages/sec — no disk I/O
  • Channels: unlimited (memory bound — each active channel with subscribers uses RAM)
  • Not suitable for > 10K active channels with high-volume subscribers without careful memory planning

Pub/Sub vs Kafka

Redis Pub/Sub Kafka
Delivery guarantee None — fire and forget At-least-once, exactly-once
Message persistence No Yes (days/weeks)
Consumer lag N/A — real-time only Consumers can fall behind and catch up
Use when Live presence, WebSocket fan-out across servers Payment events, audit logs, anything you can't afford to lose

Real example: Chat service — WebSocket servers subscribe to room:{roomId}. When a user sends a message, publish once → all servers with subscribers receive it and push to their connected clients. No need to know which server holds which connection.
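The chat fan-out can be sketched in-process as below; callbacks stand in for WebSocket pushes, and the subscriber registry stands in for Redis channel state. The fire-and-forget semantics fall straight out of the code: a message goes only to subscribers present at publish time.

```python
from collections import defaultdict

subscribers = defaultdict(list)   # channel -> callbacks (WebSocket pushes in real life)

def subscribe(channel, callback):
    # SUBSCRIBE room:{roomId}
    subscribers[channel].append(callback)

def publish(channel, message):
    """PUBLISH: deliver to everyone subscribed right now, then forget.
    No queueing: a subscriber joining after this call never sees the message."""
    for cb in list(subscribers.get(channel, [])):
        cb(message)
```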


7. Geo — Proximity Queries

Redis Geo is a sorted set where the score is a 52-bit geohash. Enables radius and bounding box queries in O(n + log m). Best solution for high-frequency location updates at scale.

Commands:

  • GEOADD drivers -74.006 40.713 "driver:42" — add or update location (overwrites previous — always have latest)
  • GEOSEARCH drivers FROMLONLAT -73.99 40.72 BYRADIUS 5 km ASC COUNT 20 — nearest 20 within 5km
  • GEOSEARCH drivers FROMLONLAT -73.99 40.72 BYBOX 10 10 km ASC — within bounding box
  • GEODIST drivers driver:42 driver:99 km — distance between two members
  • GEOSEARCH drivers FROMMEMBER driver:42 BYRADIUS 5 km ASC — search from an existing member's position

GEOSEARCH (Redis 6.2+) replaces the older deprecated GEORADIUS and GEORADIUSBYMEMBER.

Precision levels:

Geohash length Cell size Use for
4 ~20 km City-level
5 ~2.4 km Neighborhood
6 ~610 m Street-level
8 ~19 m Building-level

Full ride-sharing pattern:

Driver update every 5 seconds:
  → GEOADD drivers <lng> <lat> "driver:42"
  → ZADD driver_timestamps <unix_ts> "driver:42"   ← companion set to track last-seen

Rider requests match:
  → GEOSEARCH drivers FROMLONLAT <lng> <lat> BYRADIUS 5 km ASC COUNT 20

Stale driver cleanup (runs every 30s):
  → ZRANGEBYSCORE driver_timestamps 0 <now - 30s>  ← drivers not updated in 30s
  → ZREM driver_timestamps <stale_ids>
  → ZREM drivers <stale_ids>

The companion driver_timestamps sorted set (score = last-seen timestamp) is essential — without it, offline drivers stay in the geo set forever and appear available to riders.
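The update-and-cleanup loop above can be sketched as follows; dicts stand in for the geo set and the companion sorted set, and the 30-second staleness threshold matches the trace.

```python
STALE_AFTER = 30        # seconds without an update before a driver is offline

drivers = {}            # member -> (lng, lat); stands in for the geo set
driver_timestamps = {}  # member -> last-seen unix ts; the companion sorted set

def report_location(driver_id, lng, lat, now):
    drivers[driver_id] = (lng, lat)       # GEOADD drivers <lng> <lat> <id>
    driver_timestamps[driver_id] = now    # ZADD driver_timestamps <ts> <id>

def cleanup_stale(now):
    """ZRANGEBYSCORE driver_timestamps 0 <now - 30s>, then ZREM from both sets."""
    stale = [d for d, ts in driver_timestamps.items() if ts <= now - STALE_AFTER]
    for d in stale:
        del driver_timestamps[d]
        del drivers[d]
    return stale
```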

Durability note: Driver locations are ephemeral — every driver re-reports within 5 seconds. A Redis crash means ~5s of stale data before the set rebuilds. State this tradeoff explicitly in an interview. Mitigate with RDB snapshots or Redis Sentinel if needed.

Write optimization patterns for high-frequency updates:

Technique How When
Multi-member GEOADD Batch multiple drivers in one command — one round trip for N updates Always — zero tradeoff
Write coalescing Deduplicate by driver ID before writing — if 3 updates arrive in 200ms, only the latest matters Always — reduces Redis writes without staleness
Kafka as write buffer Location Service → Kafka → consumer batches and flushes to Redis every 100ms Rush hour spikes
Region sharding Separate keys per city (drivers:nyc, drivers:la) on different nodes Multi-city scale-out
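The write-coalescing row can be sketched as a tiny buffer that keeps only the latest position per driver between flushes; the flush output is shaped for a single multi-member GEOADD.

```python
pending = {}   # driver_id -> latest (lng, lat) seen since the last flush

def buffer_update(driver_id, lng, lat):
    # Later updates overwrite earlier ones: only the newest position survives.
    pending[driver_id] = (lng, lat)

def flush():
    """Emit one entry per driver, suitable as the arguments of one
    multi-member GEOADD, then clear the buffer."""
    batch = list(pending.items())
    pending.clear()
    return batch
```

Run `flush()` on a short timer (the Kafka-buffer row above uses 100ms): three updates for the same driver inside one interval become a single Redis write.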

Limits:

  • ~4.3 billion members per key (same as sorted sets — it's built on them)
  • Coordinate precision: well under a meter at the equator (52-bit geohash)
  • Radius query: O(n + log m) — fast up to millions of members

8. Distributed Locks

Prevents two workers from processing the same job simultaneously. In-memory locks only exist within one instance — SETNX moves the lock into a shared, atomic, external store.

Why not a DB transaction lock? DB transaction locks hold for the life of a transaction — designed for millisecond durations. Distributed locks hold for longer (e.g. seat reserved for 10 minutes while user pays). Using a DB transaction for that would block the row and kill throughput.

Option 1 — Redis SETNX (simple, fast)

SET lock:job_123 worker_A NX PX 30000

NX = only set if not exists. PX 30000 = expire in 30 seconds (auto-release if worker crashes).

  • Fast (~1ms)
  • Single Redis node = single point of failure
  • If the TTL expires while the holder is paused (e.g. a long GC), two clients can hold the lock simultaneously
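The SETNX pattern can be sketched with an in-memory store standing in for Redis. One detail beyond the command above: release must check ownership via the stored token, otherwise a worker whose lock already expired could delete someone else's lock; in Redis this check-and-delete is typically made atomic with a small Lua script.

```python
import time
import uuid

store = {}   # key -> (owner_token, expires_at); stands in for Redis

def acquire(lock_key, ttl_seconds, now=None):
    """SET <key> <token> NX PX <ttl>: succeed only if absent or expired."""
    now = time.time() if now is None else now
    entry = store.get(lock_key)
    if entry is not None and entry[1] > now:
        return None                              # someone else holds it
    token = str(uuid.uuid4())                    # unique per acquisition
    store[lock_key] = (token, now + ttl_seconds)
    return token

def release(lock_key, token):
    """Delete only if we still own the lock (compare the stored token)."""
    entry = store.get(lock_key)
    if entry is not None and entry[0] == token:
        del store[lock_key]
        return True
    return False
```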

Option 2 — Redlock (multiple Redis nodes)

Acquire lock on a majority (3 of 5) Redis nodes within a time window. More robust than single-node but controversial — a process can pause (JVM GC) after acquiring the lock, TTL expires across all nodes, and another process acquires it. Both now believe they hold it.

The Redlock debate: Martin Kleppmann argued it's unsafe due to process pauses. Antirez (Redis creator) rebutted that extreme GC pauses are unlikely in practice. Never fully resolved. Practical takeaway: Redlock is fine when occasional double execution is survivable.

Option 3 — etcd (strongest guarantee)

Uses Raft — lock acquisition is linearizable. No clock drift issues. Slower (~5–10ms). Use when correctness is mandatory.

Fencing Tokens — the Correct Solution for Strong Safety

Every etcd write returns a monotonically increasing revision number. Pass this as a token to the protected resource. The resource rejects any request with a lower token than the last seen.

Process A acquires lock → gets token 42
Process A pauses (GC)
Lock expires → Process B acquires → token 43 → writes with token 43
Process A wakes up → tries to write with token 42 → resource rejects (42 < 43) ✓

Redis has no equivalent. If you need this level of correctness, use etcd.
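The resource-side check in the trace above can be sketched as below; the class name and fields are illustrative, the token would be the etcd revision number.

```python
class FencedResource:
    """A protected resource that rejects writes carrying a token lower
    than the highest token it has seen."""

    def __init__(self):
        self.highest_token = -1
        self.data = None

    def write(self, token, value):
        if token < self.highest_token:
            return False              # stale lock holder: reject
        self.highest_token = token
        self.data = value
        return True
```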

When to Use Each

Option Use When
Redis SETNX Job deduplication, idempotency, rate limiting — OK if occasional double execution is survivable
Redlock Higher confidence without etcd — workloads tolerant of rare edge-case failures
etcd lock Leader election, financial writes, inventory — correctness is mandatory

Real examples: Seat reservation (lock seat for 10 min while user pays), ride-share driver assignment (prevent matching one driver to two riders), distributed cron jobs (run on exactly one server).


9. Persistence

Mode How Data Loss on Crash Use When
No persistence DB is source of truth. Redis crash = cache gone, app rebuilds from DB. All in-flight data Pure cache — simplest, fastest
RDB snapshots Dumps memory to disk every N seconds. Writes since last snapshot Fast recovery on restart ("warm" cache)
AOF Every write logged to disk. appendfsync everysec = ~1s loss. appendfsync always = near-zero. ~1 second (default config) Need durability without full sync overhead
Replicas Primary handles writes, replicas handle reads. Promote replica on primary failure. Milliseconds of replication lag High availability + read scale-out

AOF + RDB can be combined; when both exist, Redis restores from the AOF on restart, since it is usually the more complete of the two.


10. Replication and Scaling

Redis Sentinel — High Availability

Monitors a primary + replica setup. Auto-promotes a replica when the primary fails (~seconds). All data lives on one primary (replicated to replicas). Read traffic distributes across replicas.

Use when: dataset fits on one node, you need HA without sharding complexity.

Redis Cluster — Horizontal Scale

Shards data across multiple primaries using 16,384 hash slots. Each shard has its own primary + replicas with built-in per-shard failover.

Use when: dataset too large for one node OR write throughput exceeds single-node capacity.

Sentinel vs Cluster

Redis Sentinel Redis Cluster
Purpose High availability — failover Horizontal scale — sharding
Data distribution All data on one primary Split across multiple primaries
Failover Sentinel promotes replica Built-in, per shard
Complexity Low Higher — client must be cluster-aware

Capacity Reference

Metric Value
Single instance RAM ~10–64GB practical sweet spot (max ~100GB)
Throughput ~100K–1M ops/sec
Max key or value size 512MB
Max keys per instance ~4 billion
Sorted set operations O(log n)
Pub/Sub throughput Millions of messages/sec