Technology Selection
Which technology to pick and why. Use this after NFRs tell you what the system needs — these are the decision criteria that map requirements to specific tools.
Database
Choose the right database for your access pattern, consistency requirements, and scale. Default to SQL unless you have a clear reason not to.
SQL vs NoSQL — Pick One First
The most important database decision — make this call before comparing specific products.
| Choose SQL when... | Choose NoSQL when... |
|---|---|
| Data is structured and relational | Data is unstructured, hierarchical, or variable schema |
| You need ACID transactions | You need massive write scale or eventual consistency is OK |
| Complex queries, joins, aggregations | Simple key-value or document lookups |
| Strong consistency required | Horizontal scale is the top priority |
| Data fits on one server or with replicas | Data must be sharded across many nodes |
| Team knows SQL | Column-family or document model fits the access pattern better |
Rule of thumb: Default to SQL (PostgreSQL). Only switch to NoSQL when you have a clear reason — scale, schema flexibility, or access pattern that SQL handles poorly.
SQL Databases
Relational databases with ACID guarantees — pick based on scale needs and whether you're on AWS or self-hosted.
| DB | Best For | Limits | Avoid When |
|---|---|---|---|
| PostgreSQL | General purpose. Complex queries, JSONB, full-text search, geospatial (PostGIS). | ~1–2TB comfortable, ~5–10K writes/sec before sharding | Need extreme write throughput or massive horizontal scale |
| MySQL | High-read web apps. Slightly faster reads than Postgres on simple queries. Battle-tested. | Similar to PostgreSQL | Need advanced features (window functions, JSONB) — use Postgres instead |
| CockroachDB | Distributed SQL. Horizontal scale with ACID. Multi-region. | Higher latency than single-node Postgres (~5–10ms vs 1–2ms) | Latency is critical and data fits on one node |
| Amazon Aurora | Managed PostgreSQL/MySQL. 5× faster than standard MySQL. Auto-scales storage. | AWS lock-in. More expensive. | On-prem or multi-cloud requirement |
NoSQL Databases
Databases optimized for specific access patterns — each trades ACID guarantees for a different combination of scale, flexibility, or query speed.
| DB | Model | Best For | Limits | Avoid When |
|---|---|---|---|---|
| Cassandra | Wide-column | High write throughput, time-series, event logs, messaging. Scales to PB. | Eventual consistency only. No joins. Query patterns must be known upfront — design tables around queries. | Strong consistency needed, or ad-hoc queries |
| DynamoDB | Key-value + document | Serverless scale, unpredictable traffic, AWS ecosystem. Auto-scales reads/writes. | AWS lock-in. Expensive at high scale. Limited query flexibility. | Complex queries or non-AWS stack |
| MongoDB | Document (JSON) | Flexible/nested data, content management, catalogs, user profiles. | Consistency weaker than SQL. Not great for relational data. | Highly relational data with lots of joins |
| Redis | Key-value + data structures | Cache, sessions, leaderboards, pub/sub, geo, rate limiting. Sub-ms reads. | Data must fit in RAM. Not a primary DB for large datasets. | Durability is critical without backup strategy |
| InfluxDB / TimescaleDB | Time-series | Metrics, monitoring, IoT sensor data, financial tick data. | Not general purpose. Queries outside time-range patterns are awkward. | Data isn't time-series |
Database Decision Tree
Follow this path from requirements to a specific database choice in under 30 seconds.
Need ACID transactions across multiple tables?
├── Yes → SQL (PostgreSQL default)
│ ├── Need horizontal write scale? → CockroachDB or shard PostgreSQL
│ └── Managed + AWS? → Aurora
└── No → What's the access pattern?
├── High write throughput + time-ordered? → Cassandra
├── Key lookups + AWS + auto-scale? → DynamoDB
├── Flexible/nested documents? → MongoDB
├── Cache + data structures + sub-ms? → Redis
└── Timestamped metrics/sensor data? → InfluxDB / TimescaleDB
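The tree above can also be read as a plain function. A sketch of that mapping (the parameter names and access-pattern labels are invented here for illustration, not from any real API):

```python
def pick_database(needs_acid, horizontal_write_scale=False,
                  managed_on_aws=False, access_pattern=""):
    """Mirror of the decision tree above (illustrative only)."""
    if needs_acid:
        if horizontal_write_scale:
            return "CockroachDB (or sharded PostgreSQL)"
        if managed_on_aws:
            return "Aurora"
        return "PostgreSQL"
    # No ACID requirement: branch on access pattern
    return {
        "high-write-time-ordered": "Cassandra",
        "key-lookup-aws": "DynamoDB",
        "flexible-documents": "MongoDB",
        "cache-data-structures": "Redis",
        "time-series": "InfluxDB / TimescaleDB",
    }.get(access_pattern, "PostgreSQL")  # unknown pattern: default back to SQL
```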
Message Queue
Choose the right queue or stream based on durability needs, throughput, routing complexity, and whether you want managed infrastructure.
Kafka vs SQS vs RabbitMQ vs Redis Pub/Sub
Four queue options compared across the dimensions that matter most for picking one in an interview.
| | Kafka | SQS | RabbitMQ | Redis Pub/Sub |

|---|---|---|---|---|
| Durability | Yes — messages persisted to disk | Yes — managed by AWS | Yes | No — fire and forget |
| Replay | Yes — consumers can replay from any offset | No — consumed messages deleted | No | No |
| Multiple consumers | Yes — consumer groups each get all messages | No — one consumer gets each message | Yes — via exchanges/routing | Yes — all subscribers get each message |
| Throughput | Millions/sec | ~3,000/sec per queue (standard) | ~50K/sec | Millions/sec |
| Latency | 5–15ms | ~1–10ms | ~1–5ms | < 1ms |
| Ordering | Per partition | Best-effort (FIFO queues: strict but slower) | Per queue | Not guaranteed |
| Complexity | High — brokers, ZooKeeper/KRaft, partitions | Low — fully managed | Medium | Very low |
When to use each:
- Kafka: High throughput event streaming. Multiple independent consumers. Replay needed (audit, ML pipelines). Event sourcing. Use for: Uber GPS, analytics pipelines, activity feeds.
- SQS: Simple async task queue. AWS ecosystem. Don't want to manage infrastructure. Each task processed by one worker. Use for: email sending, image resizing, background jobs.
- RabbitMQ: Complex routing rules. Priority queues. Message TTL. Fan-out with filtering. Use for: task routing across microservices, job prioritization.
- Redis Pub/Sub: Real-time fan-out with no durability requirement. Extremely low latency. Use for: WebSocket fan-out, live notifications, chat room broadcasting.
Rule of thumb: Kafka for streams, SQS for tasks, RabbitMQ for routing, Redis Pub/Sub for real-time broadcast.
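The durability and replay rows in the table come down to one structural difference: a Kafka-style log keeps every message and lets each consumer group track its own offset, while an SQS-style queue deletes a message once a worker consumes it. A minimal in-memory sketch of that difference (illustrative classes, not real client APIs):

```python
class Log:
    """Kafka-style: append-only log, consumer groups track their own offsets."""
    def __init__(self):
        self.messages = []
        self.offsets = {}              # consumer group -> next offset to read

    def publish(self, msg):
        self.messages.append(msg)      # never deleted

    def poll(self, group):
        off = self.offsets.get(group, 0)
        self.offsets[group] = len(self.messages)
        return self.messages[off:]     # every group sees every message

    def replay(self, group, offset=0):
        self.offsets[group] = offset   # rewind is free: just move the offset


class Queue:
    """SQS-style: each message goes to exactly one worker, then it's gone."""
    def __init__(self):
        self.pending = []

    def publish(self, msg):
        self.pending.append(msg)

    def receive(self):
        return self.pending.pop(0) if self.pending else None
```

Note that `poll` never removes messages, which is exactly what makes replay and multiple independent consumer groups possible in Kafka, and impossible in a delete-on-consume queue.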
Cache
Redis is the default choice for almost every caching use case — pick Memcached only if you need pure string throughput with no other features.
Redis vs Memcached
Feature-by-feature comparison — Redis wins in almost every column, but Memcached has lower overhead for pure get/set workloads.
| | Redis | Memcached |
|---|---|---|
| Data structures | Strings, Lists, Sets, Sorted Sets, Hashes, Geo, Streams | Strings only |
| Persistence | Optional (RDB snapshots, AOF log) | None |
| Pub/Sub | Yes | No |
| Cluster / sharding | Redis Cluster built-in | Client-side sharding only |
| Threads | Single-threaded commands (I/O multi-threaded since v6) | Multi-threaded |
| Throughput | ~100K–1M ops/sec | ~1M+ ops/sec for simple get/set |
| Memory overhead | Higher (richer data structures) | Lower |
When to choose Memcached: Pure simple string cache with extremely high get/set throughput and no need for any Redis features. Rare — Redis handles this fine for most systems.
When to choose Redis: Everything else. Sorted sets, pub/sub, geo, persistence, Lua scripts, cluster support. Redis is the default choice.
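Since Memcached only supports client-side sharding (see the table above), the client itself must decide which node owns each key; consistent hashing is the standard approach so that adding or removing a node remaps only a small slice of keys. A minimal hash-ring sketch (hypothetical class, not a real client library):

```python
import bisect
import hashlib

class HashRing:
    """Client-side consistent hashing, as a Memcached client would do."""
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` positions on the ring for even spread
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first node position at or after the key's hash
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]
```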
Load Balancer
Distributes traffic across app servers — the choice depends on protocol, latency requirements, and whether you're on AWS or self-managed.
| | AWS ALB | AWS NLB | Nginx / HAProxy |
|---|---|---|---|
| Protocol | HTTP/HTTPS, WebSocket | TCP/UDP, TLS passthrough | HTTP, TCP, anything |
| Latency | ~1–5ms | ~100µs — extremely fast | ~1ms self-managed |
| Routing | Path-based, host-based, header-based | IP + port only | Highly configurable |
| SSL termination | Yes | No (passthrough) or Yes (TLS) | Yes |
| Managed | Fully (AWS) | Fully (AWS) | Self-managed |
| Use when | Default for web apps, APIs, WebSocket | Gaming, VoIP, financial, static IP needed | Not on AWS, or need fine-grained control |
Rule of thumb: Default to AWS ALB for web and API traffic. Use NLB only when you need sub-millisecond latency, UDP, or a static IP. Use Nginx/HAProxy when you're not on AWS or need custom configuration.
API Gateway
A single entry point for all client requests. Its primary purpose is routing — directing requests to the right backend service. Middleware (auth, rate limiting, logging) is secondary. Clients don't need to know your internal service structure.
Request flow:
Request → Validate → Middleware → Route → Backend → Transform → Cache → Response
What it handles (middleware responsibilities):
- Auth — JWT validation, API keys, OAuth token introspection
- Rate limiting — per user, per IP, per endpoint (see building_blocks for algorithms)
- SSL termination — decrypt HTTPS at gateway, plain HTTP internally (offloads CPU from backends)
- Request/response transformation — HTTP ↔ gRPC protocol translation, header injection, body format conversion
- Caching — cache full responses for non-user-specific endpoints (e.g. product catalog, public feeds). Never cache user-specific data.
- Logging and distributed tracing — inject trace IDs, forward to Datadog/Prometheus
- Circuit breaker — fail fast to a struggling backend service (see Reliability Patterns in building_blocks)
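As an example of one middleware responsibility above, per-key rate limiting is often implemented as a token bucket: each key refills at a steady rate and can burst up to a cap. A minimal sketch (illustrative only; see building_blocks for the full set of algorithms):

```python
import time

class TokenBucket:
    """Per-key token bucket: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = {}   # key -> remaining tokens
        self.last = {}     # key -> timestamp of last refill

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        elapsed = now - self.last.get(key, now)
        self.last[key] = now
        # Refill proportionally to elapsed time, capped at capacity
        tokens = min(self.capacity,
                     self.tokens.get(key, self.capacity) + elapsed * self.rate)
        if tokens >= 1:
            self.tokens[key] = tokens - 1
            return True
        self.tokens[key] = tokens
        return False
```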
Two LB layers in practice:
[Clients]
↓
[Load Balancer] ← distributes across gateway instances (AWS ALB)
↓
[API Gateway cluster] ← stateless, scales horizontally
↓
[Backend Services] ← gateway load-balances across service instances
The gateway is stateless — no session data stored in the gateway itself. This makes it trivially horizontally scalable: add more instances behind the ALB.
Routing example:
/users/* → user-service:8080
/orders/* → order-service:8081
/payments/* → payment-service:8082
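A gateway resolves a routing table like the one above by longest-prefix match, so more specific routes win. A tiny sketch of that lookup (hypothetical helper, using the prefixes from the example):

```python
ROUTES = {
    "/users/": "user-service:8080",
    "/orders/": "order-service:8081",
    "/payments/": "payment-service:8082",
}

def route(path, routes=ROUTES):
    """Longest-prefix match, as a gateway's router would do."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    return routes[max(matches, key=len)] if matches else None
```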
Protocol translation: Gateway can accept HTTP from clients and call backends over gRPC. Backend services use the most efficient protocol without the client needing to know.
Global distribution: For global users, deploy gateway instances in multiple regions + GeoDNS to route each user to the nearest gateway — same strategy as CDN edge nodes.
Technology options:
| | AWS API Gateway | Kong | Nginx (as gateway) | No gateway |
|---|---|---|---|---|
| Auth / JWT | Built-in | Plugin | Lua scripts | App handles it |
| Rate limiting | Built-in | Plugin | limit_req module | App handles it |
| Cost | Per-request (~$3.50/million) | Infrastructure cost | Infrastructure cost | Zero |
| Latency added | ~10ms | ~2–5ms | ~1–2ms | 0 |
| Managed | Fully | Self-managed | Self-managed | — |
| Protocol translation | HTTP/WebSocket only | HTTP, gRPC | HTTP, TCP | — |
| Use when | AWS ecosystem, serverless, public API | High volume, plugin ecosystem, gRPC | Already using Nginx as LB | Single service, internal API, low scale |
When you don't need one: Single-service apps, internal APIs, or when your app server already handles auth and rate limiting cleanly. A gateway adds a hop and complexity — only add it when cross-cutting logic would otherwise be duplicated across many services.
Real systems — what the gateway actually does:
| System | Scale | Key Gateway Responsibilities |
|---|---|---|
| Netflix | ~2B API req/day | Dynamic routing, A/B and canary testing, auth |
| Uber | 2000+ microservices | Multi-client routing (rider vs driver app), real-time WebSocket, geo-routing to regional services |
| Twitter | 500M+ tweets/day | OAuth auth, heavy timeline caching, public API rate limiting per key |
| E-commerce | Flash sale peaks | Rate limiting during flash sales, product catalog caching, /products/*, /orders/*, /cart/* routing |
| Chat (WhatsApp-style) | Millions of connections | WebSocket connection management, JWT auth, message rate limiting per user |
| Ride sharing | Continuous location updates | Separate rider/driver routing, real-time WebSocket for location, geo-routing |
Interview tip: Say "I'll add an API Gateway for routing and middleware" then move on. Don't over-explain the gateway — it's not the interesting part of the design. You can draw the Load Balancer and API Gateway as a single entry-point box if the interviewer doesn't ask about them specifically.
Rule of thumb: Microservices or multiple client types → API Gateway. Single service or internal API → skip it.
Object Storage
For files, images, and video. All major clouds have a native object store — pick based on your cloud ecosystem, then consider cost and compliance.
By cloud provider:
| | AWS | GCP | Azure |
|---|---|---|---|
| Object storage | S3 | Cloud Storage (GCS) | Blob Storage |
| Presigned URLs | S3 Presigned URLs | GCS Signed URLs | SAS (Shared Access Signature) tokens |
| CDN | CloudFront | Cloud CDN | Azure CDN / Front Door |
| CDN signed URLs | CloudFront Signed URLs | Cloud CDN Signed URLs | Azure CDN token auth |
| Managed encryption keys | SSE-S3 (default) | Google-managed keys (default) | Azure Storage Service Encryption (default) |
| Customer-managed keys | SSE-KMS (AWS KMS) | Cloud KMS | Azure Key Vault |
| HSM | AWS CloudHSM | Cloud HSM | Azure Dedicated HSM |
| Serverless trigger on upload | S3 → Lambda | GCS → Cloud Functions | Blob Storage → Azure Functions |
| Message queue | SQS | Cloud Pub/Sub | Azure Service Bus |
The patterns covered in this guide — presigned URLs, two-bucket strategy, signed CDN URLs, multipart upload — are identical across all three. Only the API names differ.
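Underneath, all three presigned/signed-URL mechanisms follow the same idea: embed an expiry in the URL and sign it with an HMAC that the server verifies before serving the object. A scheme-agnostic sketch (not any cloud's actual signing algorithm; the key handling and canonical string are heavily simplified):

```python
import hashlib
import hmac
import time

SECRET = b"shared-signing-key"  # hypothetical; real clouds derive scoped keys

def presign(path, expires_in=3600, now=None):
    """Return a URL valid until now + expires_in seconds."""
    expiry = int(now if now is not None else time.time()) + expires_in
    sig = hmac.new(SECRET, f"{path}?expires={expiry}".encode(),
                   hashlib.sha256).hexdigest()
    return f"{path}?expires={expiry}&sig={sig}"

def verify(url, now=None):
    """Recompute the signature and check expiry, as the server would."""
    path, query = url.split("?", 1)
    params = dict(kv.split("=") for kv in query.split("&"))
    expiry = int(params["expires"])
    expected = hmac.new(SECRET, f"{path}?expires={expiry}".encode(),
                        hashlib.sha256).hexdigest()
    ts = now if now is not None else time.time()
    return ts < expiry and hmac.compare_digest(params["sig"], expected)
```

Tampering with either the path or the expiry invalidates the signature, which is why a leaked presigned URL only grants access to one object for a bounded time.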
Cross-cloud comparison — object storage itself:
| | Amazon S3 | Google Cloud Storage | Azure Blob Storage | Cloudflare R2 | Self-hosted (MinIO) |
|---|---|---|---|---|---|
| Egress cost | ~$0.09/GB | ~$0.08/GB | ~$0.087/GB | Free | Infrastructure only |
| Latency | 10–100ms | 10–100ms | 10–100ms | Similar to S3 | Depends on hardware |
| Scale | Unlimited | Unlimited | Unlimited | Unlimited | Limited by hardware |
| Ecosystem | Massive | Strong (GCP native) | Strong (Azure native) | S3-compatible API | S3-compatible API |
| Use when | AWS ecosystem | GCP ecosystem | Azure ecosystem | Egress cost is a concern | On-prem / compliance |
Rule of thumb: Use whichever matches your cloud ecosystem — S3 on AWS, GCS on GCP, Blob Storage on Azure. Switch to Cloudflare R2 if egress costs are significant regardless of cloud. Self-host only for compliance or data residency requirements.
Search
Match your search solution to dataset size and operational tolerance — zero infra for small datasets, Elasticsearch for billions of documents.
| | PostgreSQL Full-Text | Elasticsearch | Typesense / Meilisearch |
|---|---|---|---|
| Setup | Zero — already in your DB | Heavy — separate cluster | Lightweight |
| Scale | Up to ~10M documents comfortably | Billions of documents | Millions of documents |
| Fuzzy / typo tolerance | No | Yes (edit distance) | Yes (built-in) |
| Relevance tuning | Basic | Highly configurable | Good defaults, less configurable |
| Latency | 50–200ms | 5–50ms | 1–10ms |
| Built-in caching | No | Yes (filter cache + request cache) | Limited |
| Operational cost | None | High (JVM, cluster management) | Low |
When to use each:
- PostgreSQL FTS: Search on < 10M rows, exact-word matching is acceptable, no typo tolerance needed. Already on Postgres — zero extra infra. Use tsvector + GIN index; LIKE '%keyword%' forces a full table scan and should never be used at scale.
- Elasticsearch: Complex search, fuzzy matching, faceted filtering, log analytics, autocomplete at scale. Worth the ops cost.
- Typesense / Meilisearch: Fast autocomplete/search for smaller datasets, typo tolerance out of the box, much simpler than Elasticsearch.
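The fuzzy/typo-tolerance row above refers to edit distance: Elasticsearch's fuzziness setting bounds the Levenshtein distance between the query and an indexed term. A minimal sketch of that metric (illustrative only; real engines use precomputed automata, not per-term dynamic programming):

```python
def edit_distance(a, b):
    """Levenshtein distance: the metric behind fuzzy matching."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def fuzzy_match(query, term, max_edits=2):
    """Rough analogue of an Elasticsearch query with fuzziness: 2."""
    return edit_distance(query, term) <= max_edits
```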
Keeping Elasticsearch in sync with PostgreSQL:
The hardest part of adding Elasticsearch is keeping it consistent with your source-of-truth DB. Options ranked best to worst:
| Approach | Lag | Risk | Verdict |
|---|---|---|---|
| CDC (Debezium) + direct to ES | ~seconds | Low — captures WAL-level changes including non-app writes | Best default |
| CDC (Debezium) + Kafka + ES consumer | ~seconds | Very low — Kafka buffers; ES consumer can catch up after lag | Best if Kafka already in stack |
| Scheduled batch sync | Minutes | Can't capture hard deletes cleanly; high latency | Acceptable only for non-real-time search |
| Dual write (app writes both) | ~ms | Partial failure risk — DB write succeeds, ES write fails → silent divergence | Avoid |
CDC wins because it captures changes from anywhere — DB migrations, admin scripts, other services — not just your application code. Debezium reads the PostgreSQL WAL, publishes change events, and an ES consumer indexes them. Near real-time with no application code changes required.
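The consumer side of that pipeline reduces to applying change events idempotently. A sketch with simplified event shapes (a plain dict stands in for the Elasticsearch index; real Debezium envelopes carry more fields, but the c/u/d op codes are Debezium's own):

```python
def apply_change(index, event):
    """Apply one simplified CDC event to the search index.

    `index` is a dict standing in for Elasticsearch. Upserts make
    reprocessing the same event safe, so the consumer can replay
    after a lag or crash without corrupting the index.
    """
    op, key = event["op"], event["key"]
    if op in ("c", "u"):        # create / update -> idempotent upsert
        index[key] = event["after"]
    elif op == "d":             # delete -- hard deletes are captured too
        index.pop(key, None)
    return index
```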
Quick Reference — Decision Summary
One row per requirement — the fastest path from what you need to which technology to name in an interview.
| Need | Pick |
|---|---|
| General relational data, ACID, complex queries | PostgreSQL |
| Horizontal SQL scale, multi-region | CockroachDB |
| High write throughput, time-ordered, massive scale | Cassandra |
| Simple key-value, AWS, auto-scale | DynamoDB |
| Flexible nested documents | MongoDB |
| Cache + data structures + real-time | Redis |
| Time-series metrics / IoT | InfluxDB or TimescaleDB |
| High-throughput event stream, replay, multiple consumers | Kafka |
| Simple async tasks, managed, AWS | SQS |
| Complex message routing, priority | RabbitMQ |
| Real-time broadcast, no durability needed | Redis Pub/Sub |
| File / image / video storage | S3 (default) |
| Full-text search, large dataset | Elasticsearch |
| Fast autocomplete, simple setup | Typesense |
| Web / API traffic load balancing (AWS) | AWS ALB |
| Ultra-low latency, UDP, static IP | AWS NLB |
| Self-managed load balancing | Nginx or HAProxy |
| Public API, auth + rate limiting, AWS | AWS API Gateway |
| High request volume, plugin ecosystem | Kong |
| Auth + rate limiting, already using Nginx | Nginx (as gateway) |
| Long-running workflow, human delays, fault-tolerant | Temporal |
| Simple AWS-native workflow, Lambda orchestration | AWS Step Functions |