
Non-Functional Requirements

Don't draw boxes until you know what the system demands. For each NFR, this doc covers what it means, how the answer changes your architecture layer by layer, key terms, and which real systems make it their top priority. Pick the ones most relevant to the system and let them drive your design.


Scale

How big is the system, and where does the load actually hit? Scale affects every layer — not just the database.

Ask:

  • How many daily active users?
  • What's the read/write ratio?
  • Any bursty traffic patterns (holidays, events)?

DAU → QPS conversion:

A day has 86,400 seconds. In interviews, round that to 100,000 — the 16% error is irrelevant at estimation scale and the mental math becomes trivial.

QPS = DAU × requests_per_user_per_day ÷ 100,000

Worked example with 1M DAU and 10 requests/user/day:

QPS = 1,000,000 × 10 ÷ 100,000
    = 10,000,000 ÷ 100,000
    = 100 QPS

For peak QPS, multiply by 2–3× (traffic isn't uniform — mornings and evenings are heavier):

Peak QPS = 100 × 3 = 300 QPS
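
As a sanity check, the conversion can be wrapped in a few lines (a sketch of the estimation above; the 100,000-second day and the 3× peak factor are the approximations this section uses):

```python
def estimate_qps(dau: int, requests_per_user_per_day: float, peak_factor: float = 3.0):
    """Back-of-envelope average and peak QPS, using the rounded 100,000-second day."""
    avg_qps = dau * requests_per_user_per_day / 100_000
    return avg_qps, avg_qps * peak_factor

avg, peak = estimate_qps(1_000_000, 10)
print(avg, peak)   # 100.0 300.0 — matches the worked example above
```
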

Estimated QPS by DAU, and the layer-by-layer impact:

  • 10K DAU (~1 QPS): Single server handles everything. No LB, no replicas, no cache needed. (AWS t3.micro, GCP e2-micro — handles thousands of QPS, overkill at 1 QPS)
  • 100K DAU (~10 QPS): Add LB for redundancy (not load — 10 QPS is trivial). Redis cache if queries are expensive. CDN for static assets. (AWS: ALB + 2× t3.small. GCP: Cloud Load Balancing + 2× e2-small)
  • 1M DAU (~100 QPS): Multiple app servers behind LB. DB read replicas (1–2). Connection pooler to avoid exhausting DB connections. (AWS: RDS PostgreSQL — up to 5 read replicas, RDS Proxy for connection pooling. GCP: Cloud SQL PostgreSQL — up to 10 read replicas, Cloud SQL Proxy for connection pooling)
  • 10M DAU (~1,000 QPS): Kafka for async writes. Redis Cluster for cache. DB read replicas still sufficient for reads — don't shard yet. (AWS: Aurora — up to 15 read replicas, handles 100K+ reads/sec, Aurora Serverless v2 auto-scales. GCP: AlloyDB for PostgreSQL — up to 16 read pool nodes, auto-scales. Sharding threshold is 10K–50K QPS or when data volume outgrows a single host)
  • 100M+ DAU (~10,000+ QPS): Multi-region everything. Global LB. DB sharding or distributed SQL. Connection pooling critical at this tier. CDN serves 80%+ of traffic. (AWS: Aurora Global Database, RDS Proxy, CloudFront + Route53 latency routing. GCP: Cloud Spanner for distributed SQL, Cloud SQL Proxy, Cloud CDN + global anycast Cloud Load Balancing. Cloud-agnostic: CockroachDB, Cloudflare CDN)

How scale changes tech choices per layer:

  • App servers: Single instance → horizontal auto-scaling group → stateless containers (AWS ECS/EKS, GCP Cloud Run/GKE, or self-managed K8s)
  • Database: PostgreSQL single → read replicas → sharding → NoSQL or distributed SQL
  • Cache: No cache → Redis single → Redis Cluster (partitioned across nodes)
  • LB: Not needed → single regional LB (AWS ALB, GCP Cloud Load Balancing) → multi-AZ → global load balancing (AWS Route53 latency routing, GCP global anycast Cloud Load Balancing)
  • CDN: Not needed → CDN for static (AWS CloudFront, GCP Cloud CDN, or Cloudflare) → full edge caching with dynamic content

Read/Write ratio shapes your architecture:

  • Read-heavy (100:1) → cache aggressively (Redis, CDN), DB read replicas. Twitter feed, Reddit homepage.
  • Write-heavy (1:10) → message queues (Kafka) to absorb bursts, append-only logs, async consumers. Consider CQRS (separate write model from read model) to prevent reads from competing with writes. Logging pipeline, analytics ingestion.
  • Balanced → general-purpose horizontal scaling.

Burst traffic: If traffic spikes at predictable times (Black Friday, live events), design for auto-scaling and queue-based buffering, not steady-state peak capacity.

Storage estimate: DAU × avg_event_size × events_per_day × retention_days
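
A quick plug-in of hypothetical numbers (1M DAU, ~1KB events, 10 events/user/day, 1-year retention; all figures illustrative):

```python
dau = 1_000_000
avg_event_size = 1_000      # bytes, ~1 KB per event (assumed)
events_per_day = 10
retention_days = 365

storage_bytes = dau * avg_event_size * events_per_day * retention_days
print(storage_bytes / 1e12)  # 3.65 — i.e. about 3.65 TB over the retention window
```
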

Most critical for: Twitter/X (read-heavy feed), YouTube (video storage + CDN), Uber (surge traffic), ticketing systems (flash sales).


Latency

How fast must the system respond? This determines where you place compute, what stays synchronous, and what you offload.

Ask:

  • What's the acceptable p99 response time?
  • Are there specific operations that must be fast?

What is p99? Sort all requests by response time; p99 is the value that 99% of requests complete faster than. P50 is the median. P99.9 covers all but 1 in 1,000 requests. At 1,000 QPS, a p99 threshold means 10 requests every second are slower than that number.

  • < 10ms: Ultra-low; real-time systems. Data must live in-process memory or local Redis. No network hops. Compute co-located with data.
  • < 100ms: Feels instant to users. Read from Redis cache (same DC, 1ms), not DB (10ms). CDN serves assets from edge node 5ms away, not 60ms to origin. Precompute results offline.
  • < 500ms: Interactive; standard web UX. Cache reads from Redis. Async writes (publish to queue, return 200). DB reads must hit indexes.
  • 1–5s: Tolerable for complex queries. Background jobs for heavy computation. DB aggregations OK if indexed. Show loading states.
  • > 5s: Batch is fine. Async processing, queues, offline jobs. No sync response needed.

Why p99 and not average — average hides pain: 99 requests complete in 50ms. One hangs for 10,000ms.

Average = (99×50 + 10,000) / 100 = 149.5ms  ← looks healthy
P99     = 10,000ms                          ← system is on fire

At Walmart's scale — 260 million customers a week — 1% is 2.6 million people experiencing that hang. Average would never surface it.
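
The same arithmetic in code (nearest-rank index over a sorted list; real systems use histograms, as noted later in this section):

```python
latencies_ms = [50] * 99 + [10_000]   # 99 fast requests, one 10-second hang

average = sum(latencies_ms) / len(latencies_ms)
ordered = sorted(latencies_ms)
p99 = ordered[(99 * len(ordered)) // 100]   # the slowest 1% starts here

print(average)  # 149.5 — looks healthy
print(p99)      # 10000 — system is on fire
```
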

Why p99 specifically:

  • P50: catches half your users. Too lenient — misses most of the tail.
  • P95: catches most issues. Can miss slow-building degradation.
  • P99: tail latency, the industry standard. Sensitive without being noisy.
  • P100: the single worst request. Always an outlier — useless for decisions.

P99 is the SLA standard because it catches real problems early without false-alarming on one-off outliers.

P99 spike = early overload warning: When a system gets busy, it fails at the tail first — not uniformly.

  • Healthy: p50 40ms, p99 80ms
  • Getting busy: p50 45ms, p99 400ms
  • Overloaded: p50 80ms, p99 2,500ms
  • Crashed: timeouts across the board

By the time P50 degrades, you're already in serious trouble. P99 gives you the window to act — shed load, scale out, open a circuit — before the majority of users feel it.

Setting the threshold — from your SLA, not arbitrary: Set the shedding trigger at a fraction of your SLA, not after you've already breached it.

Inventory reservation SLA = 500ms → shed when p99 > 400ms  (80% of SLA)
Homepage SLA = 2,000ms           → shed when p99 > 1,600ms (80% of SLA)

The gap between threshold and SLA is your response window. Trigger too late and you're already breaking the promise.

How it's measured: Rolling window over the last N seconds (typically 10s), recomputed every 5 seconds. In practice, use Micrometer or Prometheus histograms rather than sorting raw request lists — same concept, far more efficient.
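
A toy version of that shed check (plain sorted-list percentile over a fixed-size window; the class name and the 400ms threshold are illustrative, and a production service would use histogram-backed percentiles instead):

```python
from collections import deque

class SheddingGate:
    """Rolling window of latencies; signals load-shedding when p99 crosses the threshold."""

    def __init__(self, threshold_ms: float, window_size: int = 1000):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window_size)   # only the most recent requests count

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def should_shed(self) -> bool:
        if len(self.samples) < 100:                # too few samples to judge the tail
            return False
        ordered = sorted(self.samples)
        p99 = ordered[(99 * len(ordered)) // 100]
        return p99 > self.threshold_ms

gate = SheddingGate(threshold_ms=400)              # 80% of a 500ms SLA, as above
for latency_ms in [30] * 990 + [900] * 10:         # tail starting to slip
    gate.record(latency_ms)
print(gate.should_shed())                          # True: shed before the SLA is breached
```
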

Latency vs throughput: Latency is how fast one request completes. Throughput is how many requests per second the system handles. Optimizing for one can hurt the other — batching increases throughput but adds per-request latency. Know which the interviewer cares about.

p99 matters more than average. If the interviewer says "users complain it feels slow", think tail latency, not mean. One unindexed query or missing cache can blow your p99 while your average looks fine.

Per-component latency costs: For exact numbers per hop (LB, Redis, DB, Kafka, S3, etc.) and worked end-to-end request path examples, see the Latency Reference Table in System Design Layers.

Most critical for: Search/autocomplete (Yelp, Google — < 100ms), stock trading (HFT — microseconds), multiplayer gaming, ride-matching (Uber — driver must get request fast).


Availability

How much downtime is acceptable? This drives redundancy, replication topology, and failover strategy across all layers.

Ask:

  • What's the uptime requirement?
  • What happens to users if this goes down?

What each SLA tier looks like across layers:

  • 99% (~3.6 days/yr, ~7.2 hrs/mo): Single region. Single DB. Basic health check restarts crashed app server.
  • 99.9% (~8.7 hrs/yr, ~43 min/mo): Multi-AZ LB. 2+ app server instances. DB with automatic failover replica (AWS RDS Multi-AZ ~60s switchover, GCP Cloud SQL HA ~60s switchover). Redis Sentinel for cache failover.
  • 99.99% (~52 min/yr, ~4.3 min/mo): Active-active across 2 regions. DB replication across regions with < 1min failover. CDN absorbs traffic if origin partially down. App servers auto-scale and auto-replace.
  • 99.999% (~5 min/yr, ~26 sec/mo): No single point of failure anywhere. LB: multiple active nodes. App: blue/green deploys with instant rollback. DB: synchronous multi-region replication (AWS Aurora Global, GCP Spanner, or CockroachDB). Cache: Redis Cluster across AZs. Queue: Kafka with replication factor 3.

How each layer achieves redundancy:

  • Load Balancer: Health checks drop unhealthy app servers from rotation in seconds. Multi-AZ deployment so one AZ outage doesn't take the LB down.
  • App Servers: Stateless (no local state) so any instance can handle any request. Auto-scaling group replaces failed instances automatically.
  • Cache (Redis): Redis Sentinel (monitors, auto-promotes replica on primary failure). Redis Cluster (shards + replicas, handles node loss).
  • Database: Primary + read replica → automatic failover on primary crash. Multi-region → async or sync replication depending on consistency needs.
  • Message Queue: Kafka replication factor ≥ 3 means 2 broker deaths don't lose messages.
  • CDN: CDN providers are globally redundant by design (AWS CloudFront, GCP Cloud CDN, Cloudflare).

CAP Theorem tradeoff: During a network partition, choose availability (keep serving, possibly stale) or consistency (stop serving until consistent). Most consumer apps choose availability. Payment systems choose consistency.

Graceful degradation: Netflix shows cached thumbnails when recommendations service is down. Don't fail completely — fail partially.

SLI / SLO / SLA — know the difference:

  • SLI (Service Level Indicator) — the actual measured metric. E.g., "97.8% of requests completed in < 200ms this week."
  • SLO (Service Level Objective) — your internal target. E.g., "99.9% of requests must complete in < 200ms." This is what your team is held to.
  • SLA (Service Level Agreement) — the contractual commitment to customers, with financial penalties for breach. Always looser than your SLO (you'd go out of business otherwise).

Error budget: 1 − SLO. A 99.9% SLO gives you ~43 min/month to spend on incidents and deploys. When the budget is gone, freeze non-critical changes until the window resets.
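
The budget arithmetic, for reference:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) in the window: (1 - SLO) × window length."""
    return (1 - slo) * window_days * 24 * 60

print(round(error_budget_minutes(0.999)))   # 43 min/month for a 99.9% SLO
print(round(error_budget_minutes(0.9999)))  # 4 min/month for 99.99%
```
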

Each extra 9 is roughly 10× harder and more expensive. Push back if the requirement seems over-engineered.

Most critical for: Payment processors (Stripe, Visa — 99.999%), AWS infrastructure, healthcare systems, any system where downtime = revenue loss or safety risk.


Consistency

When a write happens, when do all nodes and users see it? This is the core CAP tradeoff in practice.

Ask:

  • Can users see slightly stale data?
  • If two users write at the same time, does it matter which one wins?

  • Strong (Linearizable): All reads see the latest write immediately across all nodes; each operation appears to take effect at a single instant in time. Use for payments, inventory, bank balances. (PostgreSQL, Zookeeper, Spanner)
  • Read-your-writes: You always see your own latest write; other users may lag briefly. Use for profile updates, settings. (Most social apps for own data)
  • Eventual: Writes propagate across nodes eventually; briefly stale reads are OK. Use for social feeds, like counts, view counts. (Cassandra, DynamoDB default, DNS)
  • Causal: Causally related writes are seen in order by all nodes; unrelated writes can appear in any order. Use for comments/replies, collaborative editing, chat. (MongoDB sessions, DynamoDB transactions)

ACID vs BASE: Relational databases give you ACID (Atomic, Consistent, Isolated, Durable) — all or nothing, always correct. Most NoSQL databases give you BASE (Basically Available, Soft state, Eventually consistent) — always up, eventually right. Choosing a DB is often choosing between these two philosophies.

Idempotency: A write operation is idempotent if calling it multiple times produces the same result as calling it once. Critical when clients retry on network failure — without it, a retried payment charges the user twice.

Causal consistency explained: If Alice posts "I'm going to the store" and Bob replies "I'll come with you", causal consistency guarantees Carol always sees Alice's post before Bob's reply — because Bob's reply causally depends on Alice's post. But Carol might see Alice's post before or after Dave's unrelated status update — that's fine, they're not causally linked.

This is stronger than eventual (which could show Bob's reply before Alice's post) but weaker than strong (which globally orders every single write). It's the right choice when order matters within a thread or conversation, but not globally.

Interview signal: "It's fine if the like count is off by a few seconds" → eventual consistency, scale horizontally. "Double-charging a user is unacceptable" → strong consistency, accept the latency cost.

Most critical for: Banking and payments (double-spend prevention), inventory systems (Amazon — can't oversell), booking systems (airline seats, hotel rooms).


Idempotency

If a client retries a request, will it cause duplicate side effects? This shapes how you design APIs and payment flows.

Ask:

  • Can clients retry failed requests safely?
  • Are there operations where duplicates are catastrophic (charges, transfers, order submissions)?

The problem: A client sends a payment request. The server processes it, but the response is lost in transit. The client retries. Without idempotency, the user gets charged twice.

The solution — idempotency keys: Client generates a unique key per logical operation (e.g., UUID) and sends it with the request. Server stores (idempotency_key → result) in Redis or DB. On duplicate request: return the stored result, skip re-execution.
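
A minimal sketch of that flow, with an in-memory dict standing in for Redis/DB and a hypothetical `charge_card` handler:

```python
results = {}   # idempotency_key -> stored response (Redis or DB in production)
charges = 0    # counts real side effects, to show the retry doesn't re-charge

def charge_card(amount_cents: int) -> dict:
    """Hypothetical side-effecting operation: actually charges the card."""
    global charges
    charges += 1
    return {"status": "charged", "amount_cents": amount_cents}

def handle_payment(idempotency_key: str, amount_cents: int) -> dict:
    if idempotency_key in results:        # duplicate request: replay the stored result
        return results[idempotency_key]
    result = charge_card(amount_cents)
    results[idempotency_key] = result     # store atomically with the charge in production
    return result

first = handle_payment("key-123", 500)
retry = handle_payment("key-123", 500)    # client retried after a lost response
print(first == retry, charges)            # same result, charged exactly once
```
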

  • GET, DELETE: naturally idempotent (GET reads; DELETE on a missing resource is a no-op). Nothing needed.
  • PUT (replace entire resource): naturally idempotent. Nothing needed.
  • POST (create, charge, transfer): not idempotent. Add an idempotency key.
  • Message consumer processing: not idempotent. Track processed message IDs in the DB.

Idempotency in queues: A Kafka consumer that crashes mid-processing will re-receive the same message. Design consumers to be idempotent — check if the event was already processed (by storing the event ID) before acting on it.
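
The same guard sketched for a consumer (an in-memory set standing in for a persisted table; in production the ID check and the side effect should commit in one transaction):

```python
processed = set()   # processed event IDs; a DB table in production
ledger = []         # the side effect we must not duplicate

def handle_event(event_id: str, payload: str) -> None:
    if event_id in processed:   # redelivered after a crash: skip
        return
    ledger.append(payload)      # apply the side effect
    processed.add(event_id)     # commit with the side effect, ideally atomically

for event_id, payload in [("e1", "debit 5"), ("e2", "debit 9"), ("e1", "debit 5")]:
    handle_event(event_id, payload)

print(ledger)   # e1 applied once despite redelivery
```
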

Most critical for: Payment APIs (Stripe uses idempotency keys on every charge endpoint), order submission, booking systems, any POST that creates or transfers.


Durability

How much data loss is acceptable if the system crashes or a node goes down?

Ask:

  • If we lose a server right now, what's the worst acceptable outcome?
  • Can we replay events from a log?

RPO = Recovery Point Objective — how much data can we lose? Measured in time: an RPO of 0, 1s, or 1hr means losing at most that much recent data. RTO = Recovery Time Objective — how long can the system be down during recovery?

  • RPO = 0: zero data loss. Synchronous replication — primary waits for all replicas to confirm before acking the write. Latency cost: +5–20ms per write at the DB layer, +10–50ms if two-phase commit spans services.
  • RPO = seconds: tiny loss OK. Async replication; WAL (write-ahead log) shipped to a replica continuously. No extra latency — the write acks immediately, replication happens in the background.
  • RPO = hours: some loss tolerable. Periodic snapshots or nightly backups. No latency impact.
  • RTO = seconds: must recover near-instantly. Hot standby replica already running, promoted automatically on failure (~30–60s). No latency impact on the normal path.
  • RTO = minutes: fast recovery needed. Warm standby — replica exists but isn't serving traffic; promoted manually or semi-automatically.
  • RTO = hours: slower recovery OK. Restore from backup; spin up a new instance.

RPO=0 adds latency at every layer that writes:

  • DB: Synchronous replication adds 5–20ms (waiting for replica in another AZ to confirm).
  • App layer: If using distributed transactions (two-phase commit), add 10–50ms.
  • Queue: Kafka with acks=all (wait for all in-sync replicas) adds 2–5ms vs acks=1.

This is the direct tradeoff: stronger durability = higher write latency.

Real examples:

  • Banking: RPO = 0. Every transaction written synchronously to multiple replicas before confirmation.
  • Social media posts: RPO = seconds is fine. Async replication acceptable.
  • Object storage: 11 nines of durability via cross-AZ redundant storage (AWS S3, GCP Cloud Storage).

Most critical for: Banking and financial systems, medical records (Epic, FHIR), legal document storage, payment transaction logs — any system where lost data = legal or financial liability.


Fault Tolerance

How well does the system handle partial failures without going fully down?

Ask:

  • What happens when one server crashes?
  • What happens when a whole datacenter goes down?
  • What if a dependency is slow or unavailable?

  • Single node crash: redundant replicas, auto-failover. (DB primary/replica, load balancer health checks)
  • Slow dependency: timeouts + circuit breaker. (Stop calling a failing service; return a fallback)
  • Datacenter outage: multi-AZ or multi-region active-active. (Route traffic to the surviving region)
  • Data corruption: checksums, write-ahead logs, point-in-time restore. (Detect and roll back bad writes)
  • Cascading failures: bulkheads (isolate failure domains), rate limiting. (Don't let one slow service take down everything)

Circuit breaker pattern: If a downstream service fails N times in a row, stop calling it for a period. Return a cached/default response. Let the dependency recover before retrying.
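
A compact sketch of the pattern (illustrative class, not a production breaker; libraries like resilience4j implement this with a proper half-open state):

```python
import time

class CircuitBreaker:
    """Stop calling a dependency after max_failures consecutive errors; retry after cooldown."""

    def __init__(self, max_failures: int = 5, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()            # open: fail fast with the fallback
            self.opened_at = None            # cooldown over: let one call probe the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback()
        self.failures = 0                    # success resets the streak
        return result
```

While open, callers get the fallback immediately instead of waiting on timeouts, which gives the dependency room to recover.
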

Retry with exponential backoff + jitter: When a request fails, wait before retrying — and double the wait each attempt (backoff). Add random jitter so all retrying clients don't slam the service at the same moment (thundering herd). A common sequence: retry after 1s, 2s, 4s, 8s with ±30% jitter, then give up and dead-letter.
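
That schedule, sketched (the 1s base, 4 attempts, and ±30% jitter mirror the sequence above):

```python
import random

def backoff_schedule(base_s: float = 1.0, attempts: int = 4, jitter: float = 0.3):
    """Delays of 1s, 2s, 4s, 8s, each randomized ±30% to avoid thundering herd."""
    return [base_s * (2 ** i) * random.uniform(1 - jitter, 1 + jitter)
            for i in range(attempts)]

delays = backoff_schedule()
print([round(d, 2) for d in delays])   # roughly [1, 2, 4, 8], each perturbed
```

After the last attempt fails, the message goes to a dead-letter queue rather than retrying forever.
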

Dead Letter Queue (DLQ): When a message fails processing repeatedly (after N retries), route it to a DLQ instead of blocking the queue. The DLQ holds poisoned messages for inspection and manual replay. Without a DLQ, one bad message can stall an entire consumer group indefinitely. (AWS SQS dead-letter queues, GCP Pub/Sub dead-letter topics, or a separate Kafka topic).

Most critical for: Microservices architectures (each service can fail independently), distributed databases, any system with SLA > 99.9%.


Security

What data does the system handle and who should access it? Drives auth, encryption, and regulatory design.

Ask:

  • Does this handle PII, payments, or health data?
  • Who are the users — public, internal, B2B?

Key terms:

  • PII (Personally Identifiable Information) — any data that can identify a person: name, email, phone, SSN, IP address. Triggers GDPR/HIPAA obligations.
  • TLS (Transport Layer Security) — encrypts data in transit (the "S" in HTTPS). Prevents interception.
  • AES-256 (Advanced Encryption Standard) — standard algorithm for encrypting data at rest. Used in S3, databases, filesystems.
  • JWT (JSON Web Token) — signed token the client sends with each request to prove identity. Stateless, server doesn't store sessions.
  • OAuth2 — standard for delegated auth. "Sign in with Google" is OAuth2. Separates identity from your app.
  • mTLS (Mutual TLS) — both sides verify certificates. Used for service-to-service auth inside your system.
  • RBAC (Role-Based Access Control) — users get roles (admin, editor, viewer), roles get permissions. Simpler than per-user rules.
  • ACL (Access Control List) — per-resource list of who can do what. More granular than RBAC (e.g. S3 bucket policies).

Security by layer:

  • CDN: DDoS protection; WAF (Web Application Firewall) blocks malicious requests before they reach origin.
  • Load Balancer: TLS termination (decrypt HTTPS here, forward HTTP internally); IP whitelisting.
  • API Gateway: Authentication (verify JWT/OAuth token), rate limiting (token bucket), request validation.
  • App Server: Authorization (RBAC checks — "can this user do this action?"), input validation, business logic security.
  • Cache (Redis): Don't cache raw PII if avoidable. Redis AUTH password. Encrypt sensitive values if stored.
  • Database: AES-256 encryption at rest. Row-level security for multi-tenant data. Least-privilege DB users. Audit log here — an append-only table logging who accessed what and when.
  • Object Storage: Signed URLs for private files (time-limited access). Bucket policies. Server-side encryption.

Most critical for: Healthcare (HIPAA — audit every access to patient records), financial systems (PCI-DSS — card data tokenized immediately), auth systems (OAuth provider), any multi-tenant SaaS.


Compliance

Are there legal or regulatory constraints that shape the architecture?

Ask:

  • What region are users in?
  • Does this handle health, financial, or personal data?

Key terms:

  • GDPR (General Data Protection Regulation) — EU law. Applies to any system with EU users, regardless of where the company is located.
  • HIPAA (Health Insurance Portability and Accountability Act) — US law governing health data. Applies to any app handling patient records.
  • PCI-DSS (Payment Card Industry Data Security Standard) — required for any system that stores, processes, or transmits card data.
  • SOC 2 — US auditing standard for SaaS companies. Type I = point-in-time assessment. Type II = 6 months of continuous evidence. Required by enterprise buyers.

  • GDPR (EU): any system with EU users. Data residency in the EU. Right to delete (complicates append-only logs). Breach notification within 72 hours.
  • HIPAA (US healthcare): medical records, health apps. Audit-log every data access. Encryption in transit and at rest. Business associate agreements with vendors.
  • PCI-DSS (payments): any system touching card data. Card data never stored raw — tokenize immediately on receipt. Annual third-party audits. Network segmentation.
  • SOC 2: B2B SaaS. Documented security controls. Access reviews. Incident response plan.

GDPR complicates event-sourcing: Append-only logs make "right to delete" hard — you can't erase a past event. Solve with tombstone records or keep PII in a separate deletable store and only store user IDs in the event log.

Most critical for: Healthcare apps, payment processors, social platforms with EU users, any enterprise B2B SaaS sold to regulated industries.


Monitoring & Observability

How do you know the system is healthy in production? Drives logging, metrics, and alerting design.

Ask:

  • Do you need real-time alerting?
  • How quickly must the team detect and diagnose production issues?

  • Metrics: QPS, latency, error rate, CPU/memory/disk. (Prometheus, Datadog, AWS CloudWatch, GCP Cloud Monitoring)
  • Logs: what happened and in what order. (ELK stack, Splunk, AWS CloudWatch Logs, GCP Cloud Logging)
  • Traces: where time was spent across services. (Jaeger, Zipkin, AWS X-Ray, GCP Cloud Trace)
  • Alerts: notify when the SLA is breached. (PagerDuty, Opsgenie)

The four golden signals (Google SRE): Latency, Traffic, Errors, Saturation. Build monitoring around these first.

  • Latency — how long requests take (track p99, not average)
  • Traffic — how much load the system is under (QPS, requests/sec)
  • Errors — rate of failed requests (5xx errors, timeouts, exceptions)
  • Saturation — how "full" a resource is. A CPU at 95% is saturated. A disk at 98% capacity is saturated. Saturation predicts future failure — a resource approaching 100% will soon become a bottleneck and cause latency spikes or crashes. Monitor: CPU %, memory %, disk I/O utilization, DB connection pool usage, queue depth.

Most critical for: Any system with a strict SLA, microservices (failures are hard to trace), Netflix-style chaos engineering, financial systems where bugs cost real money.


Environment Constraints

Are there non-standard constraints on the environment the system runs in?

Ask:

  • Are clients on mobile or constrained devices?
  • Are there low-bandwidth or offline scenarios to handle?

  • Mobile clients: minimize payload size, compress responses, offline-first with a local cache.
  • Low bandwidth (3G/rural): adaptive bitrate streaming (YouTube, Netflix), delta sync instead of full sync.
  • Limited battery: batch network calls; avoid polling — use push (WebSockets, FCM).
  • Edge/IoT devices: lightweight protocols (MQTT), local processing before cloud sync.
  • Offline-first: local DB (SQLite), sync on reconnect, a conflict resolution strategy.

Most critical for: Uber driver app (poor network in some cities), Google Maps offline, WhatsApp (works on 2G), IoT sensor pipelines, healthcare apps in hospitals with restricted networks.


Quick Reference — Which NFR Matters Most

  • Banking / payments: Consistency, Idempotency, Durability, Security, Compliance
  • Social feed (Twitter, Instagram): Scale, Availability, Latency
  • Healthcare records: Durability, Security, Compliance, Availability
  • Search / autocomplete (Yelp, Google): Latency, Scale
  • Ride-sharing (Uber): Availability, Latency, Fault Tolerance, Environment
  • Video streaming (Netflix, YouTube): Scale, Availability, Latency, Environment
  • Chat / messaging (WhatsApp): Availability, Durability, Environment
  • Ticketing / booking (Airbnb, airlines): Consistency, Availability, Scale
  • Enterprise SaaS: Security, Compliance, Availability
  • IoT / sensor pipeline: Scale, Fault Tolerance, Environment, Durability