System Design Interview Roadmap

Sarah Architect
Nov 8, 2023
20 min read

System Design Interview Roadmap for Top Tech Companies

For software engineering roles at the Mid-Level (L4), Senior (L5), and Staff (L6+) levels, the System Design Interview (SDI) is the single most critical factor in determining not just if you get the job, but what level and how much equity you are offered.

Unlike Data Structures and Algorithms (DSA) interviews, which have definitive right and wrong answers, System Design interviews are famously ambiguous. You are given a blank whiteboard and an incredibly vague prompt—like "Design Twitter" or "Design a Rate Limiter"—and you have 45 minutes to design a highly scalable, fault-tolerant, and performant architecture.

Because of this ambiguity, many candidates struggle to prepare. Reading random engineering blogs is not enough. You need a structured, step-by-step roadmap to master the building blocks of distributed systems.

This massive 2,500+ word guide is that roadmap. We will cover the prerequisites you must study, the standard 45-minute interview framework, and deep dives into databases, caching, message queues, and consistency trade-offs.


Phase 1: Prerequisites & Building Blocks

Before you even attempt to design a full system, you must have a rock-solid understanding of the individual components that make up a distributed architecture.

1. Network Protocols and the Internet

You need to understand how computers talk to each other.

  • TCP/IP: Understand the 3-way handshake. Why is TCP reliable but slower than UDP?
  • UDP: Why do live video calls and real-time gaming services prefer UDP over TCP? (Hint: a late packet is worthless, so retransmitting it only adds delay.)
  • HTTP/REST: Understand standard verbs (GET, POST, PUT, DELETE) and status codes.
  • WebSockets: How do you maintain a persistent connection for a real-time chat application?
  • gRPC: Why are internal microservices increasingly using gRPC over REST?

2. Load Balancing

As traffic increases, a single server eventually becomes both a bottleneck and a single point of failure. You must scale horizontally (adding more servers). To route traffic across these servers, you need a Load Balancer.

  • Layer 4 vs. Layer 7 Load Balancing: Layer 4 operates at the transport layer (TCP/UDP) and is extremely fast. Layer 7 operates at the application layer (HTTP) and can inspect headers to make smarter routing decisions.
  • Algorithms: Round-robin, least connections, and IP hashing.
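
The three algorithms above can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer implements them; the class names, the `app-N` server names, and the use of SHA-256 for IP hashing are all assumptions made for the example.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed, rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Pick the server that currently has the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # Ties go to the server that was registered first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call this when a connection to the server closes.
        self.active[server] -= 1


def ip_hash(client_ip, servers):
    """Send a given client IP to the same server on every request."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["app-1", "app-2", "app-3"]  # hypothetical server names
rr = RoundRobin(servers)
rr_picks = [rr.pick() for _ in range(4)]
print(rr_picks)
print(ip_hash("203.0.113.7", servers))
```

Note that the simple modulo in `ip_hash` remaps most clients when the server list changes, which is one reason consistent hashing comes up later in interviews.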

3. Databases (SQL vs. NoSQL)

This is often the most consequential decision in a system design. A database that does not match your access patterns is painful to replace in production and hard to defend in an interview.

  • SQL (Relational): PostgreSQL, MySQL. Use when data is highly structured, and you need ACID guarantees (Atomicity, Consistency, Isolation, Durability). Ideal for financial transactions.
  • NoSQL (Non-Relational):
    • Key-Value Stores (Redis, DynamoDB): Lightning-fast reads/writes. Great for session data or caching.
    • Document Stores (MongoDB): Flexible schema. Great for catalogs or user profiles.
    • Wide-Column Stores (Cassandra): Incredible write performance and horizontal scalability. Perfect for time-series data or heavy logging (e.g., "Design a Metrics System").
    • Graph Databases (Neo4j): Designed for navigating relationships. Ideal for "Design Facebook's Social Graph".

4. Caching Strategies

Caching prevents your database from melting down under heavy read loads.

  • Read-Through vs. Write-Through vs. Write-Behind: Understand the trade-offs of when to update the cache versus the database.
  • Eviction Policies: LRU (Least Recently Used) is the most common algorithm for evicting old data when the cache is full.
  • Tools: Redis (executes commands on a single thread, supports rich data structures) vs. Memcached (multi-threaded, simpler key-value model).
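
To make LRU eviction concrete, here is a small in-process sketch built on Python's `OrderedDict`. It only illustrates the policy; it is not how Redis or Memcached implement eviction, and the `LRUCache` class is a name chosen for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        # A read counts as a use, so move the key to the "newest" end.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Drop the entry at the "oldest" end.
            self._items.popitem(last=False)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used key
cache.put("c", 3)  # cache is full, so "b" is evicted
```

Both `get` and `put` run in constant time here, which is the property interviewers usually expect you to mention.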

5. Asynchronous Processing & Message Queues

When a user uploads a video to YouTube, they do not wait 15 minutes for the HTTP request to finish processing the video. The request is placed in a message queue, and background workers handle it asynchronously.

  • Tools: Apache Kafka, RabbitMQ, Amazon SQS.
  • Concepts: Pub/Sub models, consumer groups, and "At-Least-Once" vs. "Exactly-Once" delivery semantics.
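
At-least-once delivery means a message can arrive more than once, so consumers are usually written to be idempotent. The sketch below uses a toy in-memory queue to show the idea; `AtLeastOnceQueue` and `IdempotentWorker` are hypothetical names, and real brokers such as Kafka, RabbitMQ, or SQS have their own acknowledgement APIs that are not modeled here.

```python
from collections import deque


class AtLeastOnceQueue:
    """Toy queue: a message stays queued until the consumer acks it."""

    def __init__(self):
        self._pending = deque()

    def publish(self, msg_id, payload):
        self._pending.append((msg_id, payload))

    def receive(self):
        # Peek instead of pop: an unacked message is delivered again.
        return self._pending[0] if self._pending else None

    def ack(self, msg_id):
        if self._pending and self._pending[0][0] == msg_id:
            self._pending.popleft()


class IdempotentWorker:
    """Remembers message IDs so that a redelivery is a no-op."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def handle(self, msg_id, payload):
        if msg_id in self.seen:
            return  # duplicate delivery, already handled
        self.seen.add(msg_id)
        self.processed.append(payload)


queue = AtLeastOnceQueue()
worker = IdempotentWorker()
queue.publish("m1", "transcode video 42")

# First delivery: the worker does the job but "crashes" before acking.
msg_id, payload = queue.receive()
worker.handle(msg_id, payload)

# The queue redelivers the same message; this time the worker acks it.
msg_id, payload = queue.receive()
worker.handle(msg_id, payload)
queue.ack(msg_id)

print(worker.processed)
```

The job runs once even though the message was delivered twice, which is how many systems get effectively-once behavior on top of at-least-once delivery.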

Phase 2: Understanding Trade-Offs (The "Why")

In a senior interview, the interviewer does not just want to see a box labeled "Database." They will ask you, "Why that specific database? What happens when a node fails?"

The CAP Theorem

You cannot master system design without understanding the CAP Theorem. It states that a distributed data store can only provide two of the following three guarantees simultaneously:

  1. Consistency (C): Every read receives the most recent write or an error.
  2. Availability (A): Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
  3. Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped by the network.

Because network partitions (P) are unavoidable in distributed systems, you must choose between CP (Consistency and Partition Tolerance) or AP (Availability and Partition Tolerance).

  • Banking system? Choose CP. You would rather reject a transaction than show a false balance.
  • Social media feed? Choose AP. If a user sees a post 5 seconds late, it doesn't matter, as long as the app loads instantly.

Consistency Models

If you choose Availability (AP), your system will likely implement Eventual Consistency. This means that if no new updates are made, eventually all accesses will return the last updated value.

  • Strong Consistency: Slower, because nodes must coordinate (for example, through quorums or consensus) before a write is acknowledged.
  • Eventual Consistency: Faster, scales globally, but clients might read stale data.
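
The stale-read behavior of eventual consistency is easy to see in a toy simulation. The sketch below assumes a single primary that replicates to followers asynchronously with last-writer-wins versioning; that is only one of several ways to build an eventually consistent store, and every name in it is made up for the example.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0

    def apply(self, value, version):
        # Last-writer-wins: keep only the newest version seen so far.
        if version > self.version:
            self.value, self.version = value, version


class EventuallyConsistentStore:
    """One primary (replica 0) that replicates to followers later."""

    def __init__(self, replica_count):
        self.replicas = [Replica() for _ in range(replica_count)]
        self._backlog = []  # replication messages not yet delivered
        self._version = 0

    def write(self, value):
        # Acknowledge as soon as the primary has the write.
        self._version += 1
        self.replicas[0].apply(value, self._version)
        for follower in self.replicas[1:]:
            self._backlog.append((follower, value, self._version))

    def read(self, replica_index):
        return self.replicas[replica_index].value

    def replicate(self):
        # Stand-in for the network eventually delivering the backlog.
        for follower, value, version in self._backlog:
            follower.apply(value, version)
        self._backlog.clear()


store = EventuallyConsistentStore(replica_count=3)
store.write("post-v1")
stale = store.read(2)  # follower has not received the write yet
store.replicate()
fresh = store.read(2)  # all replicas have now converged
print(stale, fresh)
```

A strongly consistent variant would refuse to acknowledge the write until enough replicas had applied it, trading latency for fresh reads.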

Phase 3: The 45-Minute Interview Framework

You must structure your interview. If you start drawing boxes immediately, you will fail. Follow this framework religiously.

Step 1: Requirements Clarification (5 minutes)

Never assume anything. Prompt: "Design Instagram."

  • Functional Requirements: Can users post videos or just photos? Is there a newsfeed? Can users search for tags?
  • Non-Functional Requirements: How many daily active users (DAU)? Is it read-heavy or write-heavy? (Instagram is extremely read-heavy). What is the expected latency?

Step 2: Back-of-the-Envelope Estimation (5 minutes)

Prove you understand the scale.

  • Traffic: 100 million DAU. If each user makes 10 reads per day, that is 1 billion reads/day. A day has 86,400 seconds; rounding up to 100,000 for mental math gives roughly 10,000 queries per second (QPS) on average (about 11,600 with the exact figure).
  • Storage: If each user uploads 1 photo a day (1 MB), that is 100 TB per day. Over 10 years, that is roughly 365 PB, far more than a single SQL database can hold.
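
The same arithmetic, written out so each assumption is visible. The inputs are the illustrative figures from the bullets above, not real Instagram numbers, and 1 MB is treated as 10^6 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_active_users = 100_000_000
reads_per_user_per_day = 10
photos_per_user_per_day = 1
photo_size_bytes = 1_000_000  # 1 MB, decimal units
retention_years = 10

# Traffic: average reads per second across a whole day.
reads_per_day = daily_active_users * reads_per_user_per_day
average_read_qps = reads_per_day / SECONDS_PER_DAY

# Storage: new bytes per day, then the total over the retention period.
bytes_per_day = daily_active_users * photos_per_user_per_day * photo_size_bytes
storage_per_day_tb = bytes_per_day / 1e12
storage_total_pb = storage_per_day_tb * 365 * retention_years / 1000

print(f"{average_read_qps:,.0f} average read QPS")
print(f"{storage_per_day_tb:,.0f} TB of new photos per day")
print(f"{storage_total_pb:,.0f} PB over {retention_years} years")
```

In the interview itself you would round these in your head; the point is to reach the right order of magnitude quickly.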

Step 3: High-Level Design (10-15 minutes)

Draw the core architecture. Keep it simple initially.

  • Client -> Load Balancer -> Web Servers.
  • Web Servers talk to a Cache (Redis) and a Database (PostgreSQL/Cassandra).
  • Media files go to Object Storage (Amazon S3), and a CDN (Cloudflare) caches those images globally.
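
One common way for the web servers to use the cache and database together is the cache-aside pattern: check the cache first, fall back to the database on a miss, and invalidate on writes. The sketch below assumes that pattern and uses plain dictionaries as stand-ins for Redis and the database; `PhotoMetadataService` is a hypothetical name.

```python
class PhotoMetadataService:
    """Cache-aside reads with invalidate-on-write."""

    def __init__(self, database):
        self.database = database  # stand-in for PostgreSQL/Cassandra
        self.cache = {}           # stand-in for Redis
        self.db_reads = 0         # counts how often we hit the database

    def get_photo(self, photo_id):
        if photo_id in self.cache:
            return self.cache[photo_id]  # cache hit
        self.db_reads += 1               # cache miss: go to the database
        record = self.database.get(photo_id)
        if record is not None:
            self.cache[photo_id] = record
        return record

    def update_caption(self, photo_id, caption):
        # Write to the database, then drop the stale cache entry.
        self.database[photo_id] = {**self.database[photo_id], "caption": caption}
        self.cache.pop(photo_id, None)


database = {"p1": {"owner": "alice", "caption": "sunset"}}
service = PhotoMetadataService(database)
first = service.get_photo("p1")   # miss: read from database, fill cache
second = service.get_photo("p1")  # hit: served from cache
reads_before_update = service.db_reads
service.update_caption("p1", "sunrise")
third = service.get_photo("p1")   # miss again after invalidation
print(reads_before_update, third["caption"])
```

A production version would also set a TTL on cache entries so that a missed invalidation cannot serve stale data forever.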

Step 4: Deep Dive & Bottlenecks (15-20 minutes)

This is where you earn your Senior title. The interviewer will point to your database and say, "10,000 writes per second will crash this database. How do you fix it?"

  • Database Sharding: Explain how you will partition the data across multiple database nodes. Will you shard by User_ID or Photo_ID? What happens to celebrities (Justin Bieber) who get millions of hits on a single photo? You must discuss "Hot Key" issues.
  • Caching Strategies: Explain how you will cache the Newsfeed.
  • Message Queues: Explain how uploading a photo triggers a Kafka event, which background workers consume to generate thumbnails and update follower feeds.
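
The hot key problem is easy to demonstrate with hash-based sharding. The sketch below assumes four shards and a SHA-256 hash of the key; both are arbitrary choices for illustration. Ordinary traffic spreads across shards, but every request for one celebrity photo lands on the same shard.

```python
import hashlib
from collections import Counter

SHARD_COUNT = 4  # hypothetical cluster size


def shard_for(key):
    """Map a key (for example a User_ID or Photo_ID) to a shard."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


def load_per_shard(keys):
    """Count how many requests each shard would receive."""
    return Counter(shard_for(key) for key in keys)


# 10,000 requests for 10,000 different photos spread across shards.
normal_traffic = [f"photo-{i}" for i in range(10_000)]
spread = load_per_shard(normal_traffic)

# 10,000 requests for one viral photo all hit a single shard.
celebrity_traffic = ["photo-celebrity"] * 10_000
hot = load_per_shard(celebrity_traffic)

print("normal:", dict(spread))
print("hot key:", dict(hot))
```

Typical mitigations to discuss are caching the hot item in front of the database and replicating it under several keys so reads can be spread out; neither is shown here.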

Phase 4: Common System Design Interview Questions to Practice

You should thoroughly practice designing the following systems. They cover almost every pattern you will see in a real interview:

  1. Design a URL Shortener (bit.ly): Teaches you base62 encoding, massive read-heavy scaling, and basic database sharding.
  2. Design Twitter / Instagram: Teaches you the fan-out architecture for newsfeeds (Push vs. Pull models for celebrity accounts).
  3. Design a Chat App (WhatsApp): Teaches you WebSockets, message sequencing, and presence servers (how to show if a user is "Online").
  4. Design a Rate Limiter: Teaches you low-latency distributed algorithms (Token Bucket, Leaky Bucket, Sliding Window) and Redis Lua scripts.
  5. Design a Key-Value Store: A deep dive into database internals, Consistent Hashing, Merkle Trees for anti-entropy, and the Gossip Protocol.
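
Two of the techniques named above are small enough to sketch directly: base62 encoding for the URL shortener and a token bucket for the rate limiter. Both are single-process illustrations under assumed names and an assumed alphabet order; a distributed rate limiter would keep the bucket state in a shared store such as Redis instead of in memory.

```python
import string

# Assumed alphabet order: digits, then lowercase, then uppercase.
BASE62_ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(number):
    """Turn a non-negative integer ID into a short base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return BASE62_ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(chars))


def base62_decode(text):
    """Inverse of base62_encode."""
    number = 0
    for char in text:
        number = number * 62 + BASE62_ALPHABET.index(char)
    return number


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


short_code = base62_encode(125)
bucket = TokenBucket(capacity=2, refill_per_second=1)
# Three requests at t=0, then two at t=1 (timestamps in seconds).
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0, 1.0)]
print(short_code, decisions)
```

Passing the timestamp into `allow` keeps the example deterministic; in a real service you would read a monotonic clock instead.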

Conclusion & How to Practice

System Design is about communicating trade-offs. The interviewer wants to work with someone who can confidently say, "I am choosing Cassandra because our write throughput is too high for PostgreSQL, but I acknowledge this means we sacrifice strong consistency for our read operations."

Practice Strategy:

  1. Read: Designing Data-Intensive Applications by Martin Kleppmann. It is mandatory reading for Senior engineers.
  2. Watch: Conference talks (InfoQ) on how Netflix, Uber, and Meta scaled their specific microservices.
  3. Simulate: Do mock interviews. Explaining architecture out loud while drawing on a virtual whiteboard is a distinct skill. Use InterviPrep AI to simulate senior-level architectural deep-dives with AI interviewers who will challenge your design choices just like a real FAANG engineer would.