How do you prepare for an end-to-end ML system design interview?

Updated June 18, 2026 · 9 min read · Crack ML Interview

TL;DR

ML system design interviews differ from generic system design interviews by layering the full ML lifecycle on top of infrastructure: framing the problem as an ML task, defining online and offline metrics, designing the data and feature pipeline, choosing a model and training strategy, then serving and monitoring it. Use a seven-stage framework, clarify requirements and scale first, and always address training-serving skew, data leakage, and feedback loops, the failure modes interviewers probe most. Strong candidates connect the business objective to a proxy ML metric and explain how they would detect model degradation in production.

The Seven-Stage ML System Design Framework

Stages one to three: problem framing, metrics, and data

Begin by translating the product goal into a concrete ML task: is it classification, ranking, regression, or retrieval? Then define metrics on two axes. Offline metrics like AUC, NDCG, or RMSE measure model quality during development, while online metrics like click-through rate, watch time, or conversion measure real impact in an A/B test. Articulate the gap between them explicitly, since interviewers reward candidates who know that a model can improve offline AUC yet hurt the online metric. Next, design the data layer: what raw events you log, how you build a labeled training set, how you avoid label leakage, and how you handle delayed or implicit labels.
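To make the offline side of the metric discussion concrete, here is a minimal sketch of NDCG@k, one of the offline ranking metrics named above. The relevance labels are made up for illustration, and the function names are hypothetical rather than taken from any library.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (0-based rank)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance labels, listed in the order the model ranked the items.
model_order = [3, 2, 0, 1]
print(round(ndcg_at_k(model_order, k=4), 4))
```

A score of 1.0 means the model's ordering matches the ideal ordering. An offline number like this still says nothing about click-through rate or watch time, which is why the online metric has to be measured separately in an A/B test.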

Stages four to five: features and model

Design the feature pipeline with a clear split between batch features computed offline and real-time features computed at request time, served through a feature store that guarantees point-in-time correctness so training features match what will be available at serving. For the model, start simple: propose a strong baseline like gradient-boosted trees or logistic regression before reaching for deep models, and justify the choice by data volume, latency budget, and interpretability needs. Explain how you would split train, validation, and test sets respecting time ordering to avoid leakage, and how you would tune and select the final model.
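Point-in-time correctness and time-ordered splitting can be sketched in a few lines. This is an illustrative toy, not how any particular feature store implements it: timestamps are plain integers, and `point_in_time_value` and `time_ordered_split` are hypothetical helper names.

```python
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None

def time_ordered_split(examples, train_end, valid_end):
    """Split (timestamp, payload) examples into train/validation/test by time."""
    train = [e for e in examples if e[0] < train_end]
    valid = [e for e in examples if train_end <= e[0] < valid_end]
    test = [e for e in examples if e[0] >= valid_end]
    return train, valid, test

# Hypothetical history of a user's 7-day watch count at integer timestamps.
watch_count_history = [(10, 2), (20, 5), (30, 9)]
print(point_in_time_value(watch_count_history, as_of=25))  # prints 5
```

A training example labeled at time 25 sees the value 5, not the later value 9, which is what serving would have seen at that moment. Splitting by timestamp rather than at random keeps future examples out of the training set.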

Stages six to seven: serving and monitoring

For serving, specify whether predictions are precomputed in batch and cached, or computed online per request, and justify the choice against the latency SLA. Describe the model serving infrastructure, candidate generation versus ranking stages for large catalogs, and how you would handle the cold-start problem for new users or items. For monitoring, define what you alert on: prediction distribution drift, feature distribution drift, the online metric, and model staleness. Close the loop by explaining your retraining cadence and how you would run a safe rollout with shadow mode followed by a gradual A/B ramp.
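The gradual ramp mentioned above is often implemented with deterministic hash bucketing, so a user stays in the same arm as the percentage grows. The following is a minimal sketch under that assumption; the salt string and function names are hypothetical.

```python
import hashlib

def rollout_bucket(user_id, salt="ranker-v2-rollout"):
    """Map a user to a stable bucket in [0, 100) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def serves_new_model(user_id, ramp_percent, salt="ranker-v2-rollout"):
    """True when the user's bucket falls inside the current ramp percentage."""
    return rollout_bucket(user_id, salt) < ramp_percent

# Ramp from 5% to 25%: everyone exposed at 5% remains exposed at 25%.
users = [f"user-{i}" for i in range(1000)]
exposed_at_5 = [u for u in users if serves_new_model(u, 5)]
print(all(serves_new_model(u, 25) for u in exposed_at_5))
```

Changing the salt per experiment reshuffles users, which avoids the same people always landing in the early ramp. Shadow mode would sit before this step: the new model scores live traffic, but its predictions are logged rather than served.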

Worked Example: Design a Video Recommendation System

Frame, scope, and choose a two-stage architecture

Clarify scale first: hundreds of millions of users, a catalog of millions of videos, and a latency budget under 100 ms for the ranking call. Frame the task as ranking candidate videos by predicted watch time or engagement. Propose the standard two-stage architecture: a candidate generation stage that narrows millions of items to a few hundred using cheap retrieval, such as a two-tower embedding model with approximate nearest neighbor search, followed by a ranking stage that scores those few hundred with a richer model using more features. This two-stage pattern is the expected answer for most large-catalog recommendation questions and shows you understand the latency-quality tradeoff.
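The two-stage pattern can be sketched with toy data. In this illustration exact brute-force dot-product search stands in for an approximate nearest neighbor index, and the embeddings, the freshness feature, and the scoring formula are all invented for the example.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generate_candidates(user_vec, item_vecs, k):
    """Stage 1: cheap retrieval of the k items with the highest embedding dot product."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))

def rank(candidates, score_fn):
    """Stage 2: score only the shortlist with a richer, slower model."""
    return sorted(candidates, key=score_fn, reverse=True)

# Toy catalog: 2-d item embeddings plus one extra feature used only by the ranker.
item_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
item_freshness = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.4}
user_vec = [1.0, 0.2]

def ranker_score(item_id):
    # Stand-in for a heavy model: blends retrieval similarity with an extra feature.
    return 0.5 * dot(user_vec, item_vecs[item_id]) + 0.5 * item_freshness[item_id]

shortlist = generate_candidates(user_vec, item_vecs, k=3)
print(rank(shortlist, ranker_score))  # prints ['b', 'a', 'd']
```

Retrieval alone would put "a" first; the ranker promotes "b" because it can afford to look at more features on a short list. Item "c" never reaches the ranker, which is the latency-quality tradeoff in miniature.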

Address cold start, feedback loops, and position bias

For new users, handle cold start with popularity-based and demographic features until enough interaction history accumulates; for new items, rely on content features and exploration via controlled randomization. The most important advanced point is the feedback loop: the model is trained on data generated by the previous model, so it can amplify its own biases and create filter bubbles. Mitigate this with exploration, diversity-aware ranking, and unbiased evaluation. Position bias, where users click higher-ranked items regardless of relevance, must be corrected during training using techniques like inverse propensity weighting, which interviewers at recommendation-heavy companies specifically probe.
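Inverse propensity weighting can be illustrated with a small click log. This sketch assumes the per-position examination probabilities are already known (in practice they have to be estimated, for example from randomized traffic), and the weight clipping value is an arbitrary choice for the example.

```python
from collections import defaultdict

def debiased_click_scores(click_log, examine_prob, max_weight=10.0):
    """Sum inverse-propensity-weighted clicks per item.

    click_log: iterable of (item_id, position, clicked) with 0-based positions.
    examine_prob: examine_prob[p] is the assumed probability a user looks at position p.
    Weights are clipped at max_weight to limit variance from rarely seen positions.
    """
    scores = defaultdict(float)
    for item_id, position, clicked in click_log:
        if clicked:
            scores[item_id] += min(1.0 / examine_prob[position], max_weight)
    return dict(scores)

# Item "x" is clicked twice at the top slot; item "y" once at a rarely examined slot.
examine_prob = [1.0, 0.5, 0.25]
click_log = [("x", 0, True), ("x", 0, True), ("y", 2, True), ("y", 2, False)]
print(debiased_click_scores(click_log, examine_prob))  # prints {'x': 2.0, 'y': 4.0}
```

Raw click counts favor "x", but after reweighting "y" scores higher because its single click came from a position users seldom examine. The same weights can be applied as per-example weights in the training loss.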

The Three Failure Modes Interviewers Always Probe

Training-serving skew and data leakage

Training-serving skew occurs when features are computed differently in training than in production, for example when the offline pipeline aggregates over a full day of logs while the online service sees only a partial, lagging window, or when the two code paths handle missing values differently. Prevent it with a shared feature pipeline and a feature store enforcing point-in-time correctness. Data leakage is the silent killer of ML system design answers: any feature derived from information not available at prediction time, such as a value that is only knowable after the prediction is made, inflates offline metrics and collapses in production. Proactively naming and preventing both signals real production experience and is one of the fastest ways to move from a borderline to a strong rating.
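One common way to detect training-serving skew is to log the feature values actually used at serving time and compare them with what the offline pipeline recomputes for the same requests. The sketch below assumes numeric features keyed by request id; the feature names and the `skew_report` helper are hypothetical.

```python
def skew_report(served, recomputed, tolerance=1e-6):
    """Compare feature values logged at serving with offline recomputation.

    Both arguments map a request id to a dict of feature name -> numeric value.
    Returns, per feature, the fraction of shared requests whose values disagree.
    """
    mismatches, totals = {}, {}
    for request_id in served.keys() & recomputed.keys():
        for name, served_value in served[request_id].items():
            if name not in recomputed[request_id]:
                continue
            totals[name] = totals.get(name, 0) + 1
            if abs(served_value - recomputed[request_id][name]) > tolerance:
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: mismatches.get(name, 0) / totals[name] for name in totals}

served = {"r1": {"ctr_7d": 0.10, "age_days": 3}, "r2": {"ctr_7d": 0.20, "age_days": 8}}
recomputed = {"r1": {"ctr_7d": 0.10, "age_days": 3}, "r2": {"ctr_7d": 0.25, "age_days": 8}}
print(skew_report(served, recomputed))
```

Here `ctr_7d` disagrees on half of the requests while `age_days` matches everywhere, pointing at the aggregation logic for the first feature. A non-zero mismatch rate on any feature is a prompt to check whether the two code paths really share an implementation.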

Model degradation and the retraining loop

Models decay as the world shifts. Concept drift changes the relationship between features and labels; data drift changes the input distribution. Describe a monitoring system that tracks both, alerts when the online metric or prediction distribution moves, and triggers retraining on a schedule or based on drift detection. Discuss safe deployment with shadow evaluation, canary rollout, and automatic rollback. Candidates who treat the model as a one-time artifact rather than a continuously maintained system consistently lose points at senior levels.
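A drift monitor of the kind described above can be sketched with the population stability index (PSI), one of several possible drift statistics. The bin edges, the sample values, and the 0.2 retraining threshold (a widely quoted rule of thumb, not a universal constant) are assumptions for this example, and both samples are assumed to be non-empty.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population stability index between a baseline sample and a current sample."""
    def bin_fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # The bin index is the number of edges the value has reached or passed.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def should_retrain(psi_value, threshold=0.2):
    """Flag retraining when drift exceeds the chosen threshold."""
    return psi_value > threshold

# Model scores at launch versus scores observed this week (made-up values).
baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
current = [0.8, 0.9, 0.85, 0.95, 0.7, 0.9, 0.6, 0.99]
edges = [0.25, 0.5, 0.75]
print(should_retrain(psi(baseline, current, edges)))  # prints True
```

The same function works on a single input feature (data drift) or on the model's output scores (prediction drift). Concept drift usually needs fresh labels to confirm, so a PSI alert is best treated as a trigger for investigation or scheduled retraining rather than proof that accuracy has dropped.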

ML System Design vs. Generic System Design: What Changes

Stage	Generic System Design	ML System Design Addition	Interviewer Focus
Requirements	Scale, latency, availability	ML task framing, online vs offline metrics	Business-to-metric translation
Data	Storage and schema	Label generation, leakage prevention	Point-in-time correctness
Features	N/A	Batch vs real-time features, feature store	Training-serving skew
Model	N/A	Baseline first, model selection, train/val/test split	Justified simplicity
Serving	Stateless services, caching	Batch vs online prediction, two-stage ranking	Latency-quality tradeoff
Monitoring	Latency and error rate	Drift detection, retraining, A/B rollout	Model degradation handling

Who this is for

Backend engineer strong on infra but new to the ML lifecycle

Profile: Designs scalable distributed systems confidently, comfortable with queues, caches, and databases, but has never owned a model in production or built a training pipeline.

Pain points: Designs an excellent serving system but treats the model as a black box, omitting metrics definition, feature pipelines, leakage prevention, and retraining, which interviewers flag as missing the ML half of the question.

Strategy: Memorize the seven-stage framework and force yourself to spend the first half of every practice answer on framing, metrics, data, and features before touching infrastructure. Study one or two canonical designs end to end, such as recommendation and fraud detection, until the ML lifecycle stages become reflexive.

Applied scientist who models well but skips infrastructure

Profile: Builds and evaluates models day to day, fluent in metrics and feature engineering, but has limited exposure to serving infrastructure, two-stage architectures, and production rollout.

Pain points: Goes deep on model choice and offline metrics but gives thin answers on serving latency, candidate generation, monitoring, and safe deployment, leaving the systems half of the answer underdeveloped.

Strategy: Explicitly study the infrastructure stages: two-stage ranking, feature store design, online versus batch serving, and A/B rollout mechanics. Practice allocating time so serving and monitoring each get genuine coverage. Lead with modeling strength, then demonstrate you can productionize it end to end.

FAQ

Q: How is an ML system design interview different from a regular system design interview?

A: A regular system design interview tests distributed systems: scaling, storage, caching, and availability. An ML system design interview adds the full model lifecycle on top: task framing, metric definition, data and feature pipelines, model selection, serving strategy, and monitoring with retraining. You need both the infrastructure and the ML lifecycle to score well.

Q: Should I propose a deep learning model or a simple baseline first?

A: Always propose a strong simple baseline first, such as gradient-boosted trees or logistic regression, and justify it by data volume and latency. Then explain when you would escalate to deep models. Reaching immediately for a complex neural architecture without justifying it is a common red flag that signals inexperience.

Q: What is the single most overlooked topic in ML system design answers?

A: Data leakage and training-serving skew. Many candidates design a clean pipeline but never address whether features are available at prediction time or whether training features match serving features. Proactively naming and preventing leakage and skew is one of the fastest ways to stand out.
