How do you design a production RAG system in an interview?

Updated June 18, 2026 · 8 min read · Crack ML Interview

TL;DR

A strong RAG design answer covers the full pipeline: ingestion and chunking, embedding and indexing in a vector store, retrieval with hybrid search, reranking to improve top-k precision, prompt assembly with context, generation, and evaluation. The differentiators are knowing that retrieval quality dominates final answer quality, using hybrid dense-plus-keyword search to cover the weaknesses of pure semantic search, adding a cross-encoder reranker, and defining concrete evaluation metrics such as context recall, faithfulness, and answer relevance. The most common failure is a retrieval miss, so most of your optimization budget should go to retrieval, not the LLM.

The RAG Pipeline End to End

Ingestion, chunking, and embedding

Ingestion parses documents into text, then splits that text into retrievable chunks. Chunk size is a critical tradeoff: small chunks give precise retrieval but may lack context, while large chunks preserve context but dilute relevance and waste the context window. A common default is a few hundred tokens with overlap to avoid cutting concepts at boundaries, but the right size depends on document structure. Each chunk is embedded with a sentence-embedding model into a dense vector. Discuss embedding model choice by domain: a general model may underperform on specialized text like legal or medical, where a domain-tuned embedder can improve recall substantially.
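The overlap idea above can be sketched as a sliding window. This is a minimal illustration, not a recommended chunker: the function name chunk_tokens and the default sizes are assumptions, and "tokens" here are whitespace-split words rather than the output of a real tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 200, overlap: int = 40) -> List[List[str]]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens.

    Hypothetical helper for illustration; sizes are placeholders to tune
    against retrieval metrics, not recommended values.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "small chunks are precise but large chunks keep more context".split()
    for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
        print(chunk)
```

Each window repeats the last token or tokens of the previous one, so a concept that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.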

Indexing and retrieval with hybrid search

Store embeddings in a vector database using an approximate nearest neighbor (ANN) index like HNSW or IVF-PQ, trading a small recall loss for fast sublinear search at scale. The key sophistication is hybrid search: pure dense vector search misses exact keyword matches, rare terms, and identifiers, so combine it with sparse keyword search like BM25 and fuse the two ranked lists. Hybrid retrieval typically outperforms either method alone and is the expected answer when an interviewer asks how to improve retrieval quality. Also discuss metadata filtering to scope retrieval to the right document set before similarity search.
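The article does not name a fusion method. One widely used option is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever and so avoids comparing raw BM25 and cosine scores. The sketch below assumes each retriever has already returned an ordered list of chunk ids; the function name and the constant k=60 (a conventional default) are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # assumed output of vector search
    sparse_hits = ["c", "a", "d"]  # assumed output of BM25
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever (for example an exact identifier match from BM25) still makes the fused list.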

Reranking, prompt assembly, and generation

Retrieve a generous top-k of twenty to fifty candidates, then rerank them with a cross-encoder that jointly encodes the query and each chunk to produce a precise relevance score, keeping only the top three to five for the prompt. This two-stage retrieve-then-rerank pattern is usually much more accurate than relying on the initial similarity scores alone, because first-stage scores compare embeddings that were computed for the query and the chunk independently. Assemble the prompt by injecting the reranked chunks with clear delimiters and an instruction to answer only from the provided context and to cite sources. Finally, generate the answer, ideally with citations so the response is verifiable and hallucinations are easier to catch.
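A rough sketch of the rerank-then-assemble step follows. The scorer is pluggable: toy_overlap_score is only a word-overlap stand-in so the example runs without a model, and a real system would pass a cross-encoder's scoring function instead. The chunk shape (source_id, text), the delimiter tags, and the prompt wording are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Chunk = Tuple[str, str]  # (source_id, text); hypothetical shape for this sketch


def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query words present in the chunk.

    Not a cross-encoder. It only makes the example self-contained.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query: str, candidates: Sequence[Chunk],
           score_fn: Callable[[str, str], float], keep: int = 3) -> List[Chunk]:
    """Score every candidate against the query and keep the best `keep`."""
    scored = [(score_fn(query, text), (source_id, text)) for source_id, text in candidates]
    # sorted() is stable, so tied candidates keep their first-stage order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]


def build_prompt(query: str, chunks: Sequence[Chunk]) -> str:
    """Wrap each chunk in delimiters that carry its source id for citation."""
    context = "\n\n".join(
        f'<chunk source="{source_id}">\n{text}\n</chunk>' for source_id, text in chunks
    )
    return (
        "Answer using only the context below and cite the source id for each claim. "
        "If the context does not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    candidates = [
        ("doc1", "how to change billing plan"),
        ("doc2", "reset your password from settings"),
        ("doc3", "password rules"),
    ]
    top = rerank("reset password", candidates, toy_overlap_score, keep=2)
    print(build_prompt("reset password", top))
```

Keeping the scorer as a parameter means the first-stage retriever and the reranker can be swapped or benchmarked independently.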

Evaluating a RAG System

Separate retrieval metrics from generation metrics

A common mistake is evaluating only the final answer. Decompose evaluation into retrieval and generation. Retrieval metrics include context recall (were the chunks needed to answer the question retrieved?) and context precision (are the retrieved chunks actually relevant?). Generation metrics include faithfulness (is every claim in the answer grounded in the retrieved context?) and answer relevance (does the answer address the question?). Separating these lets you localize failures: low context recall means fix retrieval, while high context recall but low faithfulness means fix the prompt or model.
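When ground-truth chunk ids are available, the two retrieval metrics reduce to simple set arithmetic. The sketch below uses plain set-based definitions chosen for illustration; evaluation libraries may define these metrics differently (for example with rank weighting or LLM-judged relevance), so treat the function names and formulas as assumptions.

```python
from typing import Iterable, Sequence


def context_recall(retrieved: Sequence[str], relevant: Iterable[str]) -> float:
    """Share of the needed chunks that were actually retrieved."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant_set & set(retrieved)) / len(relevant_set)


def context_precision(retrieved: Sequence[str], relevant: Iterable[str]) -> float:
    """Share of the retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant_set)
    return hits / len(retrieved)


if __name__ == "__main__":
    retrieved = ["c1", "c2", "c3", "c4"]
    relevant = ["c1", "c5"]
    print(context_recall(retrieved, relevant))     # 0.5
    print(context_precision(retrieved, relevant))  # 0.25
```

In this example half of the needed chunks were found (a retrieval problem) and three of the four retrieved chunks are noise (a ranking or top-k problem), which is exactly the kind of localization the decomposition is meant to provide.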

Build an evaluation set and use LLM-as-judge carefully

Construct a labeled evaluation set of questions, ground-truth answers, and ideally the ground-truth source chunks for each question. For automated scoring, an LLM judge can grade faithfulness and relevance at scale, but note the caveat: judges have their own biases and should be validated against human labels on a sample. Track these metrics in CI so a chunking or embedding change cannot silently regress quality. Mentioning a concrete evaluation harness signals that you treat RAG as an engineering system with measurable quality, not a prompt you tweak by vibes.
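A very small harness along these lines might look like the following. Everything here is a hypothetical sketch: EvalExample, the retrieve callable signature, the 0.02 tolerance, and the simple agreement rate used to sanity-check a judge are illustrative assumptions rather than a standard.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence


@dataclass(frozen=True)
class EvalExample:
    question: str
    gold_chunk_ids: FrozenSet[str]  # ground-truth source chunks for the question


def mean_context_recall(examples: Sequence[EvalExample],
                        retrieve: Callable[[str, int], Sequence[str]],
                        k: int = 5) -> float:
    """Average, over the evaluation set, of the share of gold chunks retrieved."""
    if not examples:
        raise ValueError("evaluation set is empty")
    total = 0.0
    for example in examples:
        if not example.gold_chunk_ids:
            raise ValueError(f"no gold chunks for: {example.question!r}")
        retrieved = set(retrieve(example.question, k))
        total += len(example.gold_chunk_ids & retrieved) / len(example.gold_chunk_ids)
    return total / len(examples)


def passes_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """CI check: fail the build if the metric drops more than `tolerance`."""
    return current >= baseline - tolerance


def judge_agreement(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of sampled items where the LLM judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    examples = [
        EvalExample("q1", frozenset({"a", "b"})),
        EvalExample("q2", frozenset({"c"})),
    ]
    fake_index = {"q1": ["a", "x"], "q2": ["c", "y"]}  # stand-in retriever output
    recall = mean_context_recall(examples, lambda q, k: fake_index[q][:k], k=2)
    print(recall, passes_gate(recall, baseline=0.76))
```

Running this on every pull request, with the baseline stored alongside the code, is one way to make a chunking or embedding change surface as a failed check instead of a silent quality drop.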

Failure Modes and Scaling Concerns

Retrieval miss is the dominant failure

The most frequent RAG failure is that the relevant information was never retrieved, so the model either hallucinates or refuses. Causes include poor chunking that splits the answer, an embedding model that does not capture domain semantics, and queries phrased differently from the source text. Fixes include query rewriting or expansion before retrieval, hybrid search, better chunking, and increasing top-k before reranking. Because retrieval quality caps the entire system, allocate most optimization effort there rather than to the generation model.
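As a rough illustration of query expansion, the sketch below generates query variants from a hand-written alias table and merges the retrieval results for all variants. In practice the rewriting step is often done by an LLM; the alias table, function names, and retrieve signature here are hypothetical and exist only to keep the example runnable.

```python
from typing import Callable, Dict, List, Sequence


def expand_query(query: str, aliases: Dict[str, List[str]]) -> List[str]:
    """Return the original query plus variants with known terms swapped.

    Matching is a naive lowercase substring check, which is enough for a
    sketch but would need word-boundary handling in real use.
    """
    lowered = query.lower()
    variants = [query]
    for term, alternatives in aliases.items():
        if term in lowered:
            variants.extend(lowered.replace(term, alt) for alt in alternatives)
    return list(dict.fromkeys(variants))  # de-duplicate, keep order


def retrieve_union(variants: Sequence[str],
                   retrieve: Callable[[str, int], Sequence[str]],
                   k: int = 20) -> List[str]:
    """Retrieve for every variant and merge ids, preserving first-seen order."""
    merged: List[str] = []
    seen = set()
    for variant in variants:
        for chunk_id in retrieve(variant, k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


if __name__ == "__main__":
    aliases = {"2fa": ["two-factor authentication"]}  # hypothetical alias table
    print(expand_query("How do I reset my 2FA", aliases))
```

The merged candidate list would then go to the reranker, which decides which of the extra candidates are worth keeping.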

Freshness, cost, and latency at scale

For a production system, address index freshness: how new or updated documents are re-embedded and upserted, and how stale chunks are removed. Discuss cost and latency too, since reranking and large contexts increase both. Mitigations include caching embeddings and frequent query results, and routing simple queries to a smaller model with escalation for complex ones. At high query volume, the reranker and the LLM are the main cost drivers, so semantic caching of near-duplicate queries can provide meaningful savings. Naming these operational concerns separates a toy demo answer from a production design.
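Semantic caching can be sketched as a lookup keyed on query embeddings instead of exact strings. The version below is a minimal illustration under stated assumptions: toy_embed is a bag-of-words stand-in for a real embedding model, the 0.95 threshold is arbitrary, and the linear scan would be replaced by an ANN index at any real volume.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_embed(text: str,
              vocab: Sequence[str] = ("reset", "password", "billing", "invoice", "change")) -> List[float]:
    """Stand-in embedder: word counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


class SemanticCache:
    """Return a cached answer when a new query is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Sequence[float], str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        query_vec = self.embed(query)
        best_answer, best_sim = None, self.threshold
        for vec, answer in self.entries:  # linear scan, fine only for a sketch
            sim = cosine(query_vec, vec)
            if sim >= best_sim:
                best_answer, best_sim = answer, sim
        return best_answer


if __name__ == "__main__":
    cache = SemanticCache(toy_embed)
    cache.put("reset password", "Go to Settings, then Security.")
    print(cache.get("password reset"))          # near-duplicate: cache hit
    print(cache.get("change billing invoice"))  # unrelated: None
```

A hit skips retrieval, reranking, and generation entirely, which is why the threshold matters: set too low, the cache returns answers to questions that only look similar.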

RAG Pipeline Stages: Purpose, Key Choices, and Failure Modes

Stage	Purpose	Key Design Choice	Primary Failure Mode
Chunking	Split docs into retrievable units	Chunk size and overlap	Answer split across chunks
Embedding	Map text to vectors	General vs domain-tuned model	Poor domain semantic match
Indexing	Fast similarity search	HNSW / IVF-PQ ANN index	Recall loss from over-aggressive ANN
Retrieval	Find candidate chunks	Hybrid dense + BM25	Retrieval miss on rare terms
Reranking	Precise top-k selection	Cross-encoder reranker	Latency exceeds SLA
Generation	Produce grounded answer	Cite-from-context prompting	Hallucination beyond context
Evaluation	Measure quality	Separate retrieval vs generation metrics	Only scoring final answer

Who this is for

Engineer who built a RAG demo but never productionized one

Profile: Has wired up a basic vector search plus LLM prototype with a framework, but has not dealt with evaluation, reranking, hybrid search, or index freshness.

Pain points: Describes the happy-path pipeline fluently but stalls on how to measure quality, fix retrieval misses, or keep the index fresh, which reveals the gap between a demo and a production system.

Strategy: Study the retrieve-then-rerank pattern and hybrid search until you can justify them, and learn the four core RAG metrics: context recall, context precision, faithfulness, and answer relevance. Practice answering with retrieval quality as the central theme, since that is where production RAG systems live or die.

Backend engineer strong on infra, light on retrieval quality

Profile: Comfortable designing scalable services, caching, and queues, but has limited intuition for chunking, embedding model selection, and retrieval evaluation.

Pain points: Designs a robust serving and caching layer but gives shallow answers on why retrieval fails and how to improve it, treating the embedding and reranking choices as interchangeable.

Strategy: Build intuition for the retrieval half: read about hybrid search, cross-encoder reranking, and chunking tradeoffs. Pair existing infrastructure strength with concrete retrieval-quality reasoning so the answer is both scalable and accurate. Lead with the pipeline, then layer in caching and freshness as operational depth.

FAQ

Q: When should I recommend fine-tuning instead of RAG?

A: RAG is best when knowledge changes frequently, must be cited, or is too large to bake into weights. Fine-tuning is better for teaching style, format, or a fixed skill. In interviews, the strong answer is usually that they are complementary: use RAG for up-to-date factual grounding and fine-tuning for behavior and format, and combine them when both matter.

Q: How do I choose chunk size in a RAG system?

A: There is no universal answer; it depends on document structure and query type. Start with a few hundred tokens with overlap, then tune empirically against your retrieval metrics. Structured documents may chunk by section, while dense prose may need smaller overlapping windows. The right move in an interview is to state the tradeoff and say you would tune it against context recall.

Q: Is a reranker really necessary, or is vector search enough?

A: For anything beyond a small corpus, a reranker meaningfully improves answer quality. Initial vector similarity is approximate and noisy, so retrieving a generous top-k and reranking with a cross-encoder to a small final set captures most of the available accuracy gain for modest added latency. Skipping it is a common reason demos underperform.
