How do you choose and explain classification evaluation metrics in an interview?

Updated June 18, 2026 · 7 min read · Crack ML Interview

TL;DR

Metric questions test whether you choose the right measure for the problem rather than reciting definitions. Accuracy is misleading on imbalanced data, so know precision, recall, their tradeoff, F1, and when to favor each based on the cost of false positives versus false negatives. Understand ROC-AUC as threshold-independent ranking quality and why PR-AUC is preferable under heavy class imbalance. The strongest answers tie the metric choice to the business cost of each error type, explain that the decision threshold is tunable and separate from the model, and avoid the classic trap of optimizing accuracy on a rare-positive problem where predicting the majority class scores high yet is useless.

Why Accuracy Misleads and What Replaces It

The accuracy trap on imbalanced data

Accuracy is the fraction of correct predictions and is intuitive but dangerous when classes are imbalanced. If one percent of transactions are fraudulent, a model that predicts not-fraud for everything achieves ninety-nine percent accuracy while catching zero fraud, making it worthless despite a high score. Interviewers deliberately pose rare-positive scenarios to see whether you flag this. The correct move is to reject accuracy for imbalanced problems and reach for precision, recall, and their derivatives, which focus on performance on the rare positive class rather than being dominated by the easy majority.
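The arithmetic behind the trap can be checked in a few lines. This is a minimal sketch using made-up data (10,000 transactions with a 1% fraud rate, chosen only to match the example above) and a hypothetical "model" that always predicts not-fraud.

```python
# Hypothetical data: 10,000 transactions, 1% fraudulent (label 1).
n_total = 10_000
n_fraud = 100
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that predicts not-fraud (label 0) for every transaction.
y_pred = [0] * n_total

# Accuracy: fraction of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_total

# Recall: fraction of actual fraud cases the model caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_fraud

print(f"accuracy = {accuracy:.2%}")  # 99.00%
print(f"recall   = {recall:.2%}")    # 0.00%
```

The high accuracy comes entirely from the majority class; recall on the class that matters is zero.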

Precision, recall, and the tradeoff

Precision is the fraction of predicted positives that are actually positive; it answers the question: when the model says positive, how often is it right? Recall is the fraction of actual positives the model catches; it answers: of all the actual positives, how many did we find? The two trade off: lowering the decision threshold flags more cases as positive, which raises recall but usually lowers precision, and raising the threshold does the reverse. The interview skill is choosing which to prioritize by the cost of errors: favor recall when missing a positive is costly, such as disease screening or fraud, and favor precision when a false positive is costly, such as flagging legitimate users.
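To make the tradeoff concrete, here is a small sketch that computes precision and recall at a few thresholds. The labels, scores, and the helper name `precision_recall` are all hypothetical, and returning 0.0 when a denominator is empty is just one possible convention.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    # Assumed convention: report 0.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels (1 = positive) and model scores, sorted for readability.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.80, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10]

for threshold in (0.9, 0.5, 0.3):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.9  precision=1.00  recall=0.25
# threshold=0.5  precision=0.60  recall=0.75
# threshold=0.3  precision=0.50  recall=1.00
```

On this toy data, each step down in threshold buys recall at the price of precision, which is the pattern an interviewer expects you to describe.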

F1, ROC-AUC, and PR-AUC

F1 and when a single number helps

The F1 score is the harmonic mean of precision and recall, giving a single number that is high only when both are high, which is useful when you need one metric and care about both error types roughly equally. Because it is a harmonic mean, it punishes a large imbalance between precision and recall more than a simple average would. When the costs of the two error types differ, the F-beta score generalizes F1: a beta greater than 1 weights recall more heavily, and a beta less than 1 weights precision more heavily. Note that F1 still depends on the chosen threshold, so it measures a single operating point rather than the model across all thresholds.
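The harmonic-mean behavior is easy to see numerically. The sketch below uses the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R); the function name `f_beta` and the example precision and recall values are illustrative assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1. Returns 0.0 if both inputs are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical operating point: very precise, but misses most positives.
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
print(f"arithmetic mean = {arithmetic_mean:.3f}")                    # 0.500
print(f"F1              = {f_beta(precision, recall):.3f}")          # 0.180
print(f"F2   (recall)   = {f_beta(precision, recall, beta=2):.3f}")  # 0.122
print(f"F0.5 (precision)= {f_beta(precision, recall, beta=0.5):.3f}")# 0.346
```

A simple average would report 0.5 for this lopsided model, while F1 drops to 0.18. F2 penalizes the low recall even more, and F0.5 is more forgiving because it leans toward the high precision.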

ROC-AUC versus PR-AUC

ROC-AUC measures the area under the receiver operating characteristic curve, which plots true positive rate against false positive rate across all thresholds, summarizing the model's ranking ability independent of any single threshold; it equals the probability that the model ranks a random positive above a random negative. Its weakness is that under heavy class imbalance it can look deceptively good: the false positive rate is divided by the total number of negatives, so when negatives vastly outnumber positives, even a large number of false positives barely moves it. PR-AUC, the area under the precision-recall curve, focuses on the positive class and is more informative when positives are rare. The expected answer is ROC-AUC for balanced or ranking problems, PR-AUC for heavily imbalanced ones.
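The contrast can be illustrated with a toy ranking. This sketch computes ROC-AUC directly from its pairwise definition and uses average precision as one common step-wise estimate of PR-AUC. The data (2 positives among 100 items, placed at ranks 6 and 11) and both function names are hypothetical, and the average-precision helper assumes no tied scores.

```python
def roc_auc_pairwise(y_true, scores):
    """P(random positive is scored above a random negative); ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of precision measured at the rank of each positive (assumes no ties)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Hypothetical imbalanced ranking: 100 items, only 2 positives.
n = 100
scores = [float(n - i) for i in range(n)]  # index 0 is the top-ranked item
y_true = [0] * n
y_true[5] = 1    # first positive at rank 6
y_true[10] = 1   # second positive at rank 11

print(f"ROC-AUC           = {roc_auc_pairwise(y_true, scores):.3f}")  # 0.929
print(f"average precision = {average_precision(y_true, scores):.3f}") # 0.174
```

Both positives sit above roughly 90% of the negatives, so ROC-AUC looks strong, yet anyone reviewing the top of the list sees mostly false alarms, which is what the low average precision reflects.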

Tying Metrics to Business Cost

The threshold is tunable and the costs drive the choice

A point that distinguishes strong candidates is recognizing that the model outputs scores and the decision threshold is a separate, tunable knob chosen to hit a target precision or recall, not a fixed property of the model. This means you can move along the precision-recall curve to match business needs without retraining. The closing move is to anchor the metric choice in the asymmetric cost of errors: in fraud, a missed fraud may cost far more than a false alarm, pushing toward recall; in content moderation or user-facing flags, false positives erode trust, pushing toward precision. Naming the cost asymmetry is what turns a definitional answer into a decision-quality one.
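One way to show the threshold as a cost-driven knob is to pick it by minimizing total error cost on a validation set. The sketch below is an illustration under stated assumptions: the labels, scores, per-error costs, and the helper names `total_cost` and `best_threshold` are all hypothetical, and real costs would come from the business.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of errors when every score >= threshold is flagged positive."""
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    return fp * cost_fp + fn * cost_fn

def best_threshold(y_true, scores, cost_fp, cost_fn):
    """Try each distinct score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda th: total_cost(y_true, scores, th, cost_fp, cost_fn),
    )

# Hypothetical validation labels and scores from one fixed, already-trained model.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.80, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10]

# Fraud-like costs: a miss is assumed to cost 50x a false alarm.
print(best_threshold(y_true, scores, cost_fp=1, cost_fn=50))   # 0.35

# Moderation-like costs: a false alarm is assumed to cost 10x a miss.
print(best_threshold(y_true, scores, cost_fp=10, cost_fn=1))   # 0.85
```

The scores never change between the two calls; only the cost assumptions do. The low threshold catches every positive at the price of more false alarms, and the high threshold does the opposite.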

Classification Metrics: Definition and When to Prioritize

Metric	Measures	Prioritize When	Caveat
Accuracy	Overall fraction correct	Balanced classes	Misleading on imbalanced data
Precision	Correctness of positive predictions	False positives are costly	Ignores missed positives
Recall	Coverage of actual positives	False negatives are costly	Ignores false alarms
F1	Harmonic mean of precision/recall	Both errors matter, one number needed	Threshold-dependent
ROC-AUC	Threshold-free ranking quality	Balanced or ranking problems	Optimistic under imbalance
PR-AUC	Positive-class precision-recall area	Heavy class imbalance	Harder to interpret directly

Who this is for

Candidate who defaults to accuracy and ROC-AUC

Profile: Knows accuracy and has heard of AUC, but reaches for them reflexively without considering class imbalance or the cost of different errors.

Pain points: Falls into the imbalance trap, quoting high accuracy on a rare-positive problem, and cannot justify when PR-AUC is more appropriate than ROC-AUC.

Strategy: Internalize the imbalance trap and the ROC-AUC-versus-PR-AUC distinction. Practice answering metric questions by first asking about class balance and error costs, then selecting the metric; that order demonstrates the judgment the question is really testing.

Practitioner who knows definitions but not the business framing

Profile: Can define precision, recall, F1, and AUC accurately, but answers metric questions mechanically without connecting them to the cost of errors in the specific problem.

Pain points: Gives correct but flat definitions and misses the chance to show judgment by tying the metric choice to false-positive versus false-negative costs.

Strategy: Always close by anchoring the metric to the business cost asymmetry and noting that the decision threshold is a separate tunable knob. Framing metric selection as a cost-driven decision, rather than an exercise in recalling definitions, is what elevates the answer with interviewers.

FAQ

Q: Why is accuracy a bad metric for imbalanced data?

A: Because a model can score very high accuracy by always predicting the majority class while completely failing on the rare class that you actually care about. On a one-percent-positive problem, predicting all-negative gives ninety-nine percent accuracy yet catches nothing, so you should use precision, recall, F1, or PR-AUC instead.

Q: When should I use ROC-AUC versus PR-AUC?

A: Use ROC-AUC for balanced classes or when you care about overall ranking quality across thresholds. Use PR-AUC when positives are rare: ROC-AUC can look deceptively strong under heavy imbalance because the false positive rate is divided by the very large number of negatives, so many false positives barely move it, while PR-AUC focuses on performance on the rare positive class.

Q: How do I decide between optimizing for precision or recall?

A: Base it on the cost of each error type. Favor recall when missing a positive is expensive, such as disease screening or fraud detection. Favor precision when a false positive is expensive, such as flagging legitimate users or content. Since the decision threshold is tunable, you can move along the precision-recall curve to match the business need without retraining.
