In real-world regression tasks, your data often isn't nicely behaved. It's skewed, messy, and full of outliers. And yet, we often trust evaluation metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE) without questioning the composition of the test set.
But what if your test set lies?
The Problem: Skewed Targets and Misleading Metrics
Many real-world target variables are right-skewed. Think of medical costs, customer lifetime value, or housing prices: a few huge numbers, but most values are small.
If your test set contains only low values of the target variable, your MSE or MAE may look great. But this doesn't mean your model handles the full range of the problem — particularly the rare but important tail.
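To see why, here is a minimal synthetic sketch (not the insurance data yet, just made-up skewed values): the predictions below have roughly the same relative error everywhere, yet the absolute error looks far smaller when we evaluate only on the low end of the target.

import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=9, sigma=1, size=10_000)      # right-skewed target, e.g. costs
y_pred = y * rng.normal(1.0, 0.15, size=y.size)      # ~15% relative error everywhere
abs_err = np.abs(y - y_pred)
low_half = y < np.median(y)
print("MAE on low half: ", abs_err[low_half].mean())
print("MAE on high half:", abs_err[~low_half].mean())

Same model quality, very different MAE, purely because of which part of the distribution the test set happens to cover.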
Let’s explore this using a real-world dataset.
Dataset: U.S. Medical Insurance Costs
We use the Medical Cost Personal Dataset, which includes features like age, sex, BMI, children, smoker status, and region, and a right-skewed target: charges (medical cost).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, GridSearchCV, PredefinedSplit
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Load the data and one-hot encode the categorical columns (sex, smoker, region)
df = pd.read_csv("insurance.csv")
df = pd.get_dummies(df, drop_first=True)

X = df.drop("charges", axis=1).values
y = df["charges"].values
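Before splitting, it's worth confirming the skew with a quick check (using the arrays defined above):

print("mean:    ", y.mean())
print("median:  ", np.median(y))
print("95th pct:", np.percentile(y, 95))

For charges, the mean sits well above the median, the classic signature of a right-skewed target.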
Split 1: Random Sampling
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X, y, test_size=0.2, random_state=42)
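A random split usually preserves the overall shape of y reasonably well, but it's worth checking rather than assuming (a small sanity check on the arrays from the split above):

print("train median:", np.median(y_train_r), " test median:", np.median(y_test_r))
print("train max:   ", y_train_r.max(), " test max:   ", y_test_r.max())

With a test set of only a few hundred rows, the extreme tail can still be under- or over-represented by chance.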
Split 2: Biased Low-y Test Set
# Sort all rows by the target, then use the lowest-charge rows as the test set
sorted_indices = np.argsort(y)
X_sorted = X[sorted_indices]
y_sorted = y[sorted_indices]

n_test = len(y_test_r)          # same test-set size as the random split
X_test_low = X_sorted[:n_test]
y_test_low = y_sorted[:n_test]
X_train_low = X_sorted[n_test:]
y_train_low = y_sorted[n_test:]
Split 3: Biased High-y Test Set
# The highest-charge rows become the test set; everything else is training data
X_test_high = X_sorted[-n_test:]
y_test_high = y_sorted[-n_test:]
X_train_high = X_sorted[:-n_test]
y_train_high = y_sorted[:-n_test]
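A quick look at the target ranges makes the bias explicit (the exact values depend on the data, so treat the printout as illustrative):

print("low-y test range: ", y_test_low.min(), "to", y_test_low.max())
print("high-y test range:", y_test_high.min(), "to", y_test_high.max())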
Split 4: Stratified by Quantiles
# Bin the target into deciles, then stratify the split on those bins
bins = pd.qcut(y, q=10, labels=False, duplicates="drop")
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(X, bins))
X_train_s, X_test_s = X[train_idx], X[test_idx]
y_train_s, y_test_s = y[train_idx], y[test_idx]
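To verify that stratification worked, you can compare the decile proportions in train and test (a small check reusing the bins created above):

train_props = pd.Series(bins[train_idx]).value_counts(normalize=True).sort_index()
test_props = pd.Series(bins[test_idx]).value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_props, "test": test_props}).round(3))

Each decile should appear in roughly the same proportion (about 10%) in both sets.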
Hyperparameter Tuning
def hyperparameter_search(X_train, y_train, X_val, y_val):
    # Tune Ridge's alpha on an explicit validation fold via PredefinedSplit:
    # rows marked -1 are always used for training, rows marked 0 form the validation fold.
    X_combined = np.vstack([X_train, X_val])
    y_combined = np.concatenate([y_train, y_val])
    split_idx = [-1] * len(X_train) + [0] * len(X_val)
    ps = PredefinedSplit(test_fold=split_idx)
    grid = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1, 10, 100]}, cv=ps)
    grid.fit(X_combined, y_combined)
    return grid.best_estimator_   # refit with the best alpha on the combined data
Evaluating Each Split
splits = {
    "Random": (X_train_r, y_train_r, X_test_r, y_test_r),
    "Biased Low-Y": (X_train_low, y_train_low, X_test_low, y_test_low),
    "Biased High-Y": (X_train_high, y_train_high, X_test_high, y_test_high),
    "Stratified": (X_train_s, y_train_s, X_test_s, y_test_s),
}

results = []
for name, (X_tr, y_tr, X_te, y_te) in splits.items():
    # Tune on this split, then score the tuned model on its test set
    model = hyperparameter_search(X_tr, y_tr, X_te, y_te)
    y_pred = model.predict(X_te)
    mse = mean_squared_error(y_te, y_pred)
    mae = mean_absolute_error(y_te, y_pred)
    results.append((name, mse, mae))

results_df = pd.DataFrame(results, columns=["Split", "MSE", "MAE"])
print(results_df)
Results: Evaluation Across Splits
| Split | MSE | MAE |
|---|---|---|
| Random | 33,251,680 | 4,118.34 |
| Biased Low-Y | 5,563,189 | 1,880.35 |
| Biased High-Y | 121,632,500 | 9,934.32 |
| Stratified | 36,873,480 | 4,189.87 |
What Do We Learn?
The Biased Low-Y split gives overly optimistic metrics, while the Biased High-Y split gives a harsh but honest view of performance on the expensive tail. The Random and Stratified splits land close together here, but only the stratified split guarantees that every slice of the charges distribution is represented in the test set.
Takeaways
- Don't blindly trust random splits in regression, especially with skewed targets.
- Always inspect your test set's `y` distribution.
- Use stratified or quantile-aware splits for a balanced evaluation.
- Measure performance by quantile slices to expose hidden failure modes (a small sketch follows below).
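As an illustration of that last point, here is a minimal sketch of per-decile MAE on the random split, reusing the function and arrays defined earlier:

# MAE per decile of the true charges in the random test set
model_r = hyperparameter_search(X_train_r, y_train_r, X_test_r, y_test_r)
pred_r = model_r.predict(X_test_r)
deciles = pd.qcut(y_test_r, q=10, labels=False, duplicates="drop")
per_decile = (
    pd.DataFrame({"abs_err": np.abs(y_test_r - pred_r), "decile": deciles})
    .groupby("decile")["abs_err"]
    .mean()
)
print(per_decile)

Expect the absolute error to grow sharply in the top deciles; that is exactly the failure mode an aggregate MAE can hide.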
Want to Go Further?
- Try other models like `RandomForestRegressor`, `XGBoost`, or `CatBoost`.
- Visualize error vs. quantiles.
- Run multiple random splits to assess performance variance (a short sketch follows below).
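For that last idea, a minimal sketch assuming the same Ridge setup as above; the exact numbers will vary by seed:

# Repeat the random split with different seeds and look at the spread of MAE
maes = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = hyperparameter_search(X_tr, y_tr, X_te, y_te)
    maes.append(mean_absolute_error(y_te, model.predict(X_te)))
print("MAE mean:", np.mean(maes), " std:", np.std(maes))

The spread across seeds tells you how much of any single-split result is just luck of the draw.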
In regression, what you test on is just as important as how you measure.