In real-world regression tasks, your data often isn't nicely behaved. It's skewed, messy, and full of outliers. And yet, we often trust evaluation metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE) without questioning the composition of the test set.
But what if your test set lies?
The Problem: Skewed Targets and Misleading Metrics
Many real-world target variables are right-skewed. Think of medical costs, customer lifetime value, or housing prices: a few huge numbers, but most values are small.
If your test set contains only low values of the target variable, your MSE or MAE may look great. But this doesn't mean your model handles the full range of the problem — particularly the rare but important tail.
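To see why, here is a minimal synthetic sketch (not the insurance data yet, just made-up skewed values): the predictions below have roughly the same relative error everywhere, yet the absolute error looks far smaller when we evaluate only on the low end of the target.

import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=9, sigma=1, size=10_000)      # right-skewed target, e.g. costs
y_pred = y * rng.normal(1.0, 0.15, size=y.size)      # ~15% relative error everywhere
abs_err = np.abs(y - y_pred)
low_half = y < np.median(y)
print("MAE on low half: ", abs_err[low_half].mean())
print("MAE on high half:", abs_err[~low_half].mean())

Same model quality, very different MAE, purely because of which part of the distribution the test set happens to cover.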
Let’s explore this using a real-world dataset.
Dataset: U.S. Medical Insurance Costs
We use the Medical Cost Personal Dataset, which includes features like age, sex, BMI, children, smoker status, and region, and a right-skewed target: charges (medical cost).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, GridSearchCV, PredefinedSplit
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Load the data and one-hot encode the categorical columns (sex, smoker, region)
df = pd.read_csv("insurance.csv")
df = pd.get_dummies(df, drop_first=True)

X = df.drop("charges", axis=1).values
y = df["charges"].values
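Before splitting, it's worth confirming the skew with a quick check (using the arrays defined above):

print("mean:    ", y.mean())
print("median:  ", np.median(y))
print("95th pct:", np.percentile(y, 95))

For charges, the mean sits well above the median, the classic signature of a right-skewed target.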
Split 1: Random Sampling
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X, y, test_size=0.2, random_state=42)
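A random split usually preserves the overall shape of y reasonably well, but it's worth checking rather than assuming (a small sanity check on the arrays from the split above):

print("train median:", np.median(y_train_r), " test median:", np.median(y_test_r))
print("train max:   ", y_train_r.max(), " test max:   ", y_test_r.max())

With a test set of only a few hundred rows, the extreme tail can still be under- or over-represented by chance.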
Split 2: Biased Low-y Test Set
# Sort all rows by the target, then use the lowest-charge rows as the test set
sorted_indices = np.argsort(y)
X_sorted = X[sorted_indices]
y_sorted = y[sorted_indices]

n_test = len(y_test_r)          # same test-set size as the random split
X_test_low = X_sorted[:n_test]
y_test_low = y_sorted[:n_test]
X_train_low = X_sorted[n_test:]
y_train_low = y_sorted[n_test:]
Split 3: Biased High-y Test Set
# The highest-charge rows become the test set; everything else is training data
X_test_high = X_sorted[-n_test:]
y_test_high = y_sorted[-n_test:]
X_train_high = X_sorted[:-n_test]
y_train_high = y_sorted[:-n_test]
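A quick look at the target ranges makes the bias explicit (the exact values depend on the data, so treat the printout as illustrative):

print("low-y test range: ", y_test_low.min(), "to", y_test_low.max())
print("high-y test range:", y_test_high.min(), "to", y_test_high.max())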
Split 4: Stratified by Quantiles
# Bin the target into deciles, then stratify the split on those bins
bins = pd.qcut(y, q=10, labels=False, duplicates="drop")
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(X, bins))
X_train_s, X_test_s = X[train_idx], X[test_idx]
y_train_s, y_test_s = y[train_idx], y[test_idx]
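To verify that stratification worked, you can compare the decile proportions in train and test (a small check reusing the bins created above):

train_props = pd.Series(bins[train_idx]).value_counts(normalize=True).sort_index()
test_props = pd.Series(bins[test_idx]).value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_props, "test": test_props}).round(3))

Each decile should appear in roughly the same proportion (about 10%) in both sets.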
Hyperparameter Tuning
def hyperparameter_search(X_train, y_train, X_val, y_val):
    # Tune Ridge's alpha on an explicit validation fold via PredefinedSplit:
    # rows marked -1 are always used for training, rows marked 0 form the validation fold.
    X_combined = np.vstack([X_train, X_val])
    y_combined = np.concatenate([y_train, y_val])
    split_idx = [-1] * len(X_train) + [0] * len(X_val)
    ps = PredefinedSplit(test_fold=split_idx)
    grid = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1, 10, 100]}, cv=ps)
    grid.fit(X_combined, y_combined)
    return grid.best_estimator_   # refit with the best alpha on the combined data
Evaluating Each Split
splits = {
    "Random": (X_train_r, y_train_r, X_test_r, y_test_r),
    "Biased Low-Y": (X_train_low, y_train_low, X_test_low, y_test_low),
    "Biased High-Y": (X_train_high, y_train_high, X_test_high, y_test_high),
    "Stratified": (X_train_s, y_train_s, X_test_s, y_test_s),
}

results = []
for name, (X_tr, y_tr, X_te, y_te) in splits.items():
    # Tune on this split, then score the tuned model on its test set
    model = hyperparameter_search(X_tr, y_tr, X_te, y_te)
    y_pred = model.predict(X_te)
    mse = mean_squared_error(y_te, y_pred)
    mae = mean_absolute_error(y_te, y_pred)
    results.append((name, mse, mae))

results_df = pd.DataFrame(results, columns=["Split", "MSE", "MAE"])
print(results_df)
Results: Evaluation Across Splits
| Split | MSE | MAE |
|---|---|---|
| Random | 33,251,680 | 4,118.34 |
| Biased Low-Y | 5,563,189 | 1,880.35 |
| Biased High-Y | 121,632,500 | 9,934.32 |
| Stratified | 36,873,480 | 4,189.87 |
What Do We Learn?
The Biased Low-Y split gives overly optimistic metrics, while the Biased High-Y split gives a harsh but honest view of performance on the expensive tail. The Random and Stratified splits land close together here, but only the stratified split guarantees that every slice of the charges distribution is represented in the test set.
Takeaways
- Don't blindly trust random splits in regression, especially with skewed targets.
- Always inspect your test set's `y` distribution.
- Use stratified or quantile-aware splits for a balanced evaluation.
- Measure performance by quantile slices to expose hidden failure modes (a small sketch follows below).
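As an illustration of that last point, here is a minimal sketch of per-decile MAE on the random split, reusing the function and arrays defined earlier:

# MAE per decile of the true charges in the random test set
model_r = hyperparameter_search(X_train_r, y_train_r, X_test_r, y_test_r)
pred_r = model_r.predict(X_test_r)
deciles = pd.qcut(y_test_r, q=10, labels=False, duplicates="drop")
per_decile = (
    pd.DataFrame({"abs_err": np.abs(y_test_r - pred_r), "decile": deciles})
    .groupby("decile")["abs_err"]
    .mean()
)
print(per_decile)

Expect the absolute error to grow sharply in the top deciles; that is exactly the failure mode an aggregate MAE can hide.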
Want to Go Further?
- Try other models like `RandomForestRegressor`, `XGBoost`, or `CatBoost`.
- Visualize error vs. quantiles.
- Run multiple random splits to assess performance variance (a short sketch follows below).
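For that last idea, a minimal sketch assuming the same Ridge setup as above; the exact numbers will vary by seed:

# Repeat the random split with different seeds and look at the spread of MAE
maes = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = hyperparameter_search(X_tr, y_tr, X_te, y_te)
    maes.append(mean_absolute_error(y_te, model.predict(X_te)))
print("MAE mean:", np.mean(maes), " std:", np.std(maes))

The spread across seeds tells you how much of any single-split result is just luck of the draw.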
In regression, what you test on is just as important as how you measure.