Robust Regression¶
Standard FPC regression (`fregre_lm`) uses ordinary least squares, which is sensitive to outliers in the response. Robust regression methods replace the squared loss with loss functions that down-weight extreme residuals, yielding estimators that resist contamination.
fdars provides two robust alternatives:
| Method | Loss function | Breakdown point | Notes |
|---|---|---|---|
| L1 regression | \(\lvert r \rvert\) | 50% | Median regression; influence depends only on the sign of a residual, not its magnitude |
| Huber M-estimation | Quadratic near 0, linear in tails | Depends on \(k\) | Smooth compromise between L2 and L1 |
L1 regression¶
L1 (least absolute deviations) regression minimizes \(\sum_i |y_i - \hat{y}_i|\) instead of \(\sum_i (y_i - \hat{y}_i)^2\). This is equivalent to estimating the conditional median rather than the conditional mean.
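A tiny intercept-only example makes the median connection concrete. This is plain NumPy, independent of fdars: we minimize the L1 loss over a grid of candidate constants and compare against the mean and median.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 100.0])              # one gross outlier
c_grid = np.linspace(0.0, 10.0, 10001)
l1_loss = np.abs(y[:, None] - c_grid[None, :]).sum(axis=0)
c_l1 = c_grid[np.argmin(l1_loss)]                 # argmin of sum |y_i - c|

print(y.mean())      # 26.5 -- the L2 minimizer, dragged by the outlier
print(np.median(y))  # 2.5  -- the conditional-median target
print(c_l1)          # lies in [2, 3], the set of L1 minimizers
```

With an even number of points, any value between the two central order statistics minimizes the L1 loss; the outlier moves the mean far outside that interval but does not move the interval itself.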
```python
import numpy as np
from fdars import Fdata
from fdars.regression import fregre_l1

# --- Simulate data with outliers ---
np.random.seed(42)
n, m = 100, 81
t = np.linspace(0, 1, m)
beta_true = np.sin(4 * np.pi * t)

raw = np.zeros((n, m))
for i in range(n):
    raw[i] = (
        np.random.randn() * np.sin(2 * np.pi * t)
        + np.random.randn() * np.cos(2 * np.pi * t)
        + 0.2 * np.random.randn(m)
    )
fd = Fdata(raw, argvals=t)
response = np.trapz(fd.data * beta_true, fd.argvals, axis=1) + 0.3 * np.random.randn(n)

# Add 10% outliers
n_outliers = 10
outlier_idx = np.random.choice(n, n_outliers, replace=False)
response[outlier_idx] += 10 * np.random.randn(n_outliers)

# --- Fit L1 regression ---
result = fregre_l1(fd.data, response, n_comp=3)
fitted = result["fitted_values"]  # (n,)
resid = result["residuals"]       # (n,)
beta_l1 = result["beta_t"]        # (m,) -- estimated beta(t)
print(f"L1 median absolute residual: {np.median(np.abs(resid)):.4f}")
```
| Key | Type | Description |
|---|---|---|
| `fitted_values` | ndarray `(n,)` | Fitted values |
| `residuals` | ndarray `(n,)` | Residuals |
| `beta_t` | ndarray `(m,)` | Estimated coefficient function |
Huber M-estimation¶
Huber regression uses the Huber loss, which behaves like squared error for small residuals and like absolute error for large residuals:

\[
\rho_k(r) =
\begin{cases}
\tfrac{1}{2} r^2, & \lvert r \rvert \le k, \\[4pt]
k \lvert r \rvert - \tfrac{1}{2} k^2, & \lvert r \rvert > k.
\end{cases}
\]

The tuning constant \(k\) controls the transition point. The default \(k = 1.345\) gives 95% asymptotic efficiency relative to OLS when the errors are truly Gaussian.
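The loss itself is straightforward to write down in NumPy. A standalone sketch (`huber_loss` here is a local helper for illustration, not part of fdars):

```python
import numpy as np

def huber_loss(r, k=1.345):
    """Huber loss: quadratic for |r| <= k, linear (slope k) beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= k,
                    0.5 * r**2,
                    k * np.abs(r) - 0.5 * k**2)

print(huber_loss(0.5))   # 0.125 -- inside the squared-error region
print(huber_loss(10.0))  # grows linearly, far below the quadratic's 50.0
```

The `- 0.5 * k**2` offset makes the two pieces meet at \(\lvert r \rvert = k\), so the loss (and its derivative) is continuous there.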
```python
from fdars.regression import fregre_huber

result = fregre_huber(fd.data, response, n_comp=3, huber_k=1.345)
fitted = result["fitted_values"]  # (n,)
resid = result["residuals"]       # (n,)
beta_huber = result["beta_t"]     # (m,)
print(f"Huber median absolute residual: {np.median(np.abs(resid)):.4f}")
```
| Key | Type | Description |
|---|---|---|
| `fitted_values` | ndarray `(n,)` | Fitted values |
| `residuals` | ndarray `(n,)` | Residuals |
| `beta_t` | ndarray `(m,)` | Estimated coefficient function |
| Parameter | Default | Description |
|---|---|---|
| `n_comp` | 3 | Number of FPC components |
| `huber_k` | 1.345 | Huber tuning constant |
Choosing `huber_k`

| `huber_k` | Behavior | Efficiency (Gaussian) |
|---|---|---|
| 0.5 | Very robust, low efficiency | ~60% |
| 1.0 | Moderate robustness | ~89% |
| 1.345 | Standard choice | ~95% |
| 2.0 | Mild robustness | ~99% |
| \(\to \infty\) | Equivalent to OLS | 100% |
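The efficiency column can be sanity-checked with a small Monte Carlo on the simplest case: Huber location estimation via iteratively reweighted least squares on Gaussian samples. This is plain NumPy, not fdars, and `huber_location` is an illustrative helper; exact numbers vary with the seed.

```python
import numpy as np

def huber_location(y, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via IRLS, starting from the median."""
    mu = np.median(y)
    for _ in range(max_iter):
        r = y - mu
        # Huber weights: 1 in the quadratic region, k/|r| in the tails
        w = np.minimum(1.0, k / np.maximum(np.abs(r), 1e-12))
        mu_new = np.sum(w * y) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

rng = np.random.default_rng(0)
reps, n = 2000, 50
means = np.empty(reps)
hubers = np.empty(reps)
for i in range(reps):
    y = rng.standard_normal(n)       # clean Gaussian errors, unit scale
    means[i] = y.mean()
    hubers[i] = huber_location(y)

eff = means.var() / hubers.var()     # relative efficiency vs the sample mean
print(f"efficiency at k=1.345: {eff:.2f}")   # roughly 0.95
```

On clean Gaussian data the Huber estimate pays only a small variance premium over the mean, which is exactly the 95% figure in the table.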
Comparing OLS, L1, and Huber¶
```python
import numpy as np
import pandas as pd
from fdars import Fdata
from fdars.regression import fregre_lm, fregre_l1, fregre_huber

np.random.seed(0)
n, m = 120, 101
t = np.linspace(0, 1, m)
beta_true = np.exp(-((t - 0.5)**2) / 0.02)

# Clean data
raw = np.zeros((n, m))
for i in range(n):
    raw[i] = sum(
        np.random.randn() * np.sin((2 * k + 1) * np.pi * t)
        for k in range(4)
    ) + 0.15 * np.random.randn(m)
fd = Fdata(raw, argvals=t)
response_clean = np.trapz(fd.data * beta_true, fd.argvals, axis=1) + 0.3 * np.random.randn(n)

# Contaminated response (15% outliers)
response = response_clean.copy()
contaminated = np.random.choice(n, int(0.15 * n), replace=False)
response[contaminated] += 8 * np.random.choice([-1, 1], size=len(contaminated))

# --- Fit all three ---
ols = fregre_lm(fd.data, response, n_comp=4)
l1 = fregre_l1(fd.data, response, n_comp=4)
huber = fregre_huber(fd.data, response, n_comp=4, huber_k=1.345)

# --- Evaluate beta recovery and prediction on clean observations ---
clean_idx = np.setdiff1d(np.arange(n), contaminated)
rows = []
for name, res in [("OLS", ols), ("L1", l1), ("Huber", huber)]:
    corr = np.corrcoef(beta_true, res["beta_t"])[0, 1]
    mse = np.mean((res["fitted_values"][clean_idx] - response_clean[clean_idx])**2)
    rows.append({"method": name, "beta_corr": corr, "mse_clean": mse})
results_df = pd.DataFrame(rows)
print(results_df.to_string(index=False))
```
When to use robust methods¶
| Scenario | Recommendation |
|---|---|
| Clean data, no outliers | `fregre_lm` (OLS) -- most efficient |
| Suspected outliers in response | `fregre_huber` with default \(k=1.345\) |
| Known heavy contamination (\(>10\%\)) | `fregre_l1` |
| Outliers in predictors (leverage points) | Pre-filter with outlier detection, then use any method |
| Heteroscedastic errors | `fregre_huber` with smaller \(k\) (e.g., 1.0) |
Warning
Robust methods protect against outliers in the response \(y_i\). They do not guard against leverage points (outlying \(x_i(t)\)). For high-leverage outliers, consider depth-based trimming before fitting.
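A minimal sketch of that pre-filtering step, using a crude rank-based depth as a stand-in for a proper functional depth (e.g. Fraiman-Muniz). `simple_depth` is illustrative only, not an fdars function:

```python
import numpy as np

def simple_depth(data):
    """Average pointwise rank depth over the grid: close to 1 for a curve
    near the cross-sectional median, 0.5 for one extreme at every point."""
    n, _ = data.shape
    ranks = data.argsort(axis=0).argsort(axis=0) + 1   # 1..n per grid point
    return (1.0 - np.abs(0.5 - ranks / n)).mean(axis=1)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 40))
X[0] += 10.0                               # a gross leverage curve, far above the rest

depth = simple_depth(X)
keep = depth > np.quantile(depth, 0.10)    # trim the shallowest 10%
# X[keep] (and the matching responses) would then be passed to
# fregre_huber / fregre_l1 in place of the full sample.
```

The shifted curve sits at the extreme rank at every grid point, so it receives the minimum possible depth and is removed before fitting; the robust loss then only has to handle residual outliers in the response.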