Modeling¶
Cross-validation, feature importance, hyperparameter search, and scoring (AFML Ch. 6–8).
PurgedKFold ¶
Purged K-Fold cross-validation with embargo (AFML Ch. 7).
Prevents information leakage in time-series data by purging training observations that overlap with test events, and optionally applying an embargo period after each test set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_splits` | `int` | Number of folds. | `5` |
| `embargo_pct` | `float` | Fraction of total observations to embargo after each test fold. | `0.0` |
split ¶
Generate train/test splits with purging and embargo.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `events` | `list[tuple[int, int]]` | List of (entry_idx, exit_idx) pairs for each observation. | required |
| `n_samples` | `int` | Total number of samples (must equal len(events)). | required |

Returns:
| Type | Description |
|---|---|
| `list[FoldIndices]` | Train/test index pairs for each fold. |
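
The purging and embargo logic described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `purged_kfold_indices` is hypothetical, and it assumes that events are ordered by entry, that entry and exit indices share the observation index scale, and that folds are contiguous blocks.

```python
def purged_kfold_indices(events, n_splits=5, embargo_pct=0.0):
    """Return a list of (train_idx, test_idx) pairs. Illustrative sketch only."""
    n = len(events)
    embargo = int(n * embargo_pct)  # embargo length, in observations
    base, extra = divmod(n, n_splits)  # contiguous, near-equal test blocks
    folds = []
    start = 0
    for k in range(n_splits):
        stop = start + base + (1 if k < extra else 0)
        test = list(range(start, stop))
        # Index span covered by the labels of the test events.
        t0 = min(events[i][0] for i in test)
        t1 = max(events[i][1] for i in test)
        train = []
        for i in range(n):
            if start <= i < stop:
                continue  # observation belongs to the test fold
            entry, exit_ = events[i]
            overlaps = entry <= t1 and exit_ >= t0   # purge: label span touches test span
            embargoed = t1 < entry <= t1 + embargo   # embargo: starts just after the test span
            if not overlaps and not embargoed:
                train.append(i)
        folds.append((train, test))
        start = stop
    return folds


# Ten events, each spanning two steps beyond its entry index.
events = [(i, i + 2) for i in range(10)]
folds = purged_kfold_indices(events, n_splits=5, embargo_pct=0.1)
```

With these events, the third fold tests on observations 4 and 5; observations 2, 3, 6, and 7 are purged because their spans overlap the test span, and observation 8 is dropped by the one-observation embargo.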
accuracy_score ¶
Classification accuracy score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `ndarray` | True labels. | required |
| `y_pred` | `ndarray` | Predicted labels. | required |

Returns:
| Type | Description |
|---|---|
| `float` | Fraction of correct predictions. |
bagging_accuracy ¶
Theoretical accuracy of a bagging ensemble (AFML Ch. 6).
Computes the probability that a majority of n classifiers, each with individual accuracy p, vote correctly.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n` | `int` | Number of classifiers in the ensemble (should be odd). | required |
| `p` | `float` | Individual classifier accuracy (0 to 1). | required |

Returns:
| Type | Description |
|---|---|
| `float` | Ensemble accuracy. |
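
The majority-vote probability can be written as a binomial tail sum. The sketch below assumes a binary problem and independent classifiers; the name `majority_vote_accuracy` is hypothetical and the library may compute the quantity differently.

```python
from math import comb


def majority_vote_accuracy(n, p):
    """P[more than half of n independent classifiers are correct]."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Three classifiers at 60% accuracy: 3 * 0.6^2 * 0.4 + 0.6^3 = 0.648
ensemble = majority_vote_accuracy(3, 0.6)
```

Under these assumptions, an ensemble only improves on its members when p exceeds 0.5; at p = 0.5 the majority vote stays at 0.5.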
cv_score ¶
Cross-validated scoring with purged K-fold (AFML Ch. 7).
Trains and evaluates a classifier on each fold, returning per-fold scores. Uses purged K-fold to prevent leakage from overlapping labels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `classifier` | `object` | An sklearn-compatible classifier with | required |
| `x` | `ndarray` | Feature matrix (n_samples, n_features). | required |
| `y` | `ndarray` | Label vector (n_samples,). | required |
| `events` | `list[tuple[int, int]]` | Event spans for purging. | required |
| `n_splits` | `int` | Number of CV folds. | `5` |
| `embargo_pct` | `float` | Embargo fraction. | `0.0` |
| `sample_weight` | `ndarray` | Per-sample weights for training. | `None` |
| `scoring` | `callable` | Custom scoring function | `None` |

Returns:
| Type | Description |
|---|---|
| `ndarray` | Array of per-fold scores. |
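
The per-fold train/evaluate loop can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes the classifier exposes `fit` and `predict`, takes precomputed (train, test) index pairs instead of building purged folds itself, scores by accuracy, and ignores sample weights. `MajorityClassifier` and `cv_scores` are hypothetical names.

```python
class MajorityClassifier:
    """Toy stand-in with fit/predict: always predicts the most common training label."""

    def fit(self, x, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, x):
        return [self.label_ for _ in x]


def cv_scores(classifier, x, y, folds):
    """Fit on each training split and return the accuracy on each test split."""
    scores = []
    for train_idx, test_idx in folds:
        classifier.fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        pred = classifier.predict([x[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(pred, test_idx))
        scores.append(hits / len(test_idx))
    return scores


x = [[0.0]] * 6
y = [1, 1, 1, 0, 1, 1]
folds = [([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]
scores = cv_scores(MajorityClassifier(), x, y, folds)
```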
f1_score ¶
Binary F1 score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `ndarray` | True binary labels. | required |
| `y_pred` | `ndarray` | Predicted binary labels. | required |

Returns:
| Type | Description |
|---|---|
| `float` | F1 score (harmonic mean of precision and recall). |
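
For reference, the harmonic mean of precision and recall can be computed directly from the confusion counts. The sketch below uses the hypothetical name `binary_f1`, treats label 1 as the positive class, and returns 0.0 when there are no true positives; how the library handles that edge case is an assumption.

```python
def binary_f1(y_true, y_pred):
    """F1 for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# tp=2, fp=1, fn=1, so precision = recall = 2/3 and F1 = 2/3
score = binary_f1([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```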
grid_search ¶
Exhaustive grid search over parameter combinations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `param_grids` | `list[tuple[str, list[float]]]` | Each entry is (parameter_name, values_to_try). | required |
| `score_fn` | `callable` | Function | required |

Returns:
| Type | Description |
|---|---|
| `dict` | |
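
An exhaustive search over the Cartesian product of the value lists might look like the sketch below. The signature of `score_fn` and the contents of the returned dict are not specified above, so both are assumptions here: the scorer is taken to accept a `{name: value}` dict and return a float where higher is better, and the keys `best_params` and `best_score` are hypothetical.

```python
from itertools import product


def grid_search_sketch(param_grids, score_fn):
    """Evaluate every parameter combination and keep the best-scoring one."""
    names = [name for name, _ in param_grids]
    best_params, best_score = None, float("-inf")
    for combo in product(*(values for _, values in param_grids)):
        params = dict(zip(names, combo))
        score = score_fn(params)  # assumed: dict in, float out, higher is better
        if score > best_score:
            best_params, best_score = params, score
    return {"best_params": best_params, "best_score": best_score}


# Toy objective that peaks at a=1, b=3.
result = grid_search_sketch(
    [("a", [0.0, 1.0, 2.0]), ("b", [1.0, 3.0])],
    lambda p: -((p["a"] - 1.0) ** 2) - (p["b"] - 3.0) ** 2,
)
```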
log_uniform_sample ¶
Sample from a log-uniform distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `low` | `float` | Lower bound (> 0). | required |
| `high` | `float` | Upper bound. | required |
| `n` | `int` | Number of samples. | required |
| `seed` | `int` | Random seed. | required |

Returns:
| Type | Description |
|---|---|
| `ndarray` | Log-uniformly distributed samples. |
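
A log-uniform draw is a uniform draw in log space followed by exponentiation, which spreads samples evenly across orders of magnitude. The sketch below uses only the standard library and the hypothetical name `log_uniform`; it returns a list and makes no claim about the random generator the library uses.

```python
import math
import random


def log_uniform(low, high, n, seed):
    """Draw n samples whose logarithm is uniform on [log(low), log(high)]."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    log_low, log_high = math.log(low), math.log(high)
    return [math.exp(rng.uniform(log_low, log_high)) for _ in range(n)]


# Five candidate regularization strengths between 1e-3 and 1e3.
samples = log_uniform(1e-3, 1e3, n=5, seed=0)
```

This is why `low` must be strictly positive: the logarithm of the lower bound has to exist.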
make_classification ¶
Generate a synthetic classification dataset for testing.
Creates a dataset with informative, redundant, and noise features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Number of observations. | required |
| `n_informative` | `int` | Number of truly informative features. | required |
| `n_redundant` | `int` | Number of redundant (linear combinations of informative) features. | required |
| `n_noise` | `int` | Number of pure noise features. | required |
| `seed` | `int` | Random seed. | required |

Returns:
| Type | Description |
|---|---|
| `tuple[ndarray, ndarray]` | (X, y): feature matrix and binary labels. |
mean_decrease_accuracy ¶
Mean Decrease Accuracy (MDA) feature importance (AFML Ch. 8).
Measures each feature's importance by the drop in accuracy when the feature is permuted.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `classifier` | `object` | A fitted sklearn-compatible classifier. | required |
| `x` | `ndarray` | Feature matrix (n_samples, n_features). | required |
| `y` | `ndarray` | True labels. | required |
| `scoring` | `callable` | Scoring function | `None` |
| `seed` | `int` | Random seed for permutation. | `42` |

Returns:
| Type | Description |
|---|---|
| `ndarray` | Importance score per feature (higher = more important). |
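
The permutation idea can be illustrated with a small standard-library sketch: score the fitted model once, then shuffle one column at a time and record how far accuracy falls. The names `permutation_importance` and `SignOfFirst` are hypothetical, rows are assumed to be Python lists, and the sketch uses a single permutation per feature with plain accuracy, which may differ from the library's procedure.

```python
import random


def permutation_importance(classifier, x, y, seed=42):
    """Accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)

    def accuracy(rows):
        pred = classifier.predict(rows)
        return sum(p == t for p, t in zip(pred, y)) / len(y)

    baseline = accuracy(x)
    importances = []
    for j in range(len(x[0])):
        column = [row[j] for row in x]
        rng.shuffle(column)  # break the link between feature j and the labels
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(x, column)]
        importances.append(baseline - accuracy(permuted))
    return importances


class SignOfFirst:
    """Toy fitted model: predicts 1 when the first feature is positive."""

    def predict(self, rows):
        return [1 if row[0] > 0 else 0 for row in rows]


x = [[1.0, 3.0], [-1.0, 7.0], [2.0, -4.0], [-2.0, 0.5]]
y = [1, 0, 1, 0]
importances = permutation_importance(SignOfFirst(), x, y, seed=0)
```

Because the toy model ignores the second feature, shuffling that column cannot change its predictions, so its importance is exactly zero.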
mean_decrease_impurity ¶
Mean Decrease Impurity (MDI) feature importance.
Averages per-tree Gini importances from a random forest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `importances_per_tree` | `list[list[float]]` | Feature importances from each tree (n_trees x n_features). | required |

Returns:
| Type | Description |
|---|---|
| `ndarray` | Mean feature importance across trees. |
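
Averaging across trees amounts to a column-wise mean of the n_trees x n_features input. A minimal sketch, using the hypothetical name `mean_importance` and assuming a plain unweighted average with no extra normalization:

```python
def mean_importance(importances_per_tree):
    """Column-wise mean of per-tree feature importances."""
    n_trees = len(importances_per_tree)
    # zip(*rows) transposes the input so each tuple holds one feature's values.
    return [sum(column) / n_trees for column in zip(*importances_per_tree)]


# Two trees, two features.
averaged = mean_importance([[0.5, 0.5], [0.7, 0.3]])
```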
neg_log_loss ¶
Negative log-loss (cross-entropy).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `ndarray` | True binary labels (0 or 1). | required |
| `y_proba` | `ndarray` | Predicted probabilities for the positive class. | required |

Returns:
| Type | Description |
|---|---|
| `float` | Negative log-loss (higher is better). |
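
Negative log-loss is the mean log-likelihood of the true labels under the predicted probabilities, so it is at most zero and approaches zero for confident, correct predictions. The sketch below uses the hypothetical name `neg_log_loss_sketch`; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def neg_log_loss_sketch(y_true, y_proba, eps=1e-15):
    """Mean of t*log(p) + (1-t)*log(1-p) over all observations."""
    total = 0.0
    for t, p in zip(y_true, y_proba):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# An uninformative 50/50 forecast scores log(0.5) on every observation.
score = neg_log_loss_sketch([1, 0], [0.5, 0.5])
```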
orthogonal_features ¶
Extract orthogonal features via PCA.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `ndarray` | Feature matrix (n_samples, n_features). | required |
| `n_components` | `int` | Number of principal components to retain. | required |

Returns:
| Type | Description |
|---|---|
| `tuple[ndarray, ndarray]` | (transformed, explained_variance_ratio): the projected data of shape (n_samples, n_components) and the variance explained by each component. |
random_search ¶
Random search over parameter distributions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `param_distributions` | `list[tuple[str, float, float]]` | Each entry is (parameter_name, low, high) defining a uniform range. | required |
| `n_iter` | `int` | Number of random combinations to evaluate. | required |
| `score_fn` | `callable` | Function | required |
| `seed` | `int` | Random seed. | required |

Returns:
| Type | Description |
|---|---|
| `dict` | |
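
Random search replaces the exhaustive grid with `n_iter` independent draws from each uniform range. As with the grid search, the scorer's signature and the returned dict are not specified above, so this sketch assumes a scorer that takes a `{name: value}` dict and returns a float where higher is better, and it uses hypothetical result keys `best_params` and `best_score`.

```python
import random


def random_search_sketch(param_distributions, n_iter, score_fn, seed):
    """Sample n_iter parameter sets uniformly and keep the best-scoring one."""
    rng = random.Random(seed)  # seeded so the search is reproducible
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.uniform(low, high) for name, low, high in param_distributions}
        score = score_fn(params)  # assumed: dict in, float out, higher is better
        if score > best_score:
            best_params, best_score = params, score
    return {"best_params": best_params, "best_score": best_score}


def objective(params):
    """Toy objective that peaks at c = 0.5."""
    return -((params["c"] - 0.5) ** 2)


result = random_search_sketch([("c", 0.0, 1.0)], n_iter=20, score_fn=objective, seed=7)
```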
single_feature_importance ¶
Single Feature Importance (SFI) — evaluate each feature independently (AFML Ch. 8).
Trains a separate model on each individual feature and reports cross-validated performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `classifier` | `object` | An sklearn-compatible classifier. | required |
| `x` | `ndarray` | Feature matrix. | required |
| `y` | `ndarray` | Labels. | required |
| `events` | `list[tuple[int, int]]` | Event spans for purged CV. | required |
| `n_splits` | `int` | Number of CV folds. | `5` |
| `scoring` | `callable` | Custom scorer. Defaults to accuracy. | `None` |

Returns:
| Type | Description |
|---|---|
| `ndarray` | Mean CV score per feature. |
weighted_kendall_tau ¶
Weighted Kendall tau rank correlation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `ndarray` | First variable. | required |
| `y` | `ndarray` | Second variable. | required |
| `weights` | `ndarray` | Per-observation weights. | `None` |

Returns:
| Type | Description |
|---|---|
| `float` | Weighted Kendall tau coefficient in [-1, 1]. |