
# Modeling

Cross-validation, feature importance, hyperparameter search, and scoring (AFML Ch. 6–8).

`modeling`

## PurgedKFold

Purged K-Fold cross-validation with embargo (AFML Ch. 7).

Prevents information leakage in time-series data by purging training observations that overlap with test events and, optionally, applying an embargo period after each test set.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_splits` | `int` | Number of folds. | `5` |
| `embargo_pct` | `float` | Fraction of total observations to embargo after each test fold. | `0.0` |

### split

`split(events, n_samples)`

Generate train/test splits with purging and embargo.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `events` | `list[tuple[int, int]]` | List of `(entry_idx, exit_idx)` pairs for each observation. | required |
| `n_samples` | `int` | Total number of samples (must equal `len(events)`). | required |

Returns:

| Type | Description |
| --- | --- |
| `list[FoldIndices]` | Train/test index pairs for each fold. |
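To illustrate the idea, the purge-and-embargo logic can be sketched as a stand-alone generator. This is a minimal illustration, not this library's implementation; `purged_kfold_split` and its details (contiguous folds, span-overlap test) are assumptions:

```python
import numpy as np

def purged_kfold_split(events, n_samples, n_splits=5, embargo_pct=0.0):
    """Sketch: contiguous test folds; training observations whose event span
    overlaps the test span (or falls in the embargo window) are dropped."""
    embargo = int(n_samples * embargo_pct)
    bounds = [(i * n_samples // n_splits, (i + 1) * n_samples // n_splits)
              for i in range(n_splits)]
    for test_start, test_end in bounds:
        test_idx = np.arange(test_start, test_end)
        # The test period covers [earliest entry, latest exit] of test events.
        t0 = min(events[i][0] for i in test_idx)
        t1 = max(events[i][1] for i in test_idx)
        train_idx = np.array([i for i in range(n_samples)
                              if (i < test_start or i >= test_end)
                              and (events[i][1] < t0 or events[i][0] > t1 + embargo)])
        yield train_idx, test_idx
```

Note that with overlapping event spans, a plain K-fold would let test-period information leak into training labels; purging removes exactly those overlapping observations.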

## accuracy_score

`accuracy_score(y_true, y_pred)`

Classification accuracy score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `y_true` | `ndarray` | True labels. | required |
| `y_pred` | `ndarray` | Predicted labels. | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | Fraction of correct predictions. |

## bagging_accuracy

`bagging_accuracy(n, p)`

Theoretical accuracy of a bagging ensemble (AFML Ch. 6).

Computes the probability that a majority of `n` classifiers, each with individual accuracy `p`, vote correctly.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n` | `int` | Number of classifiers in the ensemble (should be odd). | required |
| `p` | `float` | Individual classifier accuracy (0 to 1). | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | Ensemble accuracy. |
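The majority-vote probability is a binomial tail sum, which can be sketched directly (an illustrative version; `bagging_accuracy_sketch` is not this library's function):

```python
from math import comb

def bagging_accuracy_sketch(n, p):
    # P(more than half of n independent classifiers, each correct with
    # probability p, vote correctly) -- the upper tail of Binomial(n, p).
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))
```

For `p > 0.5` the ensemble accuracy grows with `n`; this is the standard argument for bagging weak but better-than-random classifiers.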

## cv_score

`cv_score(classifier, x, y, events, n_splits=5, embargo_pct=0.0, sample_weight=None, scoring=None)`

Cross-validated scoring with purged K-fold (AFML Ch. 7).

Trains and evaluates a classifier on each fold, returning per-fold scores. Uses purged K-fold to prevent leakage from overlapping labels.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `classifier` | `object` | An sklearn-compatible classifier with `.fit(X, y)` and `.predict(X)`. | required |
| `x` | `ndarray` | Feature matrix `(n_samples, n_features)`. | required |
| `y` | `ndarray` | Label vector `(n_samples,)`. | required |
| `events` | `list[tuple[int, int]]` | Event spans for purging. | required |
| `n_splits` | `int` | Number of CV folds. | `5` |
| `embargo_pct` | `float` | Embargo fraction. | `0.0` |
| `sample_weight` | `ndarray` | Per-sample weights for training. | `None` |
| `scoring` | `callable` | Custom scoring function `f(y_true, y_pred) -> float`. Defaults to accuracy. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Array of per-fold scores. |
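The fit-and-score loop can be sketched as follows. Everything here is illustrative (the toy `MajorityClassifier` and `cv_score_sketch` are not part of this library); the folds would come from a purged splitter in practice:

```python
import numpy as np

class MajorityClassifier:
    """Toy stand-in for any object exposing .fit / .predict."""
    def fit(self, x, y):
        vals, counts = np.unique(y, return_counts=True)
        self.label_ = vals[np.argmax(counts)]
        return self
    def predict(self, x):
        return np.full(len(x), self.label_)

def cv_score_sketch(clf, x, y, folds, scoring=None):
    # Default scorer: plain accuracy.
    scoring = scoring or (lambda yt, yp: float(np.mean(yt == yp)))
    scores = []
    for train_idx, test_idx in folds:
        clf.fit(x[train_idx], y[train_idx])
        scores.append(scoring(y[test_idx], clf.predict(x[test_idx])))
    return np.array(scores)
```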

## f1_score

`f1_score(y_true, y_pred)`

Binary F1 score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `y_true` | `ndarray` | True binary labels. | required |
| `y_pred` | `ndarray` | Predicted binary labels. | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | F1 score (harmonic mean of precision and recall). |
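The computation from confusion-matrix counts can be sketched as (illustrative only; `f1_score_sketch` is not this library's function):

```python
import numpy as np

def f1_score_sketch(y_true, y_pred):
    # F1 = 2PR / (P + R), from true-positive, false-positive, false-negative counts.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```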

## grid_search

`grid_search(param_grids, score_fn)`

Exhaustive grid search over parameter combinations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `param_grids` | `list[tuple[str, list[float]]]` | Each entry is `(parameter_name, values_to_try)`. | required |
| `score_fn` | `callable` | Function `f(params_dict) -> float` that evaluates a parameter set. | required |

Returns:

| Type | Description |
| --- | --- |
| `dict` | `{"best_params": {name: value, ...}, "best_score": float}` |
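An exhaustive search is a Cartesian product over the value lists, keeping the best-scoring combination. A minimal sketch (not this library's implementation):

```python
from itertools import product

def grid_search_sketch(param_grids, score_fn):
    # Enumerate every combination of parameter values; track the best score.
    names = [name for name, _ in param_grids]
    best = {"best_params": None, "best_score": float("-inf")}
    for values in product(*(vals for _, vals in param_grids)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best["best_score"]:
            best = {"best_params": params, "best_score": score}
    return best
```

The grid size is the product of the list lengths, so cost grows exponentially with the number of parameters; `random_search` below trades exhaustiveness for a fixed budget.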

## log_uniform_sample

`log_uniform_sample(low, high, n, seed)`

Sample from a log-uniform distribution.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `low` | `float` | Lower bound (> 0). | required |
| `high` | `float` | Upper bound. | required |
| `n` | `int` | Number of samples. | required |
| `seed` | `int` | Random seed. | required |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Log-uniformly distributed samples. |
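Log-uniform sampling draws uniformly in log-space and exponentiates, which is the usual choice for scale parameters such as regularization strengths. A sketch (illustrative, not this library's code):

```python
import numpy as np

def log_uniform_sample_sketch(low, high, n, seed):
    # Uniform in log-space, then map back: every decade in [low, high]
    # receives the same expected number of samples.
    rng = np.random.default_rng(seed)
    return np.exp(rng.uniform(np.log(low), np.log(high), size=n))
```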

## make_classification

`make_classification(n_samples, n_informative, n_redundant, n_noise, seed)`

Generate a synthetic classification dataset for testing.

Creates a dataset with informative, redundant, and noise features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | `int` | Number of observations. | required |
| `n_informative` | `int` | Number of truly informative features. | required |
| `n_redundant` | `int` | Number of redundant (linear combinations of informative) features. | required |
| `n_noise` | `int` | Number of pure noise features. | required |
| `seed` | `int` | Random seed. | required |

Returns:

| Type | Description |
| --- | --- |
| `tuple[ndarray, ndarray]` | `(X, y)` — feature matrix and binary labels. |
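The three feature blocks can be sketched as follows. The labeling rule and the column ordering are assumptions of this sketch, not a description of the library's generator:

```python
import numpy as np

def make_classification_sketch(n_samples, n_informative, n_redundant, n_noise, seed):
    rng = np.random.default_rng(seed)
    informative = rng.normal(size=(n_samples, n_informative))
    # Redundant features: random linear combinations of the informative block.
    mix = rng.normal(size=(n_informative, n_redundant))
    redundant = informative @ mix
    # Noise features carry no signal at all.
    noise = rng.normal(size=(n_samples, n_noise))
    x = np.hstack([informative, redundant, noise])
    # Labels depend only on the informative block (assumed rule).
    y = (informative.sum(axis=1) > 0).astype(int)
    return x, y
```

Such a dataset is useful for checking feature-importance methods: MDA and SFI should rank informative and redundant columns above noise columns.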

## mean_decrease_accuracy

`mean_decrease_accuracy(classifier, x, y, scoring=None, seed=42)`

Mean Decrease Accuracy (MDA) feature importance (AFML Ch. 8).

Measures each feature's importance by the drop in accuracy when the feature is permuted.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `classifier` | `object` | A fitted sklearn-compatible classifier. | required |
| `x` | `ndarray` | Feature matrix `(n_samples, n_features)`. | required |
| `y` | `ndarray` | True labels. | required |
| `scoring` | `callable` | Scoring function `f(y_true, y_pred) -> float`. Defaults to accuracy. | `None` |
| `seed` | `int` | Random seed for permutation. | `42` |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Importance score per feature (higher = more important). |
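The permutation idea can be sketched with any prediction function (illustrative only; `mda_sketch` is not this library's implementation, and a single permutation per feature is a simplification):

```python
import numpy as np

def mda_sketch(predict, x, y, seed=42):
    # Importance of feature j = baseline accuracy minus accuracy after
    # shuffling column j, which destroys its relationship to the labels.
    rng = np.random.default_rng(seed)
    base = np.mean(predict(x) == y)
    importances = []
    for j in range(x.shape[1]):
        x_perm = x.copy()
        x_perm[:, j] = rng.permutation(x_perm[:, j])
        importances.append(base - np.mean(predict(x_perm) == y))
    return np.array(importances)
```

A feature the model ignores scores exactly zero; a feature the model relies on scores roughly the accuracy lost when its values are scrambled.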

## mean_decrease_impurity

`mean_decrease_impurity(importances_per_tree)`

Mean Decrease Impurity (MDI) feature importance.

Averages per-tree Gini importances from a random forest.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `importances_per_tree` | `list[list[float]]` | Feature importances from each tree (n_trees x n_features). | required |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Mean feature importance across trees. |
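The aggregation is a column-wise mean over the per-tree rows, sketched below (illustrative; whether the library renormalizes per tree is not stated here):

```python
import numpy as np

def mdi_sketch(importances_per_tree):
    # Stack the (n_trees x n_features) importances and average across trees.
    return np.asarray(importances_per_tree, dtype=float).mean(axis=0)
```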

## neg_log_loss

`neg_log_loss(y_true, y_proba)`

Negative log-loss (cross-entropy).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `y_true` | `ndarray` | True binary labels (0 or 1). | required |
| `y_proba` | `ndarray` | Predicted probabilities for the positive class. | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | Negative log-loss (higher is better). |
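Negating the cross-entropy makes the metric "higher is better", so it composes with the other scorers here. A sketch (illustrative; the clipping constant is an assumption):

```python
import numpy as np

def neg_log_loss_sketch(y_true, y_proba, eps=1e-15):
    # Clip probabilities away from 0 and 1 to avoid log(0), then return
    # the negated mean binary cross-entropy.
    p = np.clip(y_proba, eps, 1 - eps)
    return float(np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```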

## orthogonal_features

`orthogonal_features(x, n_components)`

Extract orthogonal features via PCA.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `ndarray` | Feature matrix `(n_samples, n_features)`. | required |
| `n_components` | `int` | Number of principal components to retain. | required |

Returns:

| Type | Description |
| --- | --- |
| `tuple[ndarray, ndarray]` | `(transformed, explained_variance_ratio)` — the projected data of shape `(n_samples, n_components)` and the variance explained by each component. |
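The PCA projection can be sketched via an eigen-decomposition of the covariance matrix (an illustrative version; whether the library standardizes columns first is not stated here):

```python
import numpy as np

def orthogonal_features_sketch(x, n_components):
    # Demean, decompose the covariance, and project onto the top eigenvectors.
    xc = x - x.mean(axis=0)
    cov = np.cov(xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]  # largest variance first
    components = eigvec[:, order]
    explained = eigval[order] / eigval.sum()
    return xc @ components, explained
```

Because the components are eigenvectors of the covariance matrix, the projected columns are uncorrelated, which removes the substitution effects that distort MDI and MDA on collinear features.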

## random_search

`random_search(param_distributions, n_iter, score_fn, seed)`

Random search over parameter distributions.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `param_distributions` | `list[tuple[str, float, float]]` | Each entry is `(parameter_name, low, high)` defining a uniform range. | required |
| `n_iter` | `int` | Number of random combinations to evaluate. | required |
| `score_fn` | `callable` | Function `f(params_dict) -> float`. | required |
| `seed` | `int` | Random seed. | required |

Returns:

| Type | Description |
| --- | --- |
| `dict` | `{"best_params": {name: value, ...}, "best_score": float}` |
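A fixed-budget random search draws each parameter uniformly from its range on every iteration, sketched below (illustrative, not this library's implementation):

```python
import numpy as np

def random_search_sketch(param_distributions, n_iter, score_fn, seed):
    # Draw n_iter random parameter sets; keep the best-scoring one.
    rng = np.random.default_rng(seed)
    best = {"best_params": None, "best_score": float("-inf")}
    for _ in range(n_iter):
        params = {name: rng.uniform(low, high)
                  for name, low, high in param_distributions}
        score = score_fn(params)
        if score > best["best_score"]:
            best = {"best_params": params, "best_score": score}
    return best
```

Unlike grid search, the cost is `n_iter` regardless of dimensionality; for scale parameters, sampling the range via `log_uniform_sample` is often preferable to a plain uniform draw.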

## single_feature_importance

`single_feature_importance(classifier, x, y, events, n_splits=5, scoring=None)`

Single Feature Importance (SFI) — evaluate each feature independently (AFML Ch. 8).

Trains a separate model on each individual feature and reports cross-validated performance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `classifier` | `object` | An sklearn-compatible classifier. | required |
| `x` | `ndarray` | Feature matrix. | required |
| `y` | `ndarray` | Labels. | required |
| `events` | `list[tuple[int, int]]` | Event spans for purged CV. | required |
| `n_splits` | `int` | Number of CV folds. | `5` |
| `scoring` | `callable` | Custom scorer. Defaults to accuracy. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Mean CV score per feature. |
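The one-feature-at-a-time loop can be sketched as follows. The toy `SignClassifier` and `sfi_sketch` are illustrative only; in practice the folds would come from purged CV on `events`:

```python
import numpy as np

class SignClassifier:
    # Toy one-feature classifier: predict 1 where the feature is positive.
    def fit(self, x, y):
        return self
    def predict(self, x):
        return (x[:, 0] > 0).astype(int)

def sfi_sketch(make_clf, x, y, folds, scoring=None):
    # Train and score on each column in isolation; report the mean CV score.
    scoring = scoring or (lambda yt, yp: float(np.mean(yt == yp)))
    out = []
    for j in range(x.shape[1]):
        fold_scores = []
        for tr, te in folds:
            clf = make_clf().fit(x[tr, j:j + 1], y[tr])
            fold_scores.append(scoring(y[te], clf.predict(x[te, j:j + 1])))
        out.append(np.mean(fold_scores))
    return np.array(out)
```

Because each model sees only one feature, SFI is immune to the substitution effects that plague MDI and MDA when features are correlated, at the cost of missing joint effects.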

## weighted_kendall_tau

`weighted_kendall_tau(x, y, weights=None)`

Weighted Kendall tau rank correlation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `ndarray` | First variable. | required |
| `y` | `ndarray` | Second variable. | required |
| `weights` | `ndarray` | Per-observation weights. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `float` | Weighted Kendall tau coefficient in [-1, 1]. |
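One way to weight Kendall's tau is to score each pair by the product of the two observation weights; that pair-weighting scheme is an assumption of this sketch, not necessarily what this library uses:

```python
import numpy as np

def weighted_kendall_tau_sketch(x, y, weights=None):
    # Weighted average of pairwise concordance: +1 if a pair is concordant,
    # -1 if discordant, 0 if tied, each pair weighted by w[i] * w[j].
    n = len(x)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            wij = w[i] * w[j]
            num += wij * np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            den += wij
    return num / den
```

With equal weights this reduces to the ordinary (tau-a) coefficient: +1 for perfectly concordant rankings, -1 for perfectly reversed ones.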