
# Modeling

Cross-validation, feature importance, hyperparameter search, and scoring (AFML Ch. 6–8).

`modeling`

## PurgedKFold

Purged K-Fold cross-validation with embargo (AFML Ch. 7).

Prevents information leakage in time-series data by purging training observations that overlap with test events and, optionally, applying an embargo period after each test set.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_splits` | `int` | Number of folds. | `5` |
| `embargo_pct` | `float` | Fraction of total observations to embargo after each test fold. | `0.0` |

### split

`split(events, n_samples)`

Generate train/test splits with purging and embargo.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `events` | `list[tuple[int, int]]` | List of `(entry_idx, exit_idx)` pairs for each observation. | required |
| `n_samples` | `int` | Total number of samples (must equal `len(events)`). | required |

Returns:

| Type | Description |
| --- | --- |
| `list[FoldIndices]` | Train/test index pairs for each fold. |
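To illustrate the idea, the purge-and-embargo logic can be sketched as a stand-alone generator. This is a minimal illustration, not this library's implementation; `purged_kfold_split` and its details (contiguous folds, span-overlap test) are assumptions:

```python
import numpy as np

def purged_kfold_split(events, n_samples, n_splits=5, embargo_pct=0.0):
    """Sketch: contiguous test folds; training observations whose event span
    overlaps the test span (or falls in the embargo window) are dropped."""
    embargo = int(n_samples * embargo_pct)
    bounds = [(i * n_samples // n_splits, (i + 1) * n_samples // n_splits)
              for i in range(n_splits)]
    for test_start, test_end in bounds:
        test_idx = np.arange(test_start, test_end)
        # The test period covers [earliest entry, latest exit] of test events.
        t0 = min(events[i][0] for i in test_idx)
        t1 = max(events[i][1] for i in test_idx)
        train_idx = np.array([i for i in range(n_samples)
                              if (i < test_start or i >= test_end)
                              and (events[i][1] < t0 or events[i][0] > t1 + embargo)])
        yield train_idx, test_idx
```

Note that with overlapping event spans, a plain K-fold would let test-period information leak into training labels; purging removes exactly those overlapping observations.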

## accuracy_score

`accuracy_score(y_true, y_pred)`

Classification accuracy score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `y_true` | `ndarray` | True labels. | required |
| `y_pred` | `ndarray` | Predicted labels. | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | Fraction of correct predictions. |

## bagging_accuracy

`bagging_accuracy(n, p)`

Theoretical accuracy of a bagging ensemble (AFML Ch. 6).

Computes the probability that a majority of `n` classifiers, each with individual accuracy `p`, vote correctly.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n` | `int` | Number of classifiers in the ensemble (should be odd). | required |
| `p` | `float` | Individual classifier accuracy (0 to 1). | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | Ensemble accuracy. |
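The majority-vote probability is a binomial tail sum, which can be sketched directly (an illustrative version; `bagging_accuracy_sketch` is not this library's function):

```python
from math import comb

def bagging_accuracy_sketch(n, p):
    # P(more than half of n independent classifiers, each correct with
    # probability p, vote correctly) -- the upper tail of Binomial(n, p).
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))
```

For `p > 0.5` the ensemble accuracy grows with `n`; this is the standard argument for bagging weak but better-than-random classifiers.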

## cv_score

`cv_score(classifier, x, y, events, n_splits=5, embargo_pct=0.0, sample_weight=None, scoring=None)`

Cross-validated scoring with purged K-fold (AFML Ch. 7).

Trains and evaluates a classifier on each fold, returning per-fold scores. Uses purged K-fold to prevent leakage from overlapping labels.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `classifier` | `object` | An sklearn-compatible classifier with `.fit(X, y)` and `.predict(X)`. | required |
| `x` | `ndarray` | Feature matrix `(n_samples, n_features)`. | required |
| `y` | `ndarray` | Label vector `(n_samples,)`. | required |
| `events` | `list[tuple[int, int]]` | Event spans for purging. | required |
| `n_splits` | `int` | Number of CV folds. | `5` |
| `embargo_pct` | `float` | Embargo fraction. | `0.0` |
| `sample_weight` | `ndarray` | Per-sample weights for training. | `None` |
| `scoring` | `callable` | Custom scoring function `f(y_true, y_pred) -> float`. Defaults to accuracy. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Array of per-fold scores. |
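The fit-and-score loop can be sketched as follows. Everything here is illustrative (the toy `MajorityClassifier` and `cv_score_sketch` are not part of this library); the folds would come from a purged splitter in practice:

```python
import numpy as np

class MajorityClassifier:
    """Toy stand-in for any object exposing .fit / .predict."""
    def fit(self, x, y):
        vals, counts = np.unique(y, return_counts=True)
        self.label_ = vals[np.argmax(counts)]
        return self
    def predict(self, x):
        return np.full(len(x), self.label_)

def cv_score_sketch(clf, x, y, folds, scoring=None):
    # Default scorer: plain accuracy.
    scoring = scoring or (lambda yt, yp: float(np.mean(yt == yp)))
    scores = []
    for train_idx, test_idx in folds:
        clf.fit(x[train_idx], y[train_idx])
        scores.append(scoring(y[test_idx], clf.predict(x[test_idx])))
    return np.array(scores)
```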

## f1_score

`f1_score(y_true, y_pred)`

Binary F1 score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `y_true` | `ndarray` | True binary labels. | required |
| `y_pred` | `ndarray` | Predicted binary labels. | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | F1 score (harmonic mean of precision and recall). |
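The computation from confusion-matrix counts can be sketched as (illustrative only; `f1_score_sketch` is not this library's function):

```python
import numpy as np

def f1_score_sketch(y_true, y_pred):
    # F1 = 2PR / (P + R), from true-positive, false-positive, false-negative counts.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```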

## grid_search

`grid_search(param_grids, score_fn)`

Exhaustive grid search over parameter combinations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `param_grids` | `list[tuple[str, list[float]]]` | Each entry is `(parameter_name, values_to_try)`. | required |
| `score_fn` | `callable` | Function `f(params_dict) -> float` that evaluates a parameter set. | required |

Returns:

| Type | Description |
| --- | --- |
| `dict` | `{"best_params": {name: value, ...}, "best_score": float}` |
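An exhaustive search is a Cartesian product over the value lists, keeping the best-scoring combination. A minimal sketch (not this library's implementation):

```python
from itertools import product

def grid_search_sketch(param_grids, score_fn):
    # Enumerate every combination of parameter values; track the best score.
    names = [name for name, _ in param_grids]
    best = {"best_params": None, "best_score": float("-inf")}
    for values in product(*(vals for _, vals in param_grids)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best["best_score"]:
            best = {"best_params": params, "best_score": score}
    return best
```

The grid size is the product of the list lengths, so cost grows exponentially with the number of parameters; `random_search` below trades exhaustiveness for a fixed budget.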

## log_uniform_sample

`log_uniform_sample(low, high, n, seed)`

Sample from a log-uniform distribution.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `low` | `float` | Lower bound (> 0). | required |
| `high` | `float` | Upper bound. | required |
| `n` | `int` | Number of samples. | required |
| `seed` | `int` | Random seed. | required |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Log-uniformly distributed samples. |
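Log-uniform sampling draws uniformly in log-space and exponentiates, which is the usual choice for scale parameters such as regularization strengths. A sketch (illustrative, not this library's code):

```python
import numpy as np

def log_uniform_sample_sketch(low, high, n, seed):
    # Uniform in log-space, then map back: every decade in [low, high]
    # receives the same expected number of samples.
    rng = np.random.default_rng(seed)
    return np.exp(rng.uniform(np.log(low), np.log(high), size=n))
```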

## make_classification

`make_classification(n_samples, n_informative, n_redundant, n_noise, seed)`

Generate a synthetic classification dataset for testing.

Creates a dataset with informative, redundant, and noise features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_samples` | `int` | Number of observations. | required |
| `n_informative` | `int` | Number of truly informative features. | required |
| `n_redundant` | `int` | Number of redundant (linear combinations of informative) features. | required |
| `n_noise` | `int` | Number of pure noise features. | required |
| `seed` | `int` | Random seed. | required |

Returns:

| Type | Description |
| --- | --- |
| `tuple[ndarray, ndarray]` | `(X, y)` — feature matrix and binary labels. |
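The three feature blocks can be sketched as follows. The labeling rule and the column ordering are assumptions of this sketch, not a description of the library's generator:

```python
import numpy as np

def make_classification_sketch(n_samples, n_informative, n_redundant, n_noise, seed):
    rng = np.random.default_rng(seed)
    informative = rng.normal(size=(n_samples, n_informative))
    # Redundant features: random linear combinations of the informative block.
    mix = rng.normal(size=(n_informative, n_redundant))
    redundant = informative @ mix
    # Noise features carry no signal at all.
    noise = rng.normal(size=(n_samples, n_noise))
    x = np.hstack([informative, redundant, noise])
    # Labels depend only on the informative block (assumed rule).
    y = (informative.sum(axis=1) > 0).astype(int)
    return x, y
```

Such a dataset is useful for checking feature-importance methods: MDA and SFI should rank informative and redundant columns above noise columns.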

## mean_decrease_accuracy

`mean_decrease_accuracy(classifier, x, y, scoring=None, seed=42)`

Mean Decrease Accuracy (MDA) feature importance (AFML Ch. 8).

Measures each feature's importance by the drop in accuracy when the feature is permuted.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `classifier` | `object` | A fitted sklearn-compatible classifier. | required |
| `x` | `ndarray` | Feature matrix `(n_samples, n_features)`. | required |
| `y` | `ndarray` | True labels. | required |
| `scoring` | `callable` | Scoring function `f(y_true, y_pred) -> float`. Defaults to accuracy. | `None` |
| `seed` | `int` | Random seed for permutation. | `42` |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Importance score per feature (higher = more important). |
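The permutation idea can be sketched with any prediction function (illustrative only; `mda_sketch` is not this library's implementation, and a single permutation per feature is a simplification):

```python
import numpy as np

def mda_sketch(predict, x, y, seed=42):
    # Importance of feature j = baseline accuracy minus accuracy after
    # shuffling column j, which destroys its relationship to the labels.
    rng = np.random.default_rng(seed)
    base = np.mean(predict(x) == y)
    importances = []
    for j in range(x.shape[1]):
        x_perm = x.copy()
        x_perm[:, j] = rng.permutation(x_perm[:, j])
        importances.append(base - np.mean(predict(x_perm) == y))
    return np.array(importances)
```

A feature the model ignores scores exactly zero; a feature the model relies on scores roughly the accuracy lost when its values are scrambled.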

## mean_decrease_impurity

`mean_decrease_impurity(importances_per_tree)`

Mean Decrease Impurity (MDI) feature importance.

Averages per-tree Gini importances from a random forest.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `importances_per_tree` | `list[list[float]]` | Feature importances from each tree (n_trees x n_features). | required |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Mean feature importance across trees. |
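The aggregation is a column-wise mean over the per-tree rows, sketched below (illustrative; whether the library renormalizes per tree is not stated here):

```python
import numpy as np

def mdi_sketch(importances_per_tree):
    # Stack the (n_trees x n_features) importances and average across trees.
    return np.asarray(importances_per_tree, dtype=float).mean(axis=0)
```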

## neg_log_loss

`neg_log_loss(y_true, y_proba)`

Negative log-loss (cross-entropy).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `y_true` | `ndarray` | True binary labels (0 or 1). | required |
| `y_proba` | `ndarray` | Predicted probabilities for the positive class. | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | Negative log-loss (higher is better). |
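Negating the cross-entropy makes the metric "higher is better", so it composes with the other scorers here. A sketch (illustrative; the clipping constant is an assumption):

```python
import numpy as np

def neg_log_loss_sketch(y_true, y_proba, eps=1e-15):
    # Clip probabilities away from 0 and 1 to avoid log(0), then return
    # the negated mean binary cross-entropy.
    p = np.clip(y_proba, eps, 1 - eps)
    return float(np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```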

## orthogonal_features

`orthogonal_features(x, n_components)`

Extract orthogonal features via PCA.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `ndarray` | Feature matrix `(n_samples, n_features)`. | required |
| `n_components` | `int` | Number of principal components to retain. | required |

Returns:

| Type | Description |
| --- | --- |
| `tuple[ndarray, ndarray]` | `(transformed, explained_variance_ratio)` — the projected data of shape `(n_samples, n_components)` and the variance explained by each component. |
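The PCA projection can be sketched via an eigen-decomposition of the covariance matrix (an illustrative version; whether the library standardizes columns first is not stated here):

```python
import numpy as np

def orthogonal_features_sketch(x, n_components):
    # Demean, decompose the covariance, and project onto the top eigenvectors.
    xc = x - x.mean(axis=0)
    cov = np.cov(xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]  # largest variance first
    components = eigvec[:, order]
    explained = eigval[order] / eigval.sum()
    return xc @ components, explained
```

Because the components are eigenvectors of the covariance matrix, the projected columns are uncorrelated, which removes the substitution effects that distort MDI and MDA on collinear features.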

## random_search

`random_search(param_distributions, n_iter, score_fn, seed)`

Random search over parameter distributions.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `param_distributions` | `list[tuple[str, float, float]]` | Each entry is `(parameter_name, low, high)` defining a uniform range. | required |
| `n_iter` | `int` | Number of random combinations to evaluate. | required |
| `score_fn` | `callable` | Function `f(params_dict) -> float`. | required |
| `seed` | `int` | Random seed. | required |

Returns:

| Type | Description |
| --- | --- |
| `dict` | `{"best_params": {name: value, ...}, "best_score": float}` |
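A fixed-budget random search draws each parameter uniformly from its range on every iteration, sketched below (illustrative, not this library's implementation):

```python
import numpy as np

def random_search_sketch(param_distributions, n_iter, score_fn, seed):
    # Draw n_iter random parameter sets; keep the best-scoring one.
    rng = np.random.default_rng(seed)
    best = {"best_params": None, "best_score": float("-inf")}
    for _ in range(n_iter):
        params = {name: rng.uniform(low, high)
                  for name, low, high in param_distributions}
        score = score_fn(params)
        if score > best["best_score"]:
            best = {"best_params": params, "best_score": score}
    return best
```

Unlike grid search, the cost is `n_iter` regardless of dimensionality; for scale parameters, sampling the range via `log_uniform_sample` is often preferable to a plain uniform draw.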

## single_feature_importance

`single_feature_importance(classifier, x, y, events, n_splits=5, scoring=None)`

Single Feature Importance (SFI) — evaluate each feature independently (AFML Ch. 8).

Trains a separate model on each individual feature and reports cross-validated performance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `classifier` | `object` | An sklearn-compatible classifier. | required |
| `x` | `ndarray` | Feature matrix. | required |
| `y` | `ndarray` | Labels. | required |
| `events` | `list[tuple[int, int]]` | Event spans for purged CV. | required |
| `n_splits` | `int` | Number of CV folds. | `5` |
| `scoring` | `callable` | Custom scorer. Defaults to accuracy. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Mean CV score per feature. |
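The one-feature-at-a-time loop can be sketched as follows. The toy `SignClassifier` and `sfi_sketch` are illustrative only; in practice the folds would come from purged CV on `events`:

```python
import numpy as np

class SignClassifier:
    # Toy one-feature classifier: predict 1 where the feature is positive.
    def fit(self, x, y):
        return self
    def predict(self, x):
        return (x[:, 0] > 0).astype(int)

def sfi_sketch(make_clf, x, y, folds, scoring=None):
    # Train and score on each column in isolation; report the mean CV score.
    scoring = scoring or (lambda yt, yp: float(np.mean(yt == yp)))
    out = []
    for j in range(x.shape[1]):
        fold_scores = []
        for tr, te in folds:
            clf = make_clf().fit(x[tr, j:j + 1], y[tr])
            fold_scores.append(scoring(y[te], clf.predict(x[te, j:j + 1])))
        out.append(np.mean(fold_scores))
    return np.array(out)
```

Because each model sees only one feature, SFI is immune to the substitution effects that plague MDI and MDA when features are correlated, at the cost of missing joint effects.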

## weighted_kendall_tau

`weighted_kendall_tau(x, y, weights=None)`

Weighted Kendall tau rank correlation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `ndarray` | First variable. | required |
| `y` | `ndarray` | Second variable. | required |
| `weights` | `ndarray` | Per-observation weights. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `float` | Weighted Kendall tau coefficient in [-1, 1]. |
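One way to weight Kendall's tau is to score each pair by the product of the two observation weights; that pair-weighting scheme is an assumption of this sketch, not necessarily what this library uses:

```python
import numpy as np

def weighted_kendall_tau_sketch(x, y, weights=None):
    # Weighted average of pairwise concordance: +1 if a pair is concordant,
    # -1 if discordant, 0 if tied, each pair weighted by w[i] * w[j].
    n = len(x)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            wij = w[i] * w[j]
            num += wij * np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            den += wij
    return num / den
```

With equal weights this reduces to the ordinary (tau-a) coefficient: +1 for perfectly concordant rankings, -1 for perfectly reversed ones.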