Features

Feature engineering: structural breaks, entropy, microstructure, RMT denoising, portfolio allocation, clustering, and codependence measures.

Structural Breaks

SADF, GSADF, and CUSUM tests for detecting regime changes (López de Prado, Advances in Financial Machine Learning, hereafter AFML, Ch. 17).

Entropy

Shannon, Lempel-Ziv, Kontoyiannis, and Gaussian entropy estimators (AFML Ch. 18).

Microstructure

Market microstructure features: Amihud lambda, Kyle lambda, Hasbrouck lambda, Roll spread, Corwin-Schultz spread, VPIN (AFML Ch. 19).

Denoising

Random Matrix Theory (RMT) denoising and detoning of correlation/covariance matrices.

Allocation

Portfolio allocation: HRP, CLA (min-variance and max-Sharpe), and Inverse Variance (AFML Ch. 16).

Clustering

K-means clustering and the Optimal Number of Clusters (ONC) algorithm (López de Prado, Machine Learning for Asset Managers, Ch. 4).

Codependence

Pairwise dependence measures: Spearman, distance correlation, mutual information, variation of information, optimal transport, angular/GPR/GNPR distances.

features

adf_test

adf_test(series, max_lags)

Augmented Dickey-Fuller unit root test.

Tests whether a time series is stationary by fitting an autoregressive model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| series | ndarray | Time series to test. | required |
| max_lags | int | Maximum number of autoregressive lags. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[float, ndarray] | (adf_statistic, regression_coefficients). |

amihud_lambda

amihud_lambda(returns, dollar_volumes)

Amihud illiquidity measure (AFML Ch. 19).

Measures price impact as the average ratio of absolute return to dollar volume.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| returns | ndarray | Return series. | required |
| dollar_volumes | ndarray | Dollar volume series (same length as returns). | required |

Returns:

| Type | Description |
| --- | --- |
| float | Amihud lambda (higher = less liquid). |
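
The description above (average ratio of absolute return to dollar volume) can be sketched in plain Python. This is an illustration, not the library's implementation: the name `amihud_lambda_sketch` is hypothetical, and skipping bars with zero dollar volume is an assumption.

```python
def amihud_lambda_sketch(returns, dollar_volumes):
    """Mean of |return| / dollar volume over bars with positive volume."""
    if len(returns) != len(dollar_volumes):
        raise ValueError("returns and dollar_volumes must have the same length")
    # Assumption: bars with zero dollar volume are skipped to avoid division by zero.
    ratios = [abs(r) / v for r, v in zip(returns, dollar_volumes) if v > 0]
    if not ratios:
        raise ValueError("no bars with positive dollar volume")
    return sum(ratios) / len(ratios)


returns = [0.01, -0.02, 0.005]
dollar_volumes = [1_000_000.0, 2_000_000.0, 500_000.0]
lam = amihud_lambda_sketch(returns, dollar_volumes)
print(lam)  # each bar moves 1e-8 per dollar traded, so the mean is about 1e-8
```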

amihud_lambda_rolling

amihud_lambda_rolling(returns, dollar_volumes, window)

Rolling Amihud lambda over a sliding window.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| returns | ndarray | Return series. | required |
| dollar_volumes | ndarray | Dollar volume series. | required |
| window | int | Rolling window size. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Rolling Amihud lambda values. |

angular_distance

angular_distance(x, y)

Angular distance derived from Pearson correlation.

d = sqrt(0.5 * (1 - rho))

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | ndarray | First variable. | required |
| y | ndarray | Second variable. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Angular distance in [0, 1]. |
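
As a worked illustration of d = sqrt(0.5 * (1 - rho)), the following sketch computes the Pearson correlation by hand and maps it to a distance. The helper names are hypothetical and the code only mirrors the formula stated above.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def angular_distance_sketch(x, y):
    rho = pearson(x, y)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 0.5 * (1.0 - rho)))


x = [1.0, 2.0, 3.0, 4.0]
d_same = angular_distance_sketch(x, [2.0, 4.0, 6.0, 8.0])      # rho = +1
d_opposite = angular_distance_sketch(x, [4.0, 3.0, 2.0, 1.0])  # rho = -1
print(d_same, d_opposite)
```

Perfectly correlated series sit at distance 0 and perfectly anti-correlated series at distance 1, which is why the result lies in [0, 1].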

binary_encode

binary_encode(values)

Binary encode a real-valued series (above/below median).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| values | ndarray | Input series. | required |

Returns:

| Type | Description |
| --- | --- |
| list[bool] | True where value >= median, False otherwise. |
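
A minimal sketch of the median rule described above, using only the standard library (the function name is hypothetical):

```python
import statistics


def binary_encode_sketch(values):
    """True where the value is at or above the median, False otherwise."""
    med = statistics.median(values)
    return [v >= med for v in values]


encoded = binary_encode_sketch([0.3, -1.2, 0.8, 0.0, 2.1])
print(encoded)  # the median is 0.3, so 0.3, 0.8 and 2.1 map to True
```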

brown_durbin_evans

brown_durbin_evans(residuals)

Brown-Durbin-Evans CUSUM test for parameter instability.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| residuals | ndarray | OLS regression residuals. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[ndarray, float] | (cusum_series, critical_value): the CUSUM path and the 5% significance boundary. |

chu_stinchcombe_white

chu_stinchcombe_white(log_prices, critical_value)

Chu-Stinchcombe-White CUSUM test for structural breaks in log prices.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| log_prices | ndarray | Log price series. | required |
| critical_value | float | Significance threshold. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | CUSUM statistic series. |

cla_max_sharpe

cla_max_sharpe(expected_returns, cov)

CLA maximum Sharpe ratio portfolio.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| expected_returns | ndarray | Expected return per asset (n,). | required |
| cov | ndarray | Covariance matrix (n x n). | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Max-Sharpe weights (n,). |

cla_min_variance

cla_min_variance(cov)

Critical Line Algorithm (CLA) minimum-variance portfolio.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cov | ndarray | Covariance matrix (n x n). | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Minimum-variance weights (n,). |

cluster_kmeans_base

cluster_kmeans_base(corr, max_clusters=None, min_clusters=None, n_init=None, seed=None)

Base-level K-means clustering over a range of k values.

Tries multiple cluster counts and selects the one with the best silhouette score.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corr | ndarray | Correlation matrix. | required |
| max_clusters | int | Maximum k to try. | None |
| min_clusters | int | Minimum k to try. | None |
| n_init | int | Initializations per k. | None |
| seed | int | Random seed. | None |

Returns:

| Type | Description |
| --- | --- |
| OncResult | Labels, silhouette score, and optimal cluster count. |

cluster_kmeans_top

cluster_kmeans_top(corr, max_clusters=None, min_clusters=None, n_init=None, seed=None)

Top-level ONC (Optimal Number of Clusters) algorithm (López de Prado, Machine Learning for Asset Managers, Ch. 4).

Two-step approach: first clusters, then re-clusters to find the optimal grouping.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corr | ndarray | Correlation matrix. | required |
| max_clusters | int | Maximum k. | None |
| min_clusters | int | Minimum k. | None |
| n_init | int | Initializations per k. | None |
| seed | int | Random seed. | None |

Returns:

| Type | Description |
| --- | --- |
| OncResult | Labels, silhouette score, and optimal cluster count. |

compare_allocations

compare_allocations(returns, n_simulations, seed)

Monte Carlo comparison of HRP, CLA, and IVP allocation methods.

Simulates random correlation matrices and compares out-of-sample Sharpe ratios and variances.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| returns | ndarray | Return matrix (n_periods, n_assets). | required |
| n_simulations | int | Number of Monte Carlo trials. | required |
| seed | int | Random seed. | required |

Returns:

| Type | Description |
| --- | --- |
| AllocationComparison | Sharpe ratios and variances for each method. |

corr_to_cov

corr_to_cov(corr, std)

Convert a correlation matrix + standard deviations back to a covariance matrix.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corr | ndarray | Correlation matrix (n x n). | required |
| std | ndarray | Standard deviations (n,). | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Covariance matrix (n x n). |

corwin_schultz_spread

corwin_schultz_spread(highs, lows)

Corwin-Schultz spread estimator from high-low prices.

Estimates the bid-ask spread from consecutive high-low price pairs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| highs | ndarray | High price series. | required |
| lows | ndarray | Low price series. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Estimated spread series. |

cov_to_corr

cov_to_corr(cov)

Convert a covariance matrix to a correlation matrix + standard deviations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cov | ndarray | Covariance matrix (n x n). | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[ndarray, ndarray] | (correlation_matrix, std_devs). |
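
The two conversions, cov_to_corr and corr_to_cov, are inverses of each other. The sketch below shows the round trip on nested lists; the function names are hypothetical and the code illustrates the standard identities cov[i][j] = corr[i][j] * std[i] * std[j] rather than the library's own code.

```python
import math


def cov_to_corr_sketch(cov):
    """Split a covariance matrix into (correlation matrix, standard deviations)."""
    n = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(n)]
    corr = [[cov[i][j] / (std[i] * std[j]) for j in range(n)] for i in range(n)]
    return corr, std


def corr_to_cov_sketch(corr, std):
    """Rebuild the covariance matrix from correlations and standard deviations."""
    n = len(corr)
    return [[corr[i][j] * std[i] * std[j] for j in range(n)] for i in range(n)]


cov = [[0.04, 0.006], [0.006, 0.09]]
corr, std = cov_to_corr_sketch(cov)
cov_back = corr_to_cov_sketch(corr, std)
print(std, corr[0][1])  # standard deviations near 0.2 and 0.3, correlation near 0.1
```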

denoise_corr

denoise_corr(corr, q, bandwidth=None, shrinkage=False, alpha=None)

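Denoise a correlation matrix using Random Matrix Theory (López de Prado, Machine Learning for Asset Managers, Ch. 2).

Eigenvalues below the upper edge of the fitted Marcenko-Pastur distribution are treated as noise and, by default, replaced with their average; eigenvalues above that edge are kept as signal.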

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corr | ndarray | Empirical correlation matrix. | required |
| q | float | Ratio T/N. | required |
| bandwidth | float | KDE bandwidth for eigenvalue fitting. Auto-selected if None. | None |
| shrinkage | bool | Use shrinkage-based denoising instead of constant residual eigenvalue. | False |
| alpha | float | Shrinkage intensity (0 to 1). Only used if shrinkage=True. | None |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Denoised correlation matrix. |

denoise_cov

denoise_cov(cov, q, bandwidth=None)

Denoise a covariance matrix using RMT.

Converts to correlation, denoises, then converts back.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cov | ndarray | Empirical covariance matrix. | required |
| q | float | Ratio T/N. | required |
| bandwidth | float | KDE bandwidth. | None |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Denoised covariance matrix. |

dependence_matrix

dependence_matrix(data, method)

Compute a pairwise dependence matrix using the specified method.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | ndarray | Data matrix (n_observations, n_variables). | required |
| method | str | One of: "pearson", "spearman", "distance_correlation", "mutual_information", "variation_of_information". | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Symmetric dependence matrix (n_variables x n_variables). |

detone_corr

detone_corr(corr, n_components)

Remove the market component from a correlation matrix (detoning).

Subtracts the first n_components principal components to remove common factors (e.g. the market mode).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corr | ndarray | Correlation matrix. | required |
| n_components | int | Number of leading eigenvectors to remove (usually 1 for market mode). | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Detoned correlation matrix. |

distance_correlation

distance_correlation(x, y)

Distance correlation — a measure of dependence for non-linear relationships.

Unlike Pearson correlation, distance correlation is zero if and only if the variables are independent.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | ndarray | First variable. | required |
| y | ndarray | Second variable. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Distance correlation in [0, 1]. |

distance_matrix

distance_matrix(corr, metric)

Convert a correlation matrix to a distance matrix.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| corr | ndarray | Correlation matrix. | required |
| metric | str | One of: "angular", "absolute_angular", "squared_angular". | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Distance matrix (n x n). |
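
The three metric names can be illustrated with the correlation-based distances commonly used in López de Prado's work: sqrt(0.5 * (1 - rho)), sqrt(1 - |rho|), and sqrt(1 - rho^2). That mapping is an assumption about what this function computes, and `distance_matrix_sketch` is a hypothetical name.

```python
import math

# Assumed definitions for each metric name; not confirmed by this reference.
METRICS = {
    "angular": lambda rho: math.sqrt(max(0.0, 0.5 * (1.0 - rho))),
    "absolute_angular": lambda rho: math.sqrt(max(0.0, 1.0 - abs(rho))),
    "squared_angular": lambda rho: math.sqrt(max(0.0, 1.0 - rho * rho)),
}


def distance_matrix_sketch(corr, metric):
    """Apply the chosen correlation-to-distance map to every matrix entry."""
    transform = METRICS[metric]
    return [[transform(rho) for rho in row] for row in corr]


corr = [[1.0, -1.0], [-1.0, 1.0]]
angular = distance_matrix_sketch(corr, "angular")
absolute = distance_matrix_sketch(corr, "absolute_angular")
print(angular, absolute)
```

Under these definitions the plain angular distance treats a correlation of -1 as maximally distant, while the absolute and squared variants treat it as distance 0, which suits long-short applications where sign does not matter.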

entropy_implied_vol

entropy_implied_vol(entropy)

Implied volatility from Gaussian entropy.

Inverts the Gaussian entropy formula to recover the standard deviation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| entropy | float | Gaussian entropy value. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Implied volatility (standard deviation). |

fit_kde

fit_kde(observations, bandwidth, eval_points)

Kernel Density Estimation (KDE) for eigenvalue distribution fitting.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| observations | ndarray | Observed eigenvalues. | required |
| bandwidth | float | Gaussian kernel bandwidth. | required |
| eval_points | ndarray | Points at which to evaluate the KDE. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | KDE density values at the evaluation points. |

gaussian_entropy

gaussian_entropy(variance)

Gaussian entropy for a given variance.

H = 0.5 * log2(2 * pi * e * variance)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| variance | float | Variance of the Gaussian distribution. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Differential entropy in bits. |
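
The formula H = 0.5 * log2(2 * pi * e * variance) can be inverted for the standard deviation, which is what entropy_implied_vol is described as doing. A minimal sketch of both directions follows, assuming the entropy is expressed in bits in both functions; the names are hypothetical.

```python
import math


def gaussian_entropy_sketch(variance):
    """Differential entropy of a Gaussian, in bits."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * variance)


def entropy_implied_vol_sketch(entropy):
    """Solve H = 0.5 * log2(2*pi*e*sigma^2) for sigma."""
    return math.sqrt(2.0 ** (2.0 * entropy) / (2.0 * math.pi * math.e))


h = gaussian_entropy_sketch(0.04)      # variance 0.04, i.e. sigma = 0.2
sigma = entropy_implied_vol_sketch(h)  # recovers sigma up to rounding
print(h, sigma)
```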

get_feature_clusters

get_feature_clusters(data, max_clusters=None, seed=None)

Cluster features using ONC on their correlation structure.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | ndarray | Data matrix (n_samples, n_features). | required |
| max_clusters | int | Maximum number of feature clusters. | None |
| seed | int | Random seed. | None |

Returns:

| Type | Description |
| --- | --- |
| OncResult | Feature cluster labels and quality metrics. |

gnpr_distance

gnpr_distance(x, y, theta, n_bins=None)

GNPR (Generalized Non-Parametric Rank) distance.

Combines rank correlation with an information-theoretic component.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | ndarray | First variable. | required |
| y | ndarray | Second variable. | required |
| theta | float | Co-movement threshold. | required |
| n_bins | int | Number of bins for the information component. | None |

Returns:

| Type | Description |
| --- | --- |
| float | GNPR distance. |

gpr_distance

gpr_distance(x, y, theta)

GPR (Gerber-Podolskij-Reisenhofer) distance with threshold.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | ndarray | First variable. | required |
| y | ndarray | Second variable. | required |
| theta | float | Co-movement threshold. | required |

Returns:

| Type | Description |
| --- | --- |
| float | GPR distance. |

gsadf

gsadf(series, min_window, max_lags)

Generalized SADF (GSADF) test series (AFML Ch. 17).

Tests for explosive behavior using flexible start/end windows, providing higher power than SADF for detecting multiple bubbles.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| series | ndarray | Time series. | required |
| min_window | int | Minimum regression window. | required |
| max_lags | int | Maximum lags. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | GSADF statistic series. |

gsadf_stat

gsadf_stat(series, min_window, max_lags)

Generalized SADF scalar statistic.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| series | ndarray | Time series. | required |
| min_window | int | Minimum regression window. | required |
| max_lags | int | Maximum lags. | required |

Returns:

| Type | Description |
| --- | --- |
| float | GSADF test statistic. |

hasbrouck_lambda

hasbrouck_lambda(returns, trade_signs, n_iterations, seed)

Hasbrouck's lambda via Gibbs sampling (AFML Ch. 19).

Estimates permanent price impact accounting for trade sign uncertainty using a Bayesian approach.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| returns | ndarray | Return series. | required |
| trade_signs | ndarray | Signed trade indicators (+1 or -1). | required |
| n_iterations | int | Number of Gibbs sampling iterations. | required |
| seed | int | Random seed. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Hasbrouck lambda estimate. |

hrp_weights

hrp_weights(returns)

Hierarchical Risk Parity (HRP) portfolio weights (AFML Ch. 16).

Uses hierarchical clustering on the correlation matrix to build a diversified portfolio that is more stable than mean-variance optimization.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| returns | ndarray | Return matrix (n_periods, n_assets). | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Portfolio weights (n_assets,), sums to 1. |

inverse_variance_weights

inverse_variance_weights(cov)

Inverse Variance Portfolio (IVP) weights.

Weights each asset inversely proportional to its variance (diagonal of cov).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cov | ndarray | Covariance matrix (n x n). | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | IVP weights (n,), sums to 1. |
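
The weighting rule above uses only the diagonal of the covariance matrix, so off-diagonal entries have no effect. A minimal sketch under that reading (hypothetical function name):

```python
def inverse_variance_weights_sketch(cov):
    """Weights proportional to 1 / variance, normalized to sum to 1."""
    inv_var = [1.0 / cov[i][i] for i in range(len(cov))]
    total = sum(inv_var)
    return [w / total for w in inv_var]


cov = [[0.04, 0.01], [0.01, 0.16]]
weights = inverse_variance_weights_sketch(cov)
print(weights)  # the asset with a quarter of the variance gets four times the weight
```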

kmeans

kmeans(data, k, max_iter=300, n_init=10, seed=42)

K-means clustering with multiple initializations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | ndarray | Data matrix (n_samples, n_features). | required |
| k | int | Number of clusters. | required |
| max_iter | int | Maximum iterations per run. | 300 |
| n_init | int | Number of random initializations (best is kept). | 10 |
| seed | int | Random seed. | 42 |

Returns:

| Type | Description |
| --- | --- |
| KMeansResult | Cluster labels, centroids, and iteration count. |

kontoyiannis_entropy

kontoyiannis_entropy(sequence, window)

Kontoyiannis entropy estimator using longest-match lengths (AFML Ch. 18).

A non-parametric entropy estimator based on how far back one must look to find a match for each substring.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sequence | list[int] | Discrete symbol sequence. | required |
| window | int | Maximum look-back window. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Estimated entropy rate. |

kyle_lambda

kyle_lambda(returns, signed_volume)

Kyle's lambda — price impact from signed order flow (AFML Ch. 19).

Regresses returns on signed volume to estimate the permanent price impact of trades.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| returns | ndarray | Return series. | required |
| signed_volume | ndarray | Net signed volume (buy - sell). | required |

Returns:

| Type | Description |
| --- | --- |
| float | Kyle lambda coefficient. |
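
The regression described above reduces to an ordinary least squares slope. The sketch below assumes a regression with an intercept, so the slope is cov(signed_volume, returns) / var(signed_volume); whether the library includes an intercept is not stated here, and the function name is hypothetical.

```python
def kyle_lambda_sketch(returns, signed_volume):
    """OLS slope of returns on signed volume (regression with intercept, assumed)."""
    n = len(returns)
    mean_v = sum(signed_volume) / n
    mean_r = sum(returns) / n
    cov_vr = sum((v - mean_v) * (r - mean_r) for v, r in zip(signed_volume, returns))
    var_v = sum((v - mean_v) ** 2 for v in signed_volume)
    return cov_vr / var_v


signed_volume = [100.0, -50.0, 200.0, -150.0]
# Synthetic returns that move exactly 2e-5 per unit of signed volume.
returns = [2e-5 * v for v in signed_volume]
slope = kyle_lambda_sketch(returns, signed_volume)
print(slope)
```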

lempel_ziv_complexity

lempel_ziv_complexity(binary_string)

Lempel-Ziv complexity of a binary string (AFML Ch. 18).

Counts the number of distinct substrings encountered during a sequential parse — a measure of randomness/compressibility.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| binary_string | list[bool] | Binary sequence. | required |

Returns:

| Type | Description |
| --- | --- |
| int | Number of distinct patterns (Lempel-Ziv complexity). |
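
To make "distinct substrings encountered during a sequential parse" concrete, here is a sketch of one common variant, an LZ78-style parse that cuts a new phrase whenever the current phrase has not been seen before. The exact parsing rule this library uses is not stated here, so its counts may differ from this sketch; the function name is hypothetical.

```python
def lempel_ziv_complexity_sketch(bits):
    """Count phrases in an LZ78-style sequential parse of a binary sequence."""
    seen = set()
    phrase = ()
    for b in bits:
        phrase += (bool(b),)
        if phrase not in seen:
            # First time this phrase appears: record it and start a new one.
            seen.add(phrase)
            phrase = ()
    # A trailing phrase that was already seen adds no new pattern.
    return len(seen)


mixed = lempel_ziv_complexity_sketch([False, True, False, True, True, False])
constant = lempel_ziv_complexity_sketch([False] * 6)
print(mixed, constant)  # the mixed sequence parses into more phrases than the constant one
```

The mixed sequence parses as 0 | 1 | 01 | 10 (four phrases) and the constant one as 0 | 00 | 000 (three phrases), which shows how repetitive input yields lower complexity.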

marcenko_pastur_pdf

marcenko_pastur_pdf(var, q, pts)

Marcenko-Pastur probability density function.

Theoretical distribution of eigenvalues for a random correlation matrix with ratio q = T/N.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| var | float | Variance of the random matrix entries. | required |
| q | float | Ratio T/N (observations / variables). | required |
| pts | int | Number of evaluation points. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple[ndarray, ndarray] | (x_values, pdf_values). |
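
For reference, the Marcenko-Pastur density with variance var and ratio q = T/N is supported on [var * (1 - sqrt(1/q))^2, var * (1 + sqrt(1/q))^2] and equals q / (2 * pi * var * x) * sqrt((x_max - x) * (x - x_min)) inside it. The sketch below evaluates that formula on an evenly spaced grid; the grid choice and the function name are assumptions, not the library's implementation.

```python
import math


def marcenko_pastur_pdf_sketch(var, q, pts):
    """Evaluate the Marcenko-Pastur density on pts evenly spaced points (pts >= 2)."""
    lam_min = var * (1.0 - math.sqrt(1.0 / q)) ** 2
    lam_max = var * (1.0 + math.sqrt(1.0 / q)) ** 2
    step = (lam_max - lam_min) / (pts - 1)
    xs = [lam_min + i * step for i in range(pts)]
    pdf = []
    for x in xs:
        # Clamp at zero so rounding at the support edges cannot go negative.
        inside = max(0.0, (lam_max - x) * (x - lam_min))
        if x > 0:
            pdf.append(q / (2.0 * math.pi * var * x) * math.sqrt(inside))
        else:
            pdf.append(0.0)
    return xs, pdf


xs, pdf = marcenko_pastur_pdf_sketch(var=1.0, q=10.0, pts=1000)
# Trapezoidal integral of the density; it should be close to 1 when q >= 1.
area = sum((xs[i + 1] - xs[i]) * (pdf[i] + pdf[i + 1]) / 2.0 for i in range(len(xs) - 1))
print(xs[0], xs[-1], area)
```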

mutual_information

mutual_information(x, y, n_bins=None, normalize=False)

Mutual information between two continuous variables.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | ndarray | First variable. | required |
| y | ndarray | Second variable. | required |
| n_bins | int | Number of histogram bins. Auto-selected if None. | None |
| normalize | bool | If True, normalize to [0, 1] range. | False |

Returns:

| Type | Description |
| --- | --- |
| float | Mutual information (non-negative). |

optimal_portfolio

optimal_portfolio(cov, mu=None)

Optimal portfolio weights from a (denoised) covariance matrix.

Computes the minimum-variance or max-Sharpe portfolio.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| cov | ndarray | Covariance matrix. | required |
| mu | ndarray | Expected returns. If None, computes minimum-variance portfolio. | None |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Portfolio weights (sums to 1). |

optimal_transport_dependence

optimal_transport_dependence(x, y)

Optimal transport dependence measure.

Based on the Wasserstein distance between joint and product marginal distributions.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | ndarray | First variable. | required |
| y | ndarray | Second variable. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Optimal transport dependence (non-negative). |

plugin_entropy

plugin_entropy(sequence, num_symbols)

Plug-in (maximum likelihood) entropy estimator.

Estimates entropy from a discrete symbol sequence using empirical frequencies.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sequence | list[int] | Discrete symbol sequence. | required |
| num_symbols | int | Number of distinct symbols in the alphabet. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Estimated entropy. |
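
A plug-in estimate replaces the true symbol probabilities with empirical frequencies and evaluates the Shannon formula. The sketch below assumes base-2 logarithms (bits); the units used by this function are not stated above. It omits the num_symbols argument because symbols that never occur contribute zero to the sum. The name is hypothetical.

```python
import math
from collections import Counter


def plugin_entropy_sketch(sequence):
    """Shannon entropy (bits, assumed) of the empirical symbol frequencies."""
    n = len(sequence)
    counts = Counter(sequence)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


balanced = plugin_entropy_sketch([0, 1, 0, 1])   # two equally likely symbols
constant = plugin_entropy_sketch([0, 0, 0, 0])   # no uncertainty at all
uniform4 = plugin_entropy_sketch([0, 1, 2, 3])   # four equally likely symbols
print(balanced, constant, uniform4)
```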

quantile_encode

quantile_encode(values, num_bins)

Quantile-based discretization of a continuous series.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| values | ndarray | Input series. | required |
| num_bins | int | Number of quantile bins. | required |

Returns:

| Type | Description |
| --- | --- |
| list[int] | Bin index (0 to num_bins-1) for each value. |

roll_spread

roll_spread(prices)

Roll model bid-ask spread estimator.

Estimates the effective spread from the autocovariance of price changes.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prices | ndarray | Price series. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Estimated bid-ask spread. |
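
In the Roll model the first-order autocovariance of price changes is negative when trades bounce between bid and ask, and the spread is estimated as 2 * sqrt(max(0, -autocovariance)). The sketch below follows that textbook form. It assumes raw price differences (not log prices) and a population-style autocovariance, either of which may differ from this library; the function name is hypothetical.

```python
import math


def roll_spread_sketch(prices):
    """Roll spread: 2 * sqrt(max(0, -cov(dp_t, dp_{t-1})))."""
    dp = [b - a for a, b in zip(prices, prices[1:])]
    cur, prev = dp[1:], dp[:-1]
    n = len(cur)
    mean_cur, mean_prev = sum(cur) / n, sum(prev) / n
    autocov = sum((a - mean_cur) * (b - mean_prev) for a, b in zip(cur, prev)) / n
    # A positive autocovariance (trending prices) gives a spread of zero.
    return 2.0 * math.sqrt(max(0.0, -autocov))


bouncing = roll_spread_sketch([10.0, 11.0, 10.0, 11.0, 10.0])
trending = roll_spread_sketch([1.0, 2.0, 4.0, 7.0, 11.0])
print(bouncing, trending)
```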

roll_spread_rolling

roll_spread_rolling(prices, window)

Rolling Roll model spread estimate.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prices | ndarray | Price series. | required |
| window | int | Rolling window size. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Rolling spread estimates. |

sadf

sadf(series, min_window, max_lags)

Supremum Augmented Dickey-Fuller (SADF) test series (AFML Ch. 17).

Computes a sequence of ADF statistics with expanding windows starting from min_window.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| series | ndarray | Time series (e.g. log prices). | required |
| min_window | int | Minimum regression window. | required |
| max_lags | int | Maximum lags per ADF regression. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | SADF statistic series. |

sadf_stat

sadf_stat(series, min_window, max_lags)

Supremum ADF scalar statistic — the maximum of the SADF series.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| series | ndarray | Time series. | required |
| min_window | int | Minimum regression window. | required |
| max_lags | int | Maximum lags. | required |

Returns:

| Type | Description |
| --- | --- |
| float | SADF test statistic. |

shannon_entropy

shannon_entropy(probs)

Shannon entropy from a probability distribution.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| probs | ndarray | Probability vector (should sum to 1). | required |

Returns:

| Type | Description |
| --- | --- |
| float | Shannon entropy in bits (log base 2). |

sigma_encode

sigma_encode(values, num_bands)

Sigma-based encoding using standard deviation bands.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| values | ndarray | Input series. | required |
| num_bands | int | Number of sigma bands on each side of the mean. | required |

Returns:

| Type | Description |
| --- | --- |
| list[int] | Band index for each value. |

silhouette_score

silhouette_score(data, labels)

Silhouette score measuring clustering quality.

Ranges from -1 (poor) to +1 (excellent).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | ndarray | Data matrix (n_samples, n_features). | required |
| labels | list[int] | Cluster labels for each sample. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Mean silhouette score. |

sm_exp

sm_exp(series)

Sub/super-martingale test with exponential kernel.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| series | ndarray | Time series. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Test statistic. |

sm_poly

sm_poly(series, degree)

Sub/super-martingale test with polynomial kernel.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| series | ndarray | Time series. | required |
| degree | int | Polynomial degree for the test. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Test statistic. |

sm_power

sm_power(series, power)

Sub/super-martingale test with power kernel.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| series | ndarray | Time series. | required |
| power | float | Power exponent for the kernel. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Test statistic. |

spearmans_rho

spearmans_rho(x, y)

Spearman's rank correlation coefficient.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | ndarray | First variable. | required |
| y | ndarray | Second variable. | required |

Returns:

| Type | Description |
| --- | --- |
| float | Spearman's rho in [-1, 1]. |
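
Spearman's rho is the Pearson correlation of the ranks, so it equals 1 for any strictly increasing relationship, linear or not. The sketch below uses average ranks for ties, which is the usual convention but an assumption about this library; the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearmans_rho_sketch(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
cubes = [v ** 3 for v in x]                      # monotone but non-linear
rho_up = spearmans_rho_sketch(x, cubes)
rho_down = spearmans_rho_sketch(x, cubes[::-1])
print(rho_up, rho_down)
```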

tick_rule_classify

tick_rule_classify(prices)

Classify trades using the tick rule.

Assigns +1 (uptick), -1 (downtick), or 0 (no change) to each trade.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prices | ndarray | Trade price series. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | Trade sign series (+1, -1, or 0). |
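
The classification rule above can be sketched as follows. Two details are assumptions rather than documented behavior: the output has the same length as the input, and the first trade, which has no previous price, is given sign 0. The function name is hypothetical.

```python
def tick_rule_classify_sketch(prices):
    """+1 on an uptick, -1 on a downtick, 0 when the price is unchanged."""
    if not prices:
        return []
    signs = [0]  # assumption: the first trade has no prior price to compare with
    for prev, cur in zip(prices, prices[1:]):
        if cur > prev:
            signs.append(1)
        elif cur < prev:
            signs.append(-1)
        else:
            signs.append(0)
    return signs


signs = tick_rule_classify_sketch([100, 101, 101, 99])
print(signs)
```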

variation_of_information

variation_of_information(x, y, n_bins=None, normalize=False)

Variation of information — a metric-space distance based on entropy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | ndarray | First variable. | required |
| y | ndarray | Second variable. | required |
| n_bins | int | Number of histogram bins. | None |
| normalize | bool | If True, normalize to [0, 1] range. | False |

Returns:

| Type | Description |
| --- | --- |
| float | Variation of information (non-negative). |
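
Variation of information is defined as VI(X, Y) = H(X) + H(Y) - 2 I(X; Y), where I is the mutual information. The sketch below computes it for inputs that are already discretized, whereas the function above bins continuous data first. It uses base-2 logarithms, which is an assumption about units, and the names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(counts, n):
    """Shannon entropy in bits from a Counter of occurrences."""
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def variation_of_information_sketch(a, b):
    """VI = H(A) + H(B) - 2 * I(A; B) for two discrete label sequences."""
    n = len(a)
    h_a = entropy_bits(Counter(a), n)
    h_b = entropy_bits(Counter(b), n)
    h_ab = entropy_bits(Counter(zip(a, b)), n)  # joint entropy
    mutual_info = h_a + h_b - h_ab
    return h_a + h_b - 2.0 * mutual_info


identical = variation_of_information_sketch([0, 1, 0, 1], [0, 1, 0, 1])
independent = variation_of_information_sketch([0, 0, 1, 1], [0, 1, 0, 1])
print(identical, independent)
```

Identical sequences are at distance 0, while two independent fair binary sequences are at distance H(X) + H(Y) = 2 bits.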

vpin

vpin(volumes, prices, bucket_size, n_buckets)

Volume-Synchronized Probability of Informed Trading (VPIN).

Estimates the probability of informed trading from volume-bucketed data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| volumes | ndarray | Volume series. | required |
| prices | ndarray | Price series. | required |
| bucket_size | float | Volume per bucket. | required |
| n_buckets | int | Number of buckets for the rolling VPIN estimate. | required |

Returns:

| Type | Description |
| --- | --- |
| ndarray | VPIN estimates at each bucket boundary. |