Utilities

Overview

The fdfi.utils module provides helper functions and classes used across the FDFI package. The symbols most useful to end users are TwoComponentMixture (for understanding variance-floor and margin estimation), compute_latent_independence(), and compute_mmd(). The remaining helpers are used internally by the explainer classes.

Statistical Utilities

TwoComponentMixture

class fdfi.utils.TwoComponentMixture(n_components=2, random_state=0, min_samples=10)

Bases: object

Two-component Gaussian mixture model for quantile estimation.

Fits a two-Gaussian mixture to a 1-D array of non-negative values and exposes component-specific quantile queries. Used internally by conf_int() for both variance-floor estimation (from raw standard errors) and practical-significance margin estimation (from point estimates).

When the number of samples is below min_samples, a robust fallback (median + MAD) is used instead of the full EM algorithm.
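How the fallback maps these statistics onto the two components is not specified here. As a rough sketch of the underlying idea only, the median gives a robust location estimate and the MAD, rescaled by 1.4826, gives a robust estimate of the standard deviation under normality. The helper robust_location_scale is hypothetical and not part of fdfi, and the input values are made up.

```python
import statistics


def robust_location_scale(values):
    """Return (median, scaled MAD) for a sequence of numbers.

    Hypothetical helper, not part of fdfi: it only illustrates the
    median + MAD idea behind the small-sample fallback.
    """
    values = [float(v) for v in values]
    center = statistics.median(values)
    # Median absolute deviation around the median.
    mad = statistics.median(abs(v - center) for v in values)
    # The factor 1.4826 makes the MAD consistent with the standard
    # deviation of a normal distribution.
    return center, 1.4826 * mad


# Five values, fewer than the default min_samples=10; one is an outlier.
center, scale = robust_location_scale([0.08, 0.10, 0.11, 0.12, 0.90])
print(center, scale)
```

Neither estimate is pulled far by the outlier at 0.90, which is why statistics of this kind suit very small samples.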

Parameters:

n_components (int, default=2) – Number of Gaussian components. Fixed at 2 in the current design.
random_state (int, default=0) – Random seed forwarded to sklearn.mixture.GaussianMixture.
min_samples (int, default=10) – Minimum number of samples required to attempt GMM fitting; below this threshold the robust fallback is used.

means_

Component means after fitting.

Type: np.ndarray of shape (2,)

stds_

Component standard deviations after fitting.

Type: np.ndarray of shape (2,)

weights_

Component mixing weights after fitting.

Type: np.ndarray of shape (2,)

gmm_

Fitted sklearn GMM object; None when the robust fallback was used.

Type: GaussianMixture or None

method_used_

Either 'gmm' or 'robust', indicating which fitting path ran.

Type: str

Examples

>>> import numpy as np
>>> from fdfi.utils import TwoComponentMixture
>>> rng = np.random.default_rng(0)
>>> se = np.abs(rng.normal(0.1, 0.05, size=100))
>>> mix = TwoComponentMixture().fit(se)
>>> floor = mix.quantile(0.95, component="smaller")
>>> print(f"variance floor: {floor:.4f}")

n_components: int = 2

random_state: int = 0

min_samples: int = 10

means_: ndarray = None

stds_: ndarray = None

weights_: ndarray = None

gmm_: object = None

method_used_: str = None

fit(x)

Fit the two-component mixture to data.

Parameters: x (np.ndarray) – 1-D array of values to fit. Multi-dimensional arrays are flattened automatically.
Returns: self – The fitted instance (enables method chaining).
Return type: TwoComponentMixture

quantile(q, component='larger')

Return the q-th quantile of one mixture component.

Parameters:

q (float) – Quantile level in (0, 1).
component ({'larger', 'smaller'}, default='larger') – Which component to use. 'larger' selects the component with the higher mean (typically the signal component); 'smaller' selects the component with the lower mean (noise / floor).

Returns:

The requested quantile value.

Return type:

float

Raises:

ValueError – If fit() has not been called yet.
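Assuming each component is treated as a plain Gaussian, a component quantile reduces to the normal inverse CDF evaluated at that component's mean and standard deviation. The sketch below illustrates this with the standard library only. component_quantile is a hypothetical helper, not the fdfi implementation, and the parameter values are made up.

```python
from statistics import NormalDist


def component_quantile(means, stds, q, component="larger"):
    """Quantile of one Gaussian component, selected by its mean.

    Hypothetical helper: assumes the quantile is the ordinary normal
    quantile of the chosen component.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if component not in ("larger", "smaller"):
        raise ValueError("component must be 'larger' or 'smaller'")
    # 'larger' picks the component with the higher mean, 'smaller' the lower.
    pick = max if component == "larger" else min
    idx = pick(range(len(means)), key=lambda i: means[i])
    return NormalDist(means[idx], stds[idx]).inv_cdf(q)


# Made-up fitted parameters: a noise component and a signal component.
means, stds = [0.10, 0.50], [0.02, 0.10]
floor = component_quantile(means, stds, 0.95, component="smaller")
signal_median = component_quantile(means, stds, 0.50, component="larger")
print(floor, signal_median)
```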

plot(x, ax=None, **kwargs)

Plot a histogram of the data overlaid with the fitted component PDFs.

Parameters:

x (np.ndarray) – Original data values (used for the histogram).
ax (matplotlib.axes.Axes, optional) – Axes to draw on. A new figure is created when None.
**kwargs – Additional keyword arguments forwarded to matplotlib.axes.Axes.hist().

Returns:

ax – The axes containing the plot.

Return type:

matplotlib.axes.Axes

__init__(n_components=2, random_state=0, min_samples=10)

Parameters:

n_components (int)
random_state (int)
min_samples (int)

Return type:

None

Latent Independence (dCor)

fdfi.utils.compute_latent_independence(Z, subset_size=None)

Compute pairwise distance correlation (dCor) between latent dimensions.

Lower off-diagonal values indicate greater independence of latent factors.

Parameters:

Z (np.ndarray) – Latent representations. Shape (n_samples, n_latent_dims).
subset_size (int, optional) – If provided and n_samples > subset_size, randomly subsample for efficiency.

Returns:

dcor_matrix (np.ndarray) – Pairwise distance correlation matrix. Shape (d, d).
median_dcor (float) – Median of off-diagonal entries as a single independence score.

Return type:

Tuple[ndarray, float]
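For readers unfamiliar with distance correlation, the following sketch shows one common way to compute it: the biased (V-statistic) sample estimator built from double-centred pairwise distance matrices. Whether fdfi uses this exact estimator is an assumption. The names distance_correlation and pairwise_dcor are hypothetical, and subsampling is omitted.

```python
import numpy as np


def _centered_distances(v):
    """Double-centred matrix of pairwise absolute differences."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    D = np.abs(v - v.T)
    return (D - D.mean(axis=0, keepdims=True)
              - D.mean(axis=1, keepdims=True) + D.mean())


def distance_correlation(a, b):
    """Biased sample distance correlation between two 1-D arrays."""
    A, B = _centered_distances(a), _centered_distances(b)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0.0:
        return 0.0  # at least one input is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def pairwise_dcor(Z):
    """Return the (d, d) dCor matrix and the median off-diagonal entry."""
    Z = np.asarray(Z, dtype=float)
    d = Z.shape[1]
    M = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            M[i, j] = M[j, i] = distance_correlation(Z[:, i], Z[:, j])
    off_diagonal = M[~np.eye(d, dtype=bool)]
    return M, float(np.median(off_diagonal))


rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))
Z[:, 2] = Z[:, 0] ** 2  # non-linear dependence on the first dimension
dcor_matrix, median_dcor = pairwise_dcor(Z)
print(np.round(dcor_matrix, 3), median_dcor)
```

Unlike Pearson correlation, distance correlation can detect non-linear dependence such as the quadratic relationship between the first and third columns here.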

Maximum Mean Discrepancy

fdfi.utils.compute_mmd(X_real, X_generated, sigma=1.0, subset_size=None)

Compute Maximum Mean Discrepancy (MMD) with a Gaussian RBF kernel.

Measures distributional distance between real and generated data. Lower values indicate better fidelity.

Parameters:

X_real (np.ndarray) – Real data. Shape (n_real, n_features).
X_generated (np.ndarray) – Generated or reconstructed data. Shape (n_gen, n_features).
sigma (float, default=1.0) – Bandwidth for the Gaussian kernel.
subset_size (int, optional) – If provided, subsample each dataset to this size for efficiency.

Returns:

Non-negative MMD score.

Return type:

float
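As an illustration of the quantity involved, the sketch below computes the biased estimate of squared MMD with the kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)). The kernel parameterisation, the choice of the biased estimator, and returning the squared value rather than its square root are all assumptions. mmd2_rbf is a hypothetical name, and subsampling is omitted.

```python
import numpy as np


def _rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix between the rows of A and B."""
    sq_dists = ((A ** 2).sum(axis=1)[:, None]
                + (B ** 2).sum(axis=1)[None, :]
                - 2.0 * A @ B.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def mmd2_rbf(X, Y, sigma=1.0):
    """Biased estimate of squared MMD, clipped at zero."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    value = (_rbf_kernel(X, X, sigma).mean()
             + _rbf_kernel(Y, Y, sigma).mean()
             - 2.0 * _rbf_kernel(X, Y, sigma).mean())
    return float(max(value, 0.0))


rng = np.random.default_rng(0)
X_real = rng.normal(size=(100, 2))
X_close = X_real + rng.normal(scale=0.01, size=X_real.shape)
X_far = X_real + 5.0
print(mmd2_rbf(X_real, X_close), mmd2_rbf(X_real, X_far))
```

The score depends strongly on sigma, so values are only comparable when computed with the same bandwidth.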

Feature Type Detection

fdfi.utils.detect_feature_types(X, categorical_threshold=10, feature_types=None)

Auto-detect feature types from data.

Classifies each column of X as 'binary', 'categorical', or 'continuous' based on the number of unique values and whether they are integer-like.

Parameters:

X (np.ndarray of shape (n_samples, n_features)) – Input feature matrix.
categorical_threshold (int, default=10) – Features with at most this many unique values are considered categorical (provided they are integer-like); features with exactly 2 unique values are always classified as binary.
feature_types (np.ndarray of shape (n_features,), optional) – Pre-specified type labels ('binary', 'categorical', 'continuous') for each feature. When provided, auto-detection is skipped and this array is used directly.

Returns:

Dictionary with the following keys:

'binary' (list of int): Column indices of binary features.
'categorical' (list of int): Column indices of categorical features.
'continuous' (list of int): Column indices of continuous features.
'types' (np.ndarray of shape (n_features,)): Per-feature type label strings.
'ranges' (np.ndarray of shape (n_features,)): Per-feature value range (used for Gower distance normalisation); always 1.0 for binary/categorical features.

Return type:

dict

Raises:

ValueError – If feature_types is provided but its length does not match the number of features.
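The classification rule described above can be sketched as follows. This is an illustrative reading of the documented behaviour, not the fdfi source. classify_columns is a hypothetical name, the range of a continuous feature is assumed to be max minus min, and missing values and the feature_types override are not handled.

```python
import numpy as np


def classify_columns(X, categorical_threshold=10):
    """Label each column as 'binary', 'categorical', or 'continuous'."""
    X = np.asarray(X, dtype=float)
    types, ranges = [], []
    for column in X.T:
        unique = np.unique(column)
        integer_like = bool(np.all(unique == np.round(unique)))
        if unique.size == 2:
            kind = "binary"  # exactly two values: always binary
        elif unique.size <= categorical_threshold and integer_like:
            kind = "categorical"
        else:
            kind = "continuous"
        types.append(kind)
        # Discrete features get a unit range; continuous ones max - min.
        spread = float(column.max() - column.min())
        ranges.append(spread if kind == "continuous" else 1.0)
    types = np.array(types)
    return {
        "binary": np.flatnonzero(types == "binary").tolist(),
        "categorical": np.flatnonzero(types == "categorical").tolist(),
        "continuous": np.flatnonzero(types == "continuous").tolist(),
        "types": types,
        "ranges": np.array(ranges),
    }


X = np.array([
    [0, 1, 0.5],
    [1, 2, 1.7],
    [0, 3, 2.9],
    [1, 1, 4.1],
])
info = classify_columns(X)
print(info["types"], info["ranges"])
```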

Gower Distance

fdfi.utils.gower_cost_matrix(X, Z, feature_types, feature_ranges, feature_weights=None)

Compute the Gower distance matrix between two sets of mixed-type samples.

Gower distance handles a mix of continuous, binary, and categorical features by normalising continuous differences by the feature range and using 0/1 indicator differences for discrete features.

Parameters:

X (np.ndarray of shape (n, d)) – First set of samples.
Z (np.ndarray of shape (m, d)) – Second set of samples.
feature_types (np.ndarray of shape (d,)) – Per-feature type labels; each element must be 'binary', 'categorical', or 'continuous'.
feature_ranges (np.ndarray of shape (d,)) – Per-feature value ranges for continuous-feature normalisation. Typically obtained from detect_feature_types().
feature_weights (np.ndarray of shape (d,), optional) – Non-negative weights for each feature. Normalised to sum to 1 internally. Defaults to uniform weights.

Returns:

C – Pairwise Gower distance matrix. Values are in [0, 1].

Return type:

np.ndarray of shape (n, m)

Used by EOTExplainer when cost_metric="gower" or cost_metric="auto".
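A minimal sketch of the Gower computation described above, for illustration only. gower_sketch is a hypothetical name. Clipping the scaled continuous differences to [0, 1] and treating zero ranges as 1 are assumptions made here to keep the output within the documented [0, 1] interval.

```python
import numpy as np


def gower_sketch(X, Z, feature_types, feature_ranges, feature_weights=None):
    """Pairwise Gower distances between the rows of X and the rows of Z."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    feature_types = np.asarray(feature_types)
    ranges = np.asarray(feature_ranges, dtype=float)
    d = X.shape[1]
    weights = (np.ones(d) if feature_weights is None
               else np.asarray(feature_weights, dtype=float))
    weights = weights / weights.sum()  # normalise to sum to 1

    # Absolute differences for every (row of X, row of Z, feature) triple.
    diff = np.abs(X[:, None, :] - Z[None, :, :])
    safe_ranges = np.where(ranges > 0, ranges, 1.0)
    continuous_part = np.clip(diff / safe_ranges, 0.0, 1.0)
    discrete_part = (diff > 0).astype(float)  # 0/1 mismatch indicator
    is_continuous = feature_types == "continuous"
    per_feature = np.where(is_continuous, continuous_part, discrete_part)
    return per_feature @ weights  # weighted average over features


X = np.array([[0, 1, 0.5], [1, 2, 2.5]])
Z = np.array([[0, 1, 0.5]])
types = np.array(["binary", "categorical", "continuous"])
ranges = np.array([1.0, 1.0, 4.0])
C = gower_sketch(X, Z, types, ranges)
print(C)
```

The second row of X differs from the single row of Z in both discrete features and by half the continuous range, so its distance is (1 + 1 + 0.5) / 3.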

Internal Helpers

The following functions are used internally by the explainer classes. They are documented here for completeness but are not part of the stable public API.

fdfi.utils.validate_input(X)

Validate and convert input to numpy array.

Parameters: X (array-like) – Input data to validate.
Returns: Validated numpy array.
Return type: numpy.ndarray
Raises: ValueError – If input cannot be converted to a valid numpy array.

fdfi.utils.sample_background(data, n_samples, random_state=None)

Sample background data for explanations.

Parameters:

data (numpy.ndarray) – Full dataset to sample from.
n_samples (int) – Number of samples to draw.
random_state (int, optional) – Random seed for reproducibility.

Returns:

Sampled background data.

Return type:

numpy.ndarray

fdfi.utils.get_feature_names(data, feature_names=None)

Get or generate feature names.

Parameters:

data (array-like) – Data to get feature names for.
feature_names (list, optional) – User-provided feature names.

Returns:

Feature names.

Return type:

list

fdfi.utils.convert_to_link(predictions, link='identity')

Convert predictions using a link function.

Parameters:

predictions (numpy.ndarray) – Model predictions.
link (str, default="identity") – Link function to use. Options: "identity", "logit".

Returns:

Transformed predictions.

Return type:

numpy.ndarray

Raises:

ValueError – If link function is not recognized.

The following link functions are supported:

"identity": No transformation (default)
"logit": Logit transformation for probability outputs
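The two link functions can be sketched as below. apply_link is a hypothetical stand-in rather than the fdfi implementation, and clipping probabilities away from 0 and 1 before taking the logit is an assumption added here to avoid infinite values.

```python
import numpy as np


def apply_link(predictions, link="identity"):
    """Transform predictions with the named link function."""
    predictions = np.asarray(predictions, dtype=float)
    if link == "identity":
        return predictions
    if link == "logit":
        eps = 1e-12  # assumed clipping constant
        p = np.clip(predictions, eps, 1.0 - eps)
        return np.log(p / (1.0 - p))  # log-odds
    raise ValueError(f"Unrecognized link function: {link!r}")


probabilities = np.array([0.1, 0.5, 0.9])
print(apply_link(probabilities, "logit"))
```

The logit link maps probabilities onto the log-odds scale, so a probability of 0.5 maps to 0 and complementary probabilities map to values of opposite sign.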

fdfi.utils.check_additivity(shap_values, predictions, base_value, tol=0.001)

Check if SHAP values satisfy the additivity property.

The additivity property states that the sum of SHAP values plus the base value should equal the prediction.

Parameters:

shap_values (numpy.ndarray) – Feature importance values. Shape (n_samples, n_features).
predictions (numpy.ndarray) – Model predictions. Shape (n_samples,).
base_value (float) – Base value (expected output).
tol (float, default=1e-3) – Tolerance for checking equality.

Returns:

bool – Whether additivity is satisfied.
float – Maximum absolute difference.

Return type:

Tuple[bool, float]
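The additivity check amounts to comparing each row sum plus the base value against the corresponding prediction. The sketch below is illustrative only. additivity_gap is a hypothetical name, and treating a gap exactly equal to tol as passing is an assumption.

```python
import numpy as np


def additivity_gap(shap_values, predictions, base_value, tol=1e-3):
    """Return (is_additive, max_abs_difference)."""
    shap_values = np.asarray(shap_values, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # Reconstruct each prediction from its attributions.
    reconstructed = shap_values.sum(axis=1) + base_value
    max_diff = float(np.max(np.abs(reconstructed - predictions)))
    return max_diff <= tol, max_diff


shap_values = np.array([[0.2, 0.3], [-0.1, 0.4]])
predictions = np.array([1.0, 0.8])
ok, gap = additivity_gap(shap_values, predictions, base_value=0.5)
print(ok, gap)
```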