Utilities
Overview
The fdfi.utils module provides helper functions and classes used across
the FDFI package.
Input Validation
- fdfi.utils.validate_input(X)
Validate and convert input to numpy array.
- Parameters:
X (array-like) – Input data to validate.
- Returns:
Validated numpy array.
- Return type:
numpy.ndarray
- Raises:
ValueError – If input cannot be converted to a valid numpy array.
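The behavior described above can be illustrated with a minimal sketch. The function name validate_input_sketch is hypothetical and is not part of fdfi; the float conversion and the rejection of zero-dimensional input are assumptions, since the checks fdfi performs are not documented here.

```python
import numpy as np


def validate_input_sketch(X):
    """Hypothetical stand-in for fdfi.utils.validate_input."""
    try:
        # Assumption: inputs are coerced to a floating-point array.
        arr = np.asarray(X, dtype=float)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Input cannot be converted to a numeric array: {exc}") from exc
    # Assumption: scalars (including None, which becomes NaN) are rejected.
    if arr.ndim == 0:
        raise ValueError("Input must be at least one-dimensional.")
    return arr


X = validate_input_sketch([[1, 2], [3, 4]])
print(X.shape)  # (2, 2)
```

Ragged input such as [[1, 2], [3]] raises ValueError in this sketch because NumPy cannot build a rectangular float array from it.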
Data Sampling
- fdfi.utils.sample_background(data, n_samples, random_state=None)
Sample background data for explanations.
- Parameters:
data (numpy.ndarray) – Full dataset to sample from.
n_samples (int) – Number of samples to draw.
random_state (int, optional) – Random seed for reproducibility.
- Returns:
Sampled background data.
- Return type:
numpy.ndarray
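As an illustration of seeded background sampling, here is a minimal sketch. The name sample_background_sketch is hypothetical; sampling rows without replacement and capping n_samples at the dataset size are assumptions, not confirmed fdfi behavior.

```python
import numpy as np


def sample_background_sketch(data, n_samples, random_state=None):
    """Hypothetical stand-in: draw rows of `data` without replacement."""
    data = np.asarray(data)
    rng = np.random.default_rng(random_state)
    # Assumption: never request more rows than the dataset contains.
    n = min(n_samples, data.shape[0])
    idx = rng.choice(data.shape[0], size=n, replace=False)
    return data[idx]


data = np.arange(20).reshape(10, 2)
bg_a = sample_background_sketch(data, n_samples=4, random_state=0)
bg_b = sample_background_sketch(data, n_samples=4, random_state=0)
print(bg_a.shape)                  # (4, 2)
print(np.array_equal(bg_a, bg_b))  # True: same seed, same sample
```

Passing the same random_state twice yields the same rows, which is the reproducibility the parameter is meant to provide.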
Feature Names
Link Functions
- fdfi.utils.convert_to_link(predictions, link='identity')
Convert predictions using a link function.
- Parameters:
predictions (numpy.ndarray) – Model predictions.
link (str, default="identity") – Link function to use. Options: “identity”, “logit”.
- Returns:
Transformed predictions.
- Return type:
numpy.ndarray
- Raises:
ValueError – If link function is not recognized.
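The two link options can be sketched as follows. The identity link returns predictions unchanged, and the logit link maps a probability p to log(p / (1 - p)). The name convert_to_link_sketch and the eps clipping parameter are hypothetical additions for numerical safety; whether fdfi clips probabilities is not stated here.

```python
import numpy as np


def convert_to_link_sketch(predictions, link="identity", eps=1e-12):
    """Hypothetical stand-in for fdfi.utils.convert_to_link."""
    predictions = np.asarray(predictions, dtype=float)
    if link == "identity":
        return predictions
    if link == "logit":
        # Assumption: clip so that p = 0 or p = 1 does not produce +/-inf.
        p = np.clip(predictions, eps, 1.0 - eps)
        return np.log(p / (1.0 - p))
    raise ValueError(f"Unrecognized link function: {link!r}")


probs = np.array([0.1, 0.5, 0.9])
log_odds = convert_to_link_sketch(probs, link="logit")
print(log_odds)  # symmetric around 0, with logit(0.5) == 0
```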
The following link functions are supported:
- "identity": No transformation (default)
- "logit": Logit transformation for probability outputs
Additivity Check
- fdfi.utils.check_additivity(shap_values, predictions, base_value, tol=0.001)
Check if SHAP values satisfy the additivity property.
The additivity property states that the sum of SHAP values plus the base value should equal the prediction.
- Parameters:
shap_values (numpy.ndarray) – Feature importance values. Shape (n_samples, n_features).
predictions (numpy.ndarray) – Model predictions. Shape (n_samples,).
base_value (float) – Base value (expected output).
tol (float, default=1e-3) – Tolerance for checking equality.
- Returns:
bool – Whether additivity is satisfied.
float – Maximum absolute difference.
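- Return type:
tuple
This function verifies the SHAP additivity property:
\[ f(x) = \phi_0 + \sum_{j=1}^{d} \phi_j \]
where \(f(x)\) is the model prediction, \(\phi_0\) is the base value, and \(\phi_j\) are the feature attributions for the \(d\) features.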
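The check itself amounts to a row-wise sum and a comparison against the tolerance. The sketch below assumes that "satisfied" means the maximum absolute difference is at most tol; the name check_additivity_sketch is hypothetical and the exact comparison used by fdfi is not documented here.

```python
import numpy as np


def check_additivity_sketch(shap_values, predictions, base_value, tol=1e-3):
    """Hypothetical stand-in for fdfi.utils.check_additivity."""
    shap_values = np.asarray(shap_values, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # Base value plus the per-sample sum of attributions.
    reconstructed = base_value + shap_values.sum(axis=1)
    max_diff = float(np.max(np.abs(reconstructed - predictions)))
    return max_diff <= tol, max_diff


shap_values = np.array([[0.2, -0.1], [0.05, 0.3]])
ok, max_diff = check_additivity_sketch(shap_values, [0.6, 0.85], base_value=0.5)
bad_ok, bad_diff = check_additivity_sketch(shap_values, [0.6, 0.9], base_value=0.5)
print(ok, bad_ok)  # True False
```

In the second call the prediction for the second sample is off by 0.05, which exceeds the default tolerance.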
Feature Type Detection
- fdfi.utils.detect_feature_types(X, categorical_threshold=10, feature_types=None)
Auto-detect feature types from data.
- Returns a dict with:
‘binary’, ‘categorical’, ‘continuous’ indices
‘types’ array of labels per feature
‘ranges’ array of ranges per feature (for normalization)
This function auto-detects whether features are binary, categorical, or continuous based on the data distribution.
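One plausible detection rule is to count distinct values per column. The sketch below is an assumption about how such a heuristic could work, not the fdfi implementation: a column with at most two distinct values is labeled binary, one with at most categorical_threshold distinct values is labeled categorical, and anything else is continuous. The name detect_feature_types_sketch is hypothetical, and the feature_types override parameter is omitted.

```python
import numpy as np


def detect_feature_types_sketch(X, categorical_threshold=10):
    """Hypothetical heuristic based on the number of distinct values."""
    X = np.asarray(X, dtype=float)
    labels = []
    for j in range(X.shape[1]):
        n_unique = np.unique(X[:, j]).size
        if n_unique <= 2:
            labels.append("binary")
        elif n_unique <= categorical_threshold:
            labels.append("categorical")
        else:
            labels.append("continuous")
    types = np.array(labels)
    return {
        "binary": np.flatnonzero(types == "binary"),
        "categorical": np.flatnonzero(types == "categorical"),
        "continuous": np.flatnonzero(types == "continuous"),
        "types": types,
        # Per-feature range (max minus min), usable for normalization.
        "ranges": X.max(axis=0) - X.min(axis=0),
    }


X = np.column_stack([
    np.tile([0, 1], 6),        # 2 distinct values
    np.tile([0, 1, 2], 4),     # 3 distinct values
    np.linspace(0.0, 1.0, 12), # 12 distinct values
])
info = detect_feature_types_sketch(X)
print(info["types"])  # ['binary' 'categorical' 'continuous']
```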
Gower Distance
- fdfi.utils.gower_cost_matrix(X, Z, feature_types, feature_ranges, feature_weights=None)
Compute the Gower distance matrix for mixed-type data (continuous, binary,
and categorical features). Used by EOTExplainer when cost_metric="gower"
or cost_metric="auto".
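For reference, the standard Gower distance averages per-feature dissimilarities: a range-normalized absolute difference for continuous features and a 0/1 mismatch indicator for binary and categorical features. The sketch below follows that textbook definition. The name gower_cost_matrix_sketch is hypothetical, and it assumes feature_types is an array of string labels such as "continuous"; the format fdfi expects may differ.

```python
import numpy as np


def gower_cost_matrix_sketch(X, Z, feature_types, feature_ranges, feature_weights=None):
    """Textbook Gower distance between rows of X and rows of Z."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    feature_types = np.asarray(feature_types)
    ranges = np.asarray(feature_ranges, dtype=float)
    n_features = X.shape[1]
    if feature_weights is None:
        w = np.ones(n_features)
    else:
        w = np.asarray(feature_weights, dtype=float)

    # Pairwise absolute differences, shape (n_x, n_z, n_features).
    diff = np.abs(X[:, None, :] - Z[None, :, :])
    is_continuous = feature_types == "continuous"
    # Guard against zero ranges (constant features).
    safe_ranges = np.where(ranges > 0, ranges, 1.0)
    per_feature = np.where(is_continuous, diff / safe_ranges, (diff > 0).astype(float))
    # Weighted average over features.
    return (per_feature * w).sum(axis=2) / w.sum()


X = np.array([[0.0, 1, 0], [10.0, 0, 2]])
Z = np.array([[5.0, 1, 0]])
types = np.array(["continuous", "binary", "categorical"])
C = gower_cost_matrix_sketch(X, Z, types, feature_ranges=[10.0, 1.0, 2.0])
print(C.shape)  # (2, 1)
```

The first row of X differs from Z only in the continuous feature (0.5 after normalization), so its distance is 0.5 / 3. The second row also mismatches on both discrete features, giving 2.5 / 3.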
Diagnostics Utilities
- fdfi.utils.compute_latent_independence(Z, subset_size=None)
Compute pairwise distance correlation (dCor) between latent dimensions.
Lower off-diagonal values indicate greater independence of latent factors.
- Parameters:
Z (np.ndarray) – Latent representations. Shape (n_samples, n_latent_dims).
subset_size (int, optional) – If provided and n_samples > subset_size, randomly subsample for efficiency.
- Returns:
dcor_matrix (np.ndarray) – Pairwise distance correlation matrix. Shape (d, d).
median_dcor (float) – Median of off-diagonal entries as a single independence score.
- Return type:
tuple
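To make the returned quantities concrete, here is a minimal sketch of pairwise distance correlation using the biased (V-statistic) sample estimator with double-centered distance matrices. The name latent_independence_sketch and the seed parameter are hypothetical; which estimator and subsampling scheme fdfi uses is not stated here.

```python
import numpy as np


def _centered_distances(v):
    """Double-centered pairwise distance matrix of a 1-D sample."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def latent_independence_sketch(Z, subset_size=None, seed=0):
    Z = np.asarray(Z, dtype=float)
    n, d = Z.shape
    if subset_size is not None and n > subset_size:
        # Random row subsample to keep the O(n^2) distance matrices small.
        rng = np.random.default_rng(seed)
        Z = Z[rng.choice(n, size=subset_size, replace=False)]
    centered = [_centered_distances(Z[:, j]) for j in range(d)]
    dvar = np.array([np.mean(A * A) for A in centered])
    dcor = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            dcov2 = max(np.mean(centered[i] * centered[j]), 0.0)
            denom = np.sqrt(dvar[i] * dvar[j])
            value = np.sqrt(dcov2 / denom) if denom > 0 else 0.0
            dcor[i, j] = dcor[j, i] = value
    off_diagonal = dcor[~np.eye(d, dtype=bool)]
    return dcor, float(np.median(off_diagonal))


rng = np.random.default_rng(0)
z0 = rng.normal(size=300)
z1 = rng.normal(size=300)
Z = np.column_stack([z0, z1, z0 ** 2])  # third column depends on the first
dcor, median_dcor = latent_independence_sketch(Z)
print(dcor[0, 2] > dcor[0, 1])  # dependent pair scores higher
```

Unlike Pearson correlation, distance correlation picks up the nonlinear dependence between z0 and z0 ** 2, which is why it suits checking independence of latent factors.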
- fdfi.utils.compute_mmd(X_real, X_generated, sigma=1.0, subset_size=None)
Compute Maximum Mean Discrepancy (MMD) with a Gaussian RBF kernel.
Measures distributional distance between real and generated data. Lower values indicate better fidelity.
- Parameters:
X_real (np.ndarray) – Real data. Shape (n_real, n_features).
X_generated (np.ndarray) – Generated or reconstructed data. Shape (n_gen, n_features).
sigma (float, default=1.0) – Bandwidth for the Gaussian kernel.
subset_size (int, optional) – If provided, subsample each dataset to this size for efficiency.
- Returns:
Non-negative MMD score.
- Return type:
float
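A minimal sketch of a Gaussian-kernel MMD follows, using the biased estimator of the squared MMD with kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)). The name mmd_sketch is hypothetical. Whether fdfi returns the squared statistic or its square root, and whether it uses the biased or unbiased estimator, are not documented here, so treat those choices as assumptions. Subsampling via subset_size is omitted.

```python
import numpy as np


def _rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix between rows of A and rows of B."""
    sq_dists = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def mmd_sketch(X_real, X_generated, sigma=1.0):
    """Biased estimate of squared MMD (assumption: squared form)."""
    X = np.asarray(X_real, dtype=float)
    Y = np.asarray(X_generated, dtype=float)
    mmd2 = (
        _rbf_kernel(X, X, sigma).mean()
        + _rbf_kernel(Y, Y, sigma).mean()
        - 2.0 * _rbf_kernel(X, Y, sigma).mean()
    )
    return float(max(mmd2, 0.0))


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))
similar = rng.normal(size=(200, 2))
shifted = rng.normal(loc=2.0, size=(200, 2))
print(mmd_sketch(real, shifted) > mmd_sketch(real, similar))  # True
```

A sample drawn from the same distribution scores close to zero, while the shifted sample scores noticeably higher, matching the "lower is better" reading above.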
Statistical Utilities
TwoComponentMixture
- class fdfi.utils.TwoComponentMixture(n_components=2, random_state=0, min_samples=10)
Bases: object
Fit a two-component Gaussian mixture and extract quantiles.
Used for:
1. Variance floor estimation (from raw stds)
2. Practical significance margin (from point estimates)
The TwoComponentMixture class fits a two-component Gaussian mixture model
and is used for:
- Variance floor estimation: determining a minimum variance threshold for stable confidence intervals
- Practical significance margins: estimating reasonable effect size thresholds
Example:
from fdfi.utils import TwoComponentMixture
import numpy as np
# Fit mixture to standard errors
se_values = np.array([0.01, 0.02, 0.15, 0.18, 0.20, 0.25])
mixture = TwoComponentMixture().fit(se_values)
# Get quantile from smaller component
floor = mixture.quantile(0.95, component="smaller")
print(f"Variance floor: {floor}")
# Visualize the fit
mixture.plot(se_values)