Utilities

Overview

The fdfi.utils module provides helper functions and classes used across the FDFI package.

Input Validation

fdfi.utils.validate_input(X)[source]

Validate and convert input to numpy array.

Parameters:

X (array-like) – Input data to validate.

Returns:

Validated numpy array.

Return type:

numpy.ndarray

Raises:

ValueError – If input cannot be converted to a valid numpy array.

Data Sampling

fdfi.utils.sample_background(data, n_samples, random_state=None)[source]

Sample background data for explanations.

Parameters:
  • data (numpy.ndarray) – Full dataset to sample from.

  • n_samples (int) – Number of samples to draw.

  • random_state (int, optional) – Random seed for reproducibility.

Returns:

Sampled background data.

Return type:

numpy.ndarray
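As a rough illustration of what drawing a background set involves, the sketch below samples rows without replacement using a seeded NumPy generator. It is a stand-in written for this page, not the body of fdfi.utils.sample_background. The name sample_rows and the choice to return the whole dataset when n_samples is at least the number of rows are assumptions.

```python
import numpy as np


def sample_rows(data, n_samples, random_state=None):
    """Hypothetical stand-in: draw up to n_samples rows without replacement."""
    data = np.asarray(data)
    n_rows = data.shape[0]
    if n_samples >= n_rows:
        # Assumption: nothing to subsample, so return a copy of everything.
        return data.copy()
    rng = np.random.default_rng(random_state)
    idx = rng.choice(n_rows, size=n_samples, replace=False)
    return data[idx]


X = np.arange(40, dtype=float).reshape(20, 2)
background = sample_rows(X, n_samples=5, random_state=0)
print(background.shape)  # (5, 2)
```

Passing the same random_state twice yields the same rows, which is the reproducibility the parameter description refers to.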

Feature Names

fdfi.utils.get_feature_names(data, feature_names=None)[source]

Get or generate feature names.

Parameters:
  • data (array-like) – Data to get feature names for.

  • feature_names (list, optional) – User-provided feature names.

Returns:

Feature names.

Return type:

list

Additivity Check

fdfi.utils.check_additivity(shap_values, predictions, base_value, tol=0.001)[source]

Check if SHAP values satisfy the additivity property.

The additivity property states that the sum of SHAP values plus the base value should equal the prediction.

Parameters:
  • shap_values (numpy.ndarray) – Feature importance values. Shape (n_samples, n_features).

  • predictions (numpy.ndarray) – Model predictions. Shape (n_samples,).

  • base_value (float) – Base value (expected output).

  • tol (float, default=1e-3) – Tolerance for checking equality.

Returns:

  • bool – Whether additivity is satisfied.

  • float – Maximum absolute difference.

Return type:

Tuple[bool, float]

This function verifies the SHAP additivity property:

\[f(x) = \phi_0 + \sum_{j=1}^{d} \phi_j\]

where \(\phi_0\) is the base value and \(\phi_j\) are the feature attributions.
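The check described above can be restated in a few lines of NumPy. This is a minimal sketch, not the library's implementation: the name additivity_gap is hypothetical, and treating a difference exactly equal to tol as passing is an assumption.

```python
import numpy as np


def additivity_gap(shap_values, predictions, base_value, tol=1e-3):
    """Hypothetical restatement of the check: returns (ok, max_abs_diff)."""
    shap_values = np.asarray(shap_values, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # Reconstruct each prediction as base value plus the row sum of attributions.
    reconstructed = base_value + shap_values.sum(axis=1)
    max_diff = float(np.max(np.abs(reconstructed - predictions)))
    # Assumption: the comparison is inclusive of the tolerance.
    return max_diff <= tol, max_diff


# Linear model f(x) = 2*x1 - x2 + 0.5 with a zero baseline input:
# the exact attributions are phi_1 = 2*x1, phi_2 = -x2, phi_0 = 0.5.
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
phi = np.column_stack([2.0 * X[:, 0], -X[:, 1]])
preds = 2.0 * X[:, 0] - X[:, 1] + 0.5

ok, gap = additivity_gap(phi, preds, base_value=0.5)
print(ok, gap)
```

Shifting the predictions by a constant larger than tol makes the check fail, and the second return value reports the size of the shift.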

Feature Type Detection

fdfi.utils.detect_feature_types(X, categorical_threshold=10, feature_types=None)[source]

Auto-detect feature types from data.

Returns a dict with:
  • 'binary', 'categorical', 'continuous' indices

  • 'types' array of labels per feature

  • 'ranges' array of ranges per feature (for normalization)

Return type:

dict

This function auto-detects whether features are binary, categorical, or continuous based on the data distribution.
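The page does not state the detection rule, so the following is only one plausible heuristic, sketched under stated assumptions: a column with at most two distinct values is labelled binary, one with at most categorical_threshold distinct values is labelled categorical, and everything else is continuous. The name guess_feature_types and the label strings are hypothetical, and the feature_types override is left out.

```python
import numpy as np


def guess_feature_types(X, categorical_threshold=10):
    """Hypothetical heuristic based on the number of unique values per column."""
    X = np.asarray(X, dtype=float)
    types, ranges = [], []
    for j in range(X.shape[1]):
        col = X[:, j]
        n_unique = np.unique(col).size
        if n_unique <= 2:
            types.append("binary")
        elif n_unique <= categorical_threshold:
            types.append("categorical")
        else:
            types.append("continuous")
        # Range (max - min) per feature, usable later for normalization.
        ranges.append(float(col.max() - col.min()))
    types = np.array(types)
    return {
        "binary": np.flatnonzero(types == "binary"),
        "categorical": np.flatnonzero(types == "categorical"),
        "continuous": np.flatnonzero(types == "continuous"),
        "types": types,
        "ranges": np.array(ranges),
    }


rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, size=200),   # two distinct values
    rng.integers(0, 5, size=200),   # a handful of distinct values
    rng.normal(size=200),           # effectively all distinct
])
info = guess_feature_types(X)
print(info["types"])
```

The returned dict mirrors the keys listed above, so its 'types' and 'ranges' entries have the shape a Gower-style distance would need.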

Gower Distance

fdfi.utils.gower_cost_matrix(X, Z, feature_types, feature_ranges, feature_weights=None)[source]

Compute Gower distance matrix for mixed-type data.

Return type:

ndarray

The distance handles continuous, binary, and categorical features. EOTExplainer uses this function when cost_metric="gower" or cost_metric="auto".
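For readers unfamiliar with the metric, here is a sketch of the textbook Gower distance: continuous features contribute a range-normalized absolute difference, binary and categorical features contribute a 0/1 mismatch, and the contributions are combined as a weighted average. It is not the library's code. The name gower_matrix_sketch, the string labels in feature_types, and the handling of zero ranges are assumptions.

```python
import numpy as np


def gower_matrix_sketch(X, Z, feature_types, feature_ranges, feature_weights=None):
    """Textbook Gower distance between rows of X (n, d) and rows of Z (m, d)."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    d = X.shape[1]
    if feature_weights is None:
        weights = np.ones(d)
    else:
        weights = np.asarray(feature_weights, dtype=float)
    total = np.zeros((X.shape[0], Z.shape[0]))
    for j in range(d):
        diff = np.abs(X[:, j][:, None] - Z[:, j][None, :])
        if feature_types[j] == "continuous":
            # Range-normalized absolute difference; a zero range contributes 0.
            r = feature_ranges[j]
            part = diff / r if r > 0 else np.zeros_like(diff)
        else:
            # Binary and categorical features: simple 0/1 mismatch.
            part = (diff > 0).astype(float)
        total += weights[j] * part
    return total / weights.sum()


X = np.array([[0.0, 1.0, 2.0], [10.0, 0.0, 2.0]])
Z = np.array([[5.0, 1.0, 3.0]])
types = ["continuous", "binary", "categorical"]
ranges = [10.0, 1.0, 3.0]
D = gower_matrix_sketch(X, Z, types, ranges)
print(D)
```

The first row of X differs from Z by half the range on the continuous feature and by one categorical mismatch, giving (0.5 + 0 + 1) / 3 = 0.5. Values outside the supplied ranges are not clipped in this sketch.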

Diagnostics Utilities

fdfi.utils.compute_latent_independence(Z, subset_size=None)[source]

Compute pairwise distance correlation (dCor) between latent dimensions.

Lower off-diagonal values indicate greater independence of latent factors.

Parameters:
  • Z (np.ndarray) – Latent representations. Shape (n_samples, n_latent_dims).

  • subset_size (int, optional) – If provided and n_samples > subset_size, randomly subsample for efficiency.

Returns:

  • dcor_matrix (np.ndarray) – Pairwise distance correlation matrix. Shape (d, d).

  • median_dcor (float) – Median of off-diagonal entries as a single independence score.

Return type:

Tuple[ndarray, float]
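To make the two return values concrete, the sketch below computes the standard sample distance correlation from double-centered distance matrices, fills a symmetric matrix with it, and takes the median of the off-diagonal entries. The helper names are hypothetical and the subset_size subsampling step is omitted. This is an illustration of the statistic, not the package's implementation.

```python
import numpy as np


def _centered_distances(v):
    """Double-centered pairwise distance matrix of a 1-D sample."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcor_pair(a, b):
    """Sample distance correlation between two 1-D arrays of equal length."""
    A, B = _centered_distances(a), _centered_distances(b)
    dcov2 = (A * B).mean()
    dvar2 = (A * A).mean() * (B * B).mean()
    if dvar2 <= 0:
        return 0.0
    # dCor^2 = dCov^2 / sqrt(dVar_a^2 * dVar_b^2); guard tiny negative rounding.
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


def latent_dcor_sketch(Z):
    """Pairwise dCor matrix plus the median of its off-diagonal entries."""
    Z = np.asarray(Z, dtype=float)
    d = Z.shape[1]
    M = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            M[i, j] = M[j, i] = dcor_pair(Z[:, i], Z[:, j])
    off_diag = M[~np.eye(d, dtype=bool)]
    return M, float(np.median(off_diag))


rng = np.random.default_rng(0)
z0 = rng.normal(size=300)
# Column 1 depends on column 0 nonlinearly; column 2 is independent noise.
Z = np.column_stack([z0, z0**2, rng.normal(size=300)])
dcor_matrix, median_dcor = latent_dcor_sketch(Z)
print(np.round(dcor_matrix, 2), median_dcor)
```

Unlike Pearson correlation, distance correlation picks up the quadratic dependence between the first two columns, which is why it suits a check for independent latent factors.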

fdfi.utils.compute_mmd(X_real, X_generated, sigma=1.0, subset_size=None)[source]

Compute Maximum Mean Discrepancy (MMD) with a Gaussian RBF kernel.

Measures distributional distance between real and generated data. Lower values indicate better fidelity.

Parameters:
  • X_real (np.ndarray) – Real data. Shape (n_real, n_features).

  • X_generated (np.ndarray) – Generated or reconstructed data. Shape (n_gen, n_features).

  • sigma (float, default=1.0) – Bandwidth for the Gaussian kernel.

  • subset_size (int, optional) – If provided, subsample each dataset to this size for efficiency.

Returns:

Non-negative MMD score.

Return type:

float
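The following sketch shows one common way to estimate MMD with a Gaussian RBF kernel: the biased (V-statistic) estimator, which is non-negative by construction. Several details are assumptions, since the page does not spell them out: the kernel is parameterized as exp(-||a - b||^2 / (2 sigma^2)), the square root of the squared estimate is returned, and subsampling via subset_size is omitted. The names rbf_kernel and mmd_sketch are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def mmd_sketch(X_real, X_generated, sigma=1.0):
    """Biased (V-statistic) MMD estimate; non-negative up to rounding."""
    X = np.asarray(X_real, dtype=float)
    Y = np.asarray(X_generated, dtype=float)
    mmd2 = (
        rbf_kernel(X, X, sigma).mean()
        + rbf_kernel(Y, Y, sigma).mean()
        - 2.0 * rbf_kernel(X, Y, sigma).mean()
    )
    # Assumption: report MMD rather than MMD^2.
    return float(np.sqrt(max(mmd2, 0.0)))


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))
close = rng.normal(size=(200, 2))            # same distribution as `real`
shifted = rng.normal(loc=2.0, size=(200, 2))  # mean shifted by 2 in each axis

print(mmd_sketch(real, close), mmd_sketch(real, shifted))
```

A sample drawn from the same distribution scores close to zero, while the shifted sample scores clearly higher, matching the "lower is better" reading above.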

Statistical Utilities

TwoComponentMixture

class fdfi.utils.TwoComponentMixture(n_components=2, random_state=0, min_samples=10)[source]

Bases: object

Fit a two-component Gaussian mixture and extract quantiles.

Used for:

  1. Variance floor estimation (from raw stds)

  2. Practical significance margin (from point estimates)

Parameters:
  • n_components (int)

  • random_state (int)

  • min_samples (int)

n_components: int = 2
random_state: int = 0
min_samples: int = 10
means_: ndarray = None
stds_: ndarray = None
weights_: ndarray = None
gmm_: object = None
method_used_: str = None
fit(x)[source]
Parameters:

x (ndarray)

Return type:

TwoComponentMixture

quantile(q, component='larger')[source]
Return type:

float

plot(x, ax=None, **kwargs)[source]
Parameters:

x (ndarray)

__init__(n_components=2, random_state=0, min_samples=10)
Parameters:
  • n_components (int)

  • random_state (int)

  • min_samples (int)

Return type:

None

The TwoComponentMixture class fits a two-component Gaussian mixture model and is used for:

  1. Variance floor estimation: Determining a minimum variance threshold for stable confidence intervals

  2. Practical significance margins: Estimating reasonable effect size thresholds

Example:

from fdfi.utils import TwoComponentMixture
import numpy as np

# Fit mixture to standard errors
se_values = np.array([0.01, 0.02, 0.15, 0.18, 0.20, 0.25])
mixture = TwoComponentMixture().fit(se_values)

# Get quantile from smaller component
floor = mixture.quantile(0.95, component="smaller")
print(f"Variance floor: {floor}")

# Visualize the fit
mixture.plot(se_values)
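Once a mixture is fitted, a component quantile reduces to the inverse CDF of a single Gaussian. The sketch below shows that step with the standard library only. It assumes that "larger" and "smaller" refer to the component with the larger or smaller mean, which this page does not confirm. The function name and the parameter values are illustrative, not fitted results.

```python
from statistics import NormalDist


def component_quantile(means, stds, q, component="larger"):
    """Quantile of one Gaussian component, chosen by its mean (an assumption)."""
    order = sorted(range(len(means)), key=lambda k: means[k])
    k = order[-1] if component == "larger" else order[0]
    # Inverse CDF of N(mean_k, std_k^2) evaluated at probability q.
    return NormalDist(mu=means[k], sigma=stds[k]).inv_cdf(q)


# Illustrative component parameters, not fitted values.
means, stds = [0.015, 0.195], [0.005, 0.04]
floor = component_quantile(means, stds, 0.95, component="smaller")
print(floor)
```

With these illustrative numbers the result is the smaller component's mean plus roughly 1.645 of its standard deviations.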