TotalSobolKNN

We now demonstrate how to use TotalSobolKNN for estimating Total Sobol’ indices (Sobol’, 2001) from scattered data. If you have not installed pyfirst, please uncomment and run %pip install pyfirst below before proceeding.

# %pip install pyfirst

Imports

import numpy as np
from pyfirst import TotalSobolKNN

Simulate Data

We simulate clean data from the Ishigami function

$$ y = f(X) = \sin(X_{1}) + 7\sin^2(X_{2}) + 0.1X_{3}^{4}\sin(X_{1}), $$

where the input \(X\) are independent features uniformly distributed on \([-\pi,\pi]^{3}\).

def ishigami(x):
    x = -np.pi + 2 * np.pi * x
    y = np.sin(x[0]) + 7 * np.sin(x[1])**2 + 0.1 * x[2]**4 * np.sin(x[0])
    return y

np.random.seed(43)
n = 10000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X)

Run TotalSobolKNN

TotalSobolKNN(X, y, noise=False)
array([0.56029525, 0.44909099, 0.24883668])

The analytical Total Sobol’ indices for the Ishigami function are 0.558, 0.442, 0.244 respectively. We can see that TotalSobolKNN yields accurate estimation. Since the data is clean/noiseless, TotalSobolKNN implements the Nearest-Neighbor estimator from Broto et al. (2020).

Speeding Up TotalSobolKNN

Parallel Computation

If multiple processors are available, TotalSobolKNN is supported to run in parallel for acceleration via the argument n_jobs.

TotalSobolKNN(X, y, noise=False, n_jobs=4)
array([0.56029525, 0.44909099, 0.24883668])

Using Subsamples

The use of subsamples to accelerate computation of the outer loop expectation is available via the argument n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples could provide better approximation for the full data at a higher computational cost.

Random Subsamples

TotalSobolKNN(X, y, noise=False, n_mc=1000, random_state=43)
array([0.51637338, 0.43505694, 0.2499616 ])

Twinning Subsamples

TotalSobolKNN(X, y, noise=False, n_mc=1000, twin_mc=True, random_state=43)
array([0.55400658, 0.45723359, 0.26196488])

All the Tricks Together

Using all the speed-up tricks, we can easily run TotalSobolKNN on dataset with a million instances.

np.random.seed(43)
n = 1000000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X)
TotalSobolKNN(X, y, noise=False, n_mc=5000, approx_knn=True, n_jobs=4, random_state=43)
array([0.56205949, 0.44260965, 0.23462873])

Noisy Data

We now look at the estimation performance on the noisy data \(y = f(X) + \epsilon\) where \(\epsilon\sim\mathcal{N}(0,1)\) is the random error. For noisy data, TotalSobolKNN implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2025), which corrects the bias by the Nearest-Neighbor estimator from Broto et al. (2020) when applied on noisy data.

np.random.seed(43)
n = 10000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X) + np.random.normal(size=n)

TotalSobolKNN(X, y, noise=True)
array([0.56334545, 0.43502825, 0.23528718])

For more details about TotalSobolKNN, please Huang and Joseph (2025).

References

Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics.

Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280.

Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.

Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.