TotalSobolKNN

We now demonstrate how to use TotalSobolKNN for estimating Total Sobol’ indices (Sobol’, 2001) from scattered data. If you have not installed pyfirst, please uncomment and run %pip install pyfirst below before proceeding.

# %pip install pyfirst

Imports

import numpy as np
from pyfirst import TotalSobolKNN

Simulate Data

We simulate clean data from the Ishigami function

$$ y = f(X) = \sin(X_{1}) + 7\sin^2(X_{2}) + 0.1X_{3}^{4}\sin(X_{1}), $$

where the input $X$ are independent features uniformly distributed on $[-\pi,\pi]^{3}$.

def ishigami(x):
    x = -np.pi + 2 * np.pi * x
    y = np.sin(x[0]) + 7 * np.sin(x[1])**2 + 0.1 * x[2]**4 * np.sin(x[0])
    return y

np.random.seed(43)
n = 10000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X)

Run TotalSobolKNN

TotalSobolKNN(X, y, noise=False)

array([0.56029525, 0.44909099, 0.24883668])

The analytical Total Sobol’ indices for the Ishigami function are 0.558, 0.442, 0.244 respectively. We can see that TotalSobolKNN yields accurate estimation. Since the data is clean/noiseless, TotalSobolKNN implements the Nearest-Neighbor estimator from Broto et al. (2020).

Speeding Up TotalSobolKNN

Parallel Computation

If multiple processors are available, TotalSobolKNN is supported to run in parallel for acceleration via the argument n_jobs.

TotalSobolKNN(X, y, noise=False, n_jobs=4)

array([0.56029525, 0.44909099, 0.24883668])

Approximate Nearest-Neighbor Search

TotalSobolKNN involves many nearest-neighbor searches. Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with approximate search (approx_knn=True) by the inverted file index (IVF) is also supported in the implementation. IVF reduces the search scope through first clustering data into Voronoi cells.

TotalSobolKNN(X, y, noise=False, approx_knn=True)

array([0.56031995, 0.44917851, 0.24896301])

Using Subsamples

The use of subsamples to accelerate computation of the outer loop expectation is available via the argument n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples could provide better approximation for the full data at a higher computational cost.

Random Subsamples

TotalSobolKNN(X, y, noise=False, n_mc=1000, random_state=43)

array([0.51637338, 0.43505694, 0.2499616 ])

Twinning Subsamples

TotalSobolKNN(X, y, noise=False, n_mc=1000, twin_mc=True, random_state=43)

array([0.55400658, 0.45723359, 0.26196488])

All the Tricks Together

Using all the speed-up tricks, we can easily run TotalSobolKNN on dataset with a million instances.

np.random.seed(43)
n = 1000000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X)

TotalSobolKNN(X, y, noise=False, n_mc=5000, approx_knn=True, n_jobs=4, random_state=43)

array([0.56205949, 0.44260965, 0.23462873])

Noisy Data

We now look at the estimation performance on the noisy data $y = f(X) + \epsilon$ where $\epsilon\sim\mathcal{N}(0,1)$ is the random error. For noisy data, TotalSobolKNN implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2025), which corrects the bias by the Nearest-Neighbor estimator from Broto et al. (2020) when applied on noisy data.

np.random.seed(43)
n = 10000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X) + np.random.normal(size=n)

TotalSobolKNN(X, y, noise=True)

array([0.56334545, 0.43502825, 0.23528718])

For more details about TotalSobolKNN, please Huang and Joseph (2025).

References

Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics.

Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280.

Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.

Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.