TotalSobolKNN

We now demonstrate how to use TotalSobolKNN for estimating Total Sobol’ indices (Sobol’, 2001) from scattered data. If you have not installed pyfirst, please uncomment and run %pip install pyfirst below before proceeding.

# %pip install pyfirst

Imports

import numpy as np
from pyfirst import TotalSobolKNN

Simulate Data

We simulate clean data from the Ishigami function

$$ y = f(X) = \sin(X_{1}) + 7\sin^2(X_{2}) + 0.1X_{3}^{4}\sin(X_{1}), $$

where the input \(X\) are independent features uniformly distributed on \([-\pi,\pi]^{3}\).

def ishigami(x):
    x = -np.pi + 2 * np.pi * x
    y = np.sin(x[0]) + 7 * np.sin(x[1])**2 + 0.1 * x[2]**4 * np.sin(x[0])
    return y

np.random.seed(43)
n = 10000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X)

Run TotalSobolKNN

TotalSobolKNN(X, y, noise=False)
array([0.56029525, 0.44909099, 0.24883668])

The analytical Total Sobol’ indices for the Ishigami function are 0.558, 0.442, 0.244 respectively. We can see that TotalSobolKNN yields accurate estimation. Since the data is clean/noiseless, TotalSobolKNN implements the Nearest-Neighbor estimator from Broto et al. (2020).

Speeding Up TotalSobolKNN

Parallel Computation

If multiple processors are available, TotalSobolKNN is supported to run in parallel for acceleration via the argument n_jobs.

TotalSobolKNN(X, y, noise=False, n_jobs=4)
array([0.56029525, 0.44909099, 0.24883668])

Using Subsamples

The use of subsamples to accelerate computation of the outer loop expectation is available via the argument n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples could provide better approximation for the full data at a higher computational cost.

Random Subsamples

TotalSobolKNN(X, y, noise=False, n_mc=1000, random_state=43)
array([0.51637338, 0.43505694, 0.2499616 ])

Twinning Subsamples

TotalSobolKNN(X, y, noise=False, n_mc=1000, twin_mc=True, random_state=43)
array([0.55400658, 0.45723359, 0.26196488])

All the Tricks Together

Using all the speed-up tricks, we can easily run TotalSobolKNN on dataset with a million instances.

np.random.seed(43)
n = 1000000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X)
TotalSobolKNN(X, y, noise=False, n_mc=5000, approx_knn=True, n_jobs=4, random_state=43)
array([0.56205949, 0.44260965, 0.23462873])

Noisy Data

We now look at the estimation performance on the noisy data \(y = f(X) + \epsilon\) where \(\epsilon\sim\mathcal{N}(0,1)\) is the random error. For noisy data, TotalSobolKNN implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2024), which corrects the bias by the Nearest-Neighbor estimator from Broto et al. (2020) when applied on noisy data.

np.random.seed(43)
n = 10000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X) + np.random.normal(size=n)

TotalSobolKNN(X, y, noise=True)
array([0.56334545, 0.43502825, 0.23528718])

For more details about TotalSobolKNN, please Huang and Joseph (2024).

References

Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.

Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280.

Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.

Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.