TotalSobolKNN
We now demonstrate how to use TotalSobolKNN for estimating Total Sobol’ indices (Sobol’, 2001) from scattered data. If you have not installed pyfirst, please uncomment and run %pip install pyfirst below before proceeding.
# %pip install pyfirst
Imports
import numpy as np
from pyfirst import TotalSobolKNN
Simulate Data
We simulate clean data from the Ishigami function
$$ y = f(X) = \sin(X_{1}) + 7\sin^2(X_{2}) + 0.1X_{3}^{4}\sin(X_{1}), $$
where the input \(X\) are independent features uniformly distributed on \([-\pi,\pi]^{3}\).
def ishigami(x):
x = -np.pi + 2 * np.pi * x
y = np.sin(x[0]) + 7 * np.sin(x[1])**2 + 0.1 * x[2]**4 * np.sin(x[0])
return y
np.random.seed(43)
n = 10000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X)
Run TotalSobolKNN
TotalSobolKNN(X, y, noise=False)
array([0.56029525, 0.44909099, 0.24883668])
The analytical Total Sobol’ indices for the Ishigami function are 0.558, 0.442, 0.244 respectively. We can see that TotalSobolKNN yields accurate estimation. Since the data is clean/noiseless, TotalSobolKNN implements the Nearest-Neighbor estimator from Broto et al. (2020).
Speeding Up TotalSobolKNN
Parallel Computation
If multiple processors are available, TotalSobolKNN is supported to run in parallel for acceleration via the argument n_jobs.
TotalSobolKNN(X, y, noise=False, n_jobs=4)
array([0.56029525, 0.44909099, 0.24883668])
Approximate Nearest-Neighbor Search
TotalSobolKNN involves many nearest-neighbor searches. Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with approximate search (approx_knn=True) by the inverted file index (IVF) is also supported in the implementation. IVF reduces the search scope through first clustering data into Voronoi cells.
TotalSobolKNN(X, y, noise=False, approx_knn=True)
array([0.56031995, 0.44917851, 0.24896301])
Using Subsamples
The use of subsamples to accelerate computation of the outer loop expectation is available via the argument n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples could provide better approximation for the full data at a higher computational cost.
Random Subsamples
TotalSobolKNN(X, y, noise=False, n_mc=1000, random_state=43)
array([0.51637338, 0.43505694, 0.2499616 ])
Twinning Subsamples
TotalSobolKNN(X, y, noise=False, n_mc=1000, twin_mc=True, random_state=43)
array([0.55400658, 0.45723359, 0.26196488])
All the Tricks Together
Using all the speed-up tricks, we can easily run TotalSobolKNN on dataset with a million instances.
np.random.seed(43)
n = 1000000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X)
TotalSobolKNN(X, y, noise=False, n_mc=5000, approx_knn=True, n_jobs=4, random_state=43)
array([0.56205949, 0.44260965, 0.23462873])
Noisy Data
We now look at the estimation performance on the noisy data \(y = f(X) + \epsilon\) where \(\epsilon\sim\mathcal{N}(0,1)\) is the random error. For noisy data, TotalSobolKNN implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2024), which corrects the bias by the Nearest-Neighbor estimator from Broto et al. (2020) when applied on noisy data.
np.random.seed(43)
n = 10000
p = 3
X = np.random.uniform(size=(n,p))
y = np.apply_along_axis(ishigami, 1, X) + np.random.normal(size=n)
TotalSobolKNN(X, y, noise=True)
array([0.56334545, 0.43502825, 0.23528718])
For more details about TotalSobolKNN, please Huang and Joseph (2024).
References
Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.
Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280.
Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.
Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.