{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TotalSobolKNN\n", "\n", "We now demonstrate how to use `TotalSobolKNN` for estimating Total Sobol' indices (Sobol', 2001) from scattered data. If you have not installed `pyfirst`, please uncomment and run `%pip install pyfirst` below before proceeding. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# %pip install pyfirst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from pyfirst import TotalSobolKNN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulate Data\n", "\n", "We simulate clean data from the Ishigami function \n", "\n", "$$\n", " y = f(X) = \\sin(X_{1}) + 7\\sin^2(X_{2}) + 0.1X_{3}^{4}\\sin(X_{1}),\n", "$$\n", "\n", "where the input $X$ are independent features uniformly distributed on $[-\\pi,\\pi]^{3}$." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def ishigami(x):\n", " x = -np.pi + 2 * np.pi * x\n", " y = np.sin(x[0]) + 7 * np.sin(x[1])**2 + 0.1 * x[2]**4 * np.sin(x[0])\n", " return y\n", "\n", "np.random.seed(43)\n", "n = 10000\n", "p = 3\n", "X = np.random.uniform(size=(n,p))\n", "y = np.apply_along_axis(ishigami, 1, X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run TotalSobolKNN" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.56029532, 0.44909099, 0.24883678])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TotalSobolKNN(X, y, noise=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The analytical Total Sobol' indices for the Ishigami function are 0.558, 0.442, 0.244 respectively. We can see that `TotalSobolKNN` yields accurate estimation. Since the data is clean/noiseless, `TotalSobolKNN` implements the Nearest-Neighbor estimator from Broto et al. (2020)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Speeding Up TotalSobolKNN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parallel Computation\n", "\n", "If multiple processors are available, `TotalSobolKNN` is supported to run in parallel for acceleration via the argument `n_jobs`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.56029532, 0.44909099, 0.24883678])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TotalSobolKNN(X, y, noise=False, n_jobs=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Approximate Nearest-Neighbor Search\n", "\n", "`TotalSobolKNN` involves many nearest-neighbor searches. Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with approximate search (`approx_knn=True`) by the inverted file index (IVF) is also supported in the implementation. IVF reduces the search scope through first clustering data into Voronoi cells. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.56031995, 0.44917851, 0.24896301])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TotalSobolKNN(X, y, noise=False, approx_knn=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Subsamples\n", "\n", "The use of subsamples to accelerate computation of the outer loop expectation is available via the argument `n_mc`. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples could provide better approximation for the full data at a higher computational cost. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Random Subsamples" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.51637338, 0.43505694, 0.2499616 ])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TotalSobolKNN(X, y, noise=False, n_mc=1000, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Twinning Subsamples" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.55400658, 0.45723359, 0.26196488])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TotalSobolKNN(X, y, noise=False, n_mc=1000, twin_mc=True, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### All the Tricks Together\n", "\n", "Using all the speed-up tricks, we can easily run `TotalSobolKNN` on dataset with a ***million*** instances." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "np.random.seed(43)\n", "n = 1000000\n", "p = 3\n", "X = np.random.uniform(size=(n,p))\n", "y = np.apply_along_axis(ishigami, 1, X)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.56205949, 0.44260965, 0.23462873])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TotalSobolKNN(X, y, noise=False, n_mc=5000, approx_knn=True, n_jobs=4, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Noisy Data\n", "\n", "We now look at the estimation performance on the noisy data $y = f(X) + \\epsilon$ where $\\epsilon\\sim\\mathcal{N}(0,1)$ is the random error. For noisy data, `TotalSobolKNN` implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2025), which corrects the bias by the Nearest-Neighbor estimator from Broto et al. (2020) when applied on noisy data." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.56334077, 0.43502825, 0.23528587])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.seed(43)\n", "n = 10000\n", "p = 3\n", "X = np.random.uniform(size=(n,p))\n", "y = np.apply_along_axis(ishigami, 1, X) + np.random.normal(size=n)\n", "\n", "TotalSobolKNN(X, y, noise=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more details about `TotalSobolKNN`, please Huang and Joseph (2025)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics.\n", "\n", "Sobol', I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280.\n", " \n", "Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.\n", "\n", "Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.\n", " \n", "Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }