{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# FIRST\n", "\n", "We now demonstrate how to use `FIRST` for factor importance ranking and selection. If you have not installed `pyfirst`, please uncomment and run `%pip install pyfirst` below before proceeding. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# %pip install pyfirst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from pyfirst import FIRST\n", "from sklearn.datasets import make_friedman1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulate Data\n", "\n", "We simulate noisy data from the Friedman function \n", "\n", "$$\n", " y = f(X) + \\epsilon = 10\\sin(\\pi X_{1}X_{2}) + 20(X_{3}-0.5)^2 + 10X_{4} + 5X_{5} + \\epsilon,\n", "$$\n", "\n", "where the input $X$ are independent features uniformly distributed on unit hypercube and $\\epsilon\\sim\\mathcal{N}(0,1)$ is independent of input $X$. Here only the first 5 features are used, and the remaining are independent of $y$. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "X, y = make_friedman1(n_samples=10000, n_features=10, noise=1.0, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run FIRST" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, True, True, True, True, False, False, False, False,\n", " False])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FIRST(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, `FIRST` identifies the first 5 factors as important, and the remaining are not taking part in the prediction of the response." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choice for Forward Selection\n", "\n", "`FIRST` belongs to the class of forward-backward selection with early dropping algorithm (Borboudakis and Tsamardinos, 2019). In forward selection, each time we find the candidate that maximizes the output variance that can be explained. For candidates that cannot improve the variance explained conditional on the selected factors, they are removed from the candidate set. This forward selection step is run `n_forward` times to trade off between accuracy and efficiency. `n_forward = 2` is recommended in Yu et al. (2020) and it is the default choice. To run the complete forward selection (see code below), please set `n_forward` to the number of factors / predictors. In backward elimination, we again remove one factor at a time, starting with the factor that can improve the explained variance most, till no factor can further improve." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting forward selection...\n", "\n", "Phase-1 Forward Selection...\n", "\n", "current selection: \n", "current variance explained: 0.000\n", "candidate to add: 0(5.257) 1(5.041) 2(1.962) 3(8.580) 4(2.028) 5(0.454) 6(0.044) 7(0.379) 8(0.000) 9(0.000)\n", "add candidate 3(8.580).\n", "\n", "current selection: 3\n", "current variance explained: 8.580\n", "candidate to add: 0(13.588) 1(13.179) 2(11.062) 4(11.476) 5(8.588) 6(8.312) 7(8.761)\n", "add candidate 0(13.588).\n", "\n", "current selection: 3 0\n", "current variance explained: 13.588\n", "candidate to add: 1(20.105) 2(16.034) 4(15.731) 5(13.345) 7(13.543)\n", "add candidate 1(20.105).\n", "\n", "current selection: 3 0 1\n", "current variance explained: 20.105\n", "candidate to add: 2(22.122) 4(21.959)\n", "add candidate 2(22.122).\n", "\n", "current selection: 3 0 1 2\n", "current variance explained: 22.122\n", "candidate to add: 4(23.788)\n", "add candidate 4(23.788).\n", "\n", "Phase-2 Forward Selection...\n", "\n", "current selection: 3 0 1 2 4\n", "current variance explained: 23.788\n", "candidate to add: 5(23.224) 6(23.214) 7(23.196) 8(23.198) 9(23.220)\n", "early termination since none of the candidates can be added in this phase.\n", "\n", "Starting backward elimination...\n", "\n", "current selection: 0 1 2 3 4\n", "current variance explained: 23.788\n", "candidate to remove: 0(17.487) 1(18.108) 2(21.959) 3(16.038) 4(22.122)\n" ] }, { "data": { "text/plain": [ "array([ True, True, True, True, True, False, False, False, False,\n", " False])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FIRST(X, y, n_forward=X.shape[1], verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Speeding Up FIRST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parallel Computation\n", "\n", "If multiple processors are available, `FIRST` is supported to run in parallel for acceleration via the argument `n_jobs`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, True, True, True, True, False, False, False, False,\n", " False])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FIRST(X, y, n_jobs=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Approximate Nearest-Neighbor Search\n", "\n", "`FIRST` requires many nearest-neighbor searches. Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with approximate search (`approx_knn=True`) by the inverted file index (IVF) is also supported in the implementation. IVF reduces the search scope through first clustering data into Voronoi cells. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, True, True, True, True, False, False, False, False,\n", " False])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FIRST(X, y, approx_knn=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Subsamples\n", "\n", "The use of subsamples to accelerate computation of the outer loop expectation is available via the argument `n_mc`. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples could provide better approximation for the full data at a higher computational cost. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Random Subsamples" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, True, True, True, True, False, False, False, False,\n", " False])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FIRST(X, y, n_mc=1000, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Twinning Subsamples" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, True, True, True, True, False, False, False, False,\n", " False])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FIRST(X, y, n_mc=1000, twin_mc=True, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### All the Tricks Together\n", "\n", "Using all the speed-up tricks, we can easily run `FIRST` on dataset with a ***million*** instances." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ True, True, True, True, True, False, False, False, False,\n", " False])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y = make_friedman1(n_samples=1000000, n_features=10, noise=1.0, random_state=43)\n", "FIRST(X, y, n_mc=1000, approx_knn=True, n_jobs=4, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more details about `FIRST`, please Huang and Joseph (2025)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics.\n", "\n", "Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1), 276-314.\n", " \n", "Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., & Wu, X. (2020). Causality-based feature selection: Methods and evaluations. ACM Computing Surveys (CSUR), 53(5), 1-36.\n", "\n", "Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.\n", " \n", "Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }