pyfirst ======= .. py:module:: pyfirst Classes ------- .. autoapisummary:: pyfirst.SelectByFIRST Functions --------- .. autoapisummary:: pyfirst.TotalSobolKNN pyfirst.ShapleySobolKNN pyfirst.FIRSTRank pyfirst.FIRST Package Contents ---------------- .. py:function:: TotalSobolKNN(X: Union[pandas.DataFrame, numpy.ndarray], y: Union[pandas.Series, numpy.ndarray], noise: bool, n_knn: int = None, approx_knn: bool = False, n_mc: int = None, twin_mc: bool = False, rescale: bool = True, n_jobs: int = 1, random_state: Union[int, numpy.random.RandomState] = None) -> numpy.ndarray Estimating Total Sobol' Indices from Data `TotalSobolKNN` provides consistent estimation of total Sobol' indices (Sobol', 2001) directly from scattered data. When the responses are noiseless (`noise=False`), it implements the Nearest-Neighbor estimator in Broto et al. (2020). When the responses are noisy (`noise=True`), it implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2024). :param X: A pd.DataFrame or np.ndarray for the factors / predictors. :type X: pd.DataFrame or np.ndarray :param y: A pd.Series or np.ndarray for the responses. :type y: pd.Series or np.ndarray :param noise: A logical indicating whether the responses are noisy. :type noise: bool :param n_knn: The number of nearest-neighbor for the inner loop conditional variance estimation. `n_knn=2` is recommended for regression, and `n_knn=3` for binary classification. :type n_knn: int, default=None :param approx_knn: A logical indicating whether to use approximate nearest-neighbor search, otherwise exact search is used. It is supported when there are at least 10,000 data instances. :type approx_knn: bool, default=False :param n_mc: The number of Monte Carlo samples for the outer loop expectation estimation. :type n_mc: int, default=None :param twin_mc: A logical indicating whether to use twinning subsamples, otherwise random subsamples are used. It is supported when the reduction ratio is at least 2. :type twin_mc: bool, default=False :param rescale: A logical indicating whether to standardize the factors / predictors. :type rescale: bool, default=True :param n_jobs: The number of jobs to run in parallel. `n_jobs=-1` means using all processors. :type n_jobs: int, default=1 :param random_state: A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples. :type random_state: int or RandomState instance, default=None :returns: A numeric vector for the total Sobol' indices estimation. :rtype: np.ndarray .. rubric:: Notes `Faiss` (Douze et al., 2024) is used for efficient nearest-neighbor search, with the approximate search (`approx_knn=True`) by the inverted file index (IVF). IVF reduces the search scope through first clustering data into Voronoi cells. To further accelerate, we also support the use of subsamples by specifying `n_mc`. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples provide better approximation for the full data. .. rubric:: References Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics. Sobol', I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280. Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716. Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281. Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610. .. py:function:: ShapleySobolKNN(X: Union[pandas.DataFrame, numpy.ndarray], y: Union[pandas.Series, numpy.ndarray], noise: bool, n_knn: int = None, approx_knn: bool = False, n_mc: int = None, twin_mc: bool = False, rescale: bool = True, n_jobs: int = 1, random_state: Union[int, numpy.random.RandomState] = None) -> numpy.ndarray Estimating Shapley Sobol' Effect from Data `ShapleySobolKNN` provides consistent estimation of Shapley Sobol' Effect (Owen, 2014; Song et al., 2016) directly from scattered data. When the responses are noiseless (`noise=False`), it implements the Nearest-Neighbor estimator in Broto et al. (2020). When the responses are noisy (`noise=True`), it implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2024). :param X: A pd.DataFrame or np.ndarray for the factors / predictors. :type X: pd.DataFrame or np.ndarray :param y: A pd.Series or np.ndarray for the responses. :type y: pd.Series or np.ndarray :param noise: A logical indicating whether the responses are noisy. :type noise: bool :param n_knn: The number of nearest-neighbor for the inner loop conditional variance estimation. `n_knn=2` is recommended for regression, and `n_knn=3` for binary classification. :type n_knn: int, default=None :param approx_knn: A logical indicating whether to use approximate nearest-neighbor search, otherwise exact search is used. It is supported when there are at least 10,000 data instances. :type approx_knn: bool, default=False :param n_mc: The number of Monte Carlo samples for the outer loop expectation estimation. :type n_mc: int, default=None :param twin_mc: A logical indicating whether to use twinning subsamples, otherwise random subsamples are used. It is supported when the reduction ratio is at least 2. :type twin_mc: bool, default=False :param rescale: A logical indicating whether to standardize the factors / predictors. :type rescale: bool, default=True :param n_jobs: The number of jobs to run in parallel. `n_jobs=-1` means using all processors. :type n_jobs: int, default=1 :param random_state: A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples. :type random_state: int or RandomState instance, default=None :returns: A numeric vector for the Shapley Sobol' effect estimation. :rtype: np.ndarray .. rubric:: Notes `Faiss` (Douze et al., 2024) is used for efficient nearest-neighbor search, with the approximate search (`approx_knn=True`) by the inverted file index (IVF). IVF reduces the search scope through first clustering data into Voronoi cells. To further accelerate, we also support the use of subsamples by specifying `n_mc`. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples provide better approximation for the full data. .. rubric:: References Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics. Owen, A. B. (2014), “Sobol’indices and Shapley value,” SIAM/ASA Journal on Uncertainty Quantification, 2, 245–251. Song, E., Nelson, B. L., & Staum, J. (2016), “Shapley effects for global sensitivity analysis: Theory and computation,” SIAM/ASA Journal on Uncertainty Quantification, 4, 1060-1083. Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716. Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281. Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610. .. py:function:: FIRSTRank(X: Union[pandas.DataFrame, numpy.ndarray], y: Union[pandas.Series, numpy.ndarray], noise: bool, n_knn: int = None, approx_knn: bool = False, n_mc: int = None, twin_mc: bool = False, rescale: bool = True, n_jobs: int = 1, random_state: Union[int, numpy.random.RandomState] = None) -> Dict Provides Factor Ranking via Cumulative Variance that can be explained. `FIRSTRank` provides factor ranking directly from scattered data via cumulative variance that can be explained. See Huang and Joseph (2025) for details. :param X: A pd.DataFrame or np.ndarray for the factors / predictors. :type X: pd.DataFrame or np.ndarray :param y: A pd.Series or np.ndarray for the responses. :type y: pd.Series or np.ndarray :param noise: A logical indicating whether the responses are noisy. :type noise: bool :param n_knn: The number of nearest-neighbor for the inner loop conditional variance estimation. `n_knn=2` is recommended for regression, and `n_knn=3` for binary classification. :type n_knn: int, default=None :param approx_knn: A logical indicating whether to use approximate nearest-neighbor search, otherwise exact search is used. It is supported when there are at least 10,000 data instances. :type approx_knn: bool, default=False :param n_mc: The number of Monte Carlo samples for the outer loop expectation estimation. :type n_mc: int, default=None :param twin_mc: A logical indicating whether to use twinning subsamples, otherwise random subsamples are used. It is supported when the reduction ratio is at least 2. :type twin_mc: bool, default=False :param rescale: A logical indicating whether to standardize the factors / predictors. :type rescale: bool, default=True :param n_jobs: The number of jobs to run in parallel. `n_jobs=-1` means using all processors. :type n_jobs: int, default=1 :param random_state: A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples. :type random_state: int or RandomState instance, default=None :returns: - "ranking": np.ndarray Indices of the factors, ordered from most to least important, based on cumulative variance that can be explained. - "explained_variance": np.ndarray Cumulative variance explained by the factors in the ranking order. :rtype: A dictionary with the following keys .. rubric:: Notes `Faiss` (Douze et al., 2024) is used for efficient nearest-neighbor search, with the approximate search (`approx_knn=True`) by the inverted file index (IVF). IVF reduces the search scope through first clustering data into Voronoi cells. To further accelerate, we also support the use of subsamples by specifying `n_mc`. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples provide better approximation for the full data. .. rubric:: References Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics. Sobol', I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280. Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716. Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281. Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610. .. py:function:: FIRST(X: Union[pandas.DataFrame, numpy.ndarray], y: Union[pandas.Series, numpy.ndarray], n_knn: int = None, approx_knn: bool = False, n_mc: int = None, twin_mc: bool = False, rescale: bool = True, n_forward: int = 2, n_jobs: int = 1, random_state: Union[int, numpy.random.RandomState] = None, return_option: str = 'selection', verbose: bool = False) -> numpy.ndarray Factor Importance Ranking and Selection using Total (Sobol') indices `FIRST` provides factor importance ranking and selection directly from scattered data without any model fitting. `FIRST` is a model-independent factor importance ranking and selection algorithm proposed in Huang and Joseph (2024). Factor importance is computed based on total Sobol' indices (Sobol', 2021), which is connected to the approximation error introduced by excluding the factor of interest (Huang and Joseph, 2024). The estimation procedure adapts from the Nearest-Neighbor estimator in Broto et al. (2020) to account for the noisy data. Integrating it with forward selection and backward elimination allows for factor selection. :param X: A pd.DataFrame or np.ndarray for the factors / predictors. :type X: pd.DataFrame or np.ndarray :param y: A pd.Series or np.ndarray for the responses. :type y: pd.Series or np.ndarray :param n_knn: The number of nearest-neighbor for the inner loop conditional variance estimation. `n_knn=2` is recommended for regression, and `n_knn=3` for binary classification. :type n_knn: int, default=None :param approx_knn: A logical indicating whether to use approximate nearest-neighbor search, otherwise exact search is used. It is supported when there are at least 10,000 data instances. :type approx_knn: bool, default=False :param n_mc: The number of Monte Carlo samples for the outer loop expectation estimation. :type n_mc: int, default=None :param twin_mc: A logical indicating whether to use twinning subsamples, otherwise random subsamples are used. It is supported when the reduction ratio is at least 2. :type twin_mc: bool, default=False :param rescale: A logical indicating whether to standardize the factors / predictors. :type rescale: bool, default=True :param n_forward: The number of times to run the forward selection phase to tradeoff between efficiency and accuracy. `n_forward=2` is recommended. To run the complete forward selection, please set `n_forward` to the number of factors / predictors. :type n_forward: int, default=2 :param n_jobs: The number of jobs to run in parallel. `n_jobs=-1` means using all processors. :type n_jobs: int, default=1 :param random_state: A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples. :type random_state: int or RandomState instance, default=None :param return_option: The options for the output of `FIRST`. Default is `selection`, which returns a binary vector indicating the factor selection, with `True` for selected variables and `False` otherwise. Other options include: (i) `total`, which return a numeric vector for total Sobol' effect and (ii) `shapley`, which return a numeric vector for Shapley Sobol' effect. See Huang and Joseph (2024) for details. :type return_option: str, default='selection' :param verbose: A logical indicating whether to display intermediate results, e.g., the selected factor from each iteration. :type verbose: bool, default=False :returns: A numeric vector for the factor selection, importance, or ranking depending on the value of the `return_option` argument. :rtype: np.ndarray .. rubric:: Notes `FIRST` belongs to the class of forward-backward selection with early dropping algorithm (Borboudakis and Tsamardinos, 2019). In forward selection, each time we find the candidate that maximizes the output variance that can be explained. For candidates that cannot improve the variance explained conditional on the selected factors, they are removed from the candidate set. This forward selection step is run `n_forward` times to tradeoff between accuracy and efficiency. `n_forward = 2` is recommended in Yu et al. (2020). To run the complete forward selection, please set `n_forward` to the number of factors / predictors. In backward elimination, we again remove one factor at a time, starting with the factor that can improve the explained variance most, till no factor can further improve. `Faiss` (Douze et al., 2024) is used for efficient nearest-neighbor search, with the approximate search (`approx_knn=True`) by the inverted file index (IVF). IVF reduces the search scope through first clustering data into Voronoi cells. To further accelerate, we also support the use of subsamples by specifying `n_mc`. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples provide better approximation for the full data. For more details about `FIRST`, please see Huang and Joseph (2024). .. rubric:: References Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics. Sobol', I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280. Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716. Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1), 276-314. Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., & Wu, X. (2020). Causality-based feature selection: Methods and evaluations. ACM Computing Surveys (CSUR), 53(5), 1-36. Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281. Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610. .. py:class:: SelectByFIRST(regression: bool = True, n_knn: int = None, approx_knn: bool = False, rescale: bool = True, n_forward: int = 2, n_jobs: int = 1, random_state: Union[int, numpy.random.RandomState] = None, importance_option: str = 'total', verbose: bool = False) Bases: :py:obj:`sklearn.feature_selection._base.SelectorMixin`, :py:obj:`sklearn.base.BaseEstimator` Feature selector using FIRST This implements the feature selector class for FIRST (Huang and Joseph, 2024), a model-independent feature selection algorithm based on total Sobol' indices (Sobol', 2001). :param regression: A logical indicating whether the feature selector is for a regression or classification problem. :type regression: bool, default=True :param n_knn: The number of nearest-neighbor for the inner loop conditional variance estimation. `n_knn=2` is recommended for regression, and `n_knn=3` for binary classification. :type n_knn: int, default=None :param approx_knn: A logical indicating whether to use approximate nearest-neighbor search, otherwise exact search is used. It is supported when there are at least 10,000 data instances. :type approx_knn: bool, default=False :param rescale: A logical indicating whether to standardize the factors / predictors. :type rescale: bool, default=True :param n_forward: The number of times to run the forward selection phase to tradeoff between efficiency and accuracy. `n_forward=2` is recommended. :type n_forward: int, default=2 :param n_jobs: The number of jobs to run in parallel. `n_jobs=-1` means using all processors. :type n_jobs: int, default=1 :param random_state: A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples. :type random_state: int or RandomState instance, default=None :param importance_option: The options for the feature importance. Default is `total`, which returns a numeric vector for total Sobol' effect and `shapley` for Shapley Sobol' effect. See Huang and Joseph (2024) for details. :type importance_option: str, default='total' :param verbose: A logical indicating whether to display intermediate results, e.g., the selected factor from each iteration. :type verbose: default = False .. attribute:: importance_ Factor importance, with zero indicating that the factor is not important for predicting the response. :type: np.ndarray .. rubric:: Notes FIRST belongs to the class of forward-backward selection with early dropping algorithm (Borboudakis and Tsamardinos, 2019). In forward selection, each time we find the candidate that maximizes the output variance that can be explained. For candidates that cannot improve the variance explained conditional on the selected factors, they are removed from the candidate set. This forward selection step is run `n_forward` times to tradeoff between accuracy and efficiency. `n_forward = 2` is recommended in (Yu et al., 2020). In backward elimination, we again remove one factor at a time, starting with the factor that can improve the explained variance most, till no factor can further improve. The estimation of the importance is via an adaptation of the Nearest-Neighbor estimator of Broto et al. (2020) for the total Sobol' indices. `Faiss` (Douze et al., 2024) is used for efficient nearest-neighbor search, with the approximate search (`approx_knn=True`) by the inverted file index (IVF). IVF reduces the search scope through first clustering data into Voronoi cells. For more details about FIRST, please see Huang and Joseph (2024). .. rubric:: References Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics. Sobol', I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280. Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716. Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1), 276-314. Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., & Wu, X. (2020). Causality-based feature selection: Methods and evaluations. ACM Computing Surveys (CSUR), 53(5), 1-36. Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281. .. py:attribute:: regression :value: True .. py:attribute:: n_knn :value: 2 .. py:attribute:: approx_knn :value: False .. py:attribute:: rescale :value: True .. py:attribute:: n_forward :value: 2 .. py:attribute:: n_jobs :value: 1 .. py:attribute:: random_state .. py:attribute:: verbose :value: False .. py:attribute:: importance_option :value: 'total' .. py:method:: fit(X: Union[pandas.DataFrame, numpy.ndarray], y: Union[pandas.Series, numpy.ndarray], n_mc: int = None, twin_mc: bool = False) Compute the factor importance from data :param X: A pd.DataFrame or np.ndarray for the factors / predictors. :type X: pd.DataFrame or np.ndarray :param y: A pd.Series or np.ndarray for the responses. :type y: pd.Series or np.ndarray :param n_mc: The number of Monte Carlo samples for the outer loop expectation estimation. :type n_mc: int, default=None :param twin_mc: A logical indicating whether to use twinning subsamples, otherwise random subsamples are used. It is supported when the reduction ratio is at least 2. :type twin_mc: bool, default=False :returns: Returns the instance itself. :rtype: object .. rubric:: Notes To further accelerate the importance computation, we support the use of subsamples by specifying `n_mc`. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples provide better approximation for the full data. .. rubric:: References Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610. .. py:method:: get_feature_importance() Get the feature importance :returns: A numeric vector for the factor importance, with zero indicating that the factor is not important for predicting the response. :rtype: np.ndarray .. py:method:: _get_support_mask() Get the boolean mask indicating which features are selected :returns: A boolean vector with True indicating the feature is selected. :rtype: np.ndarray