pyfirst
Classes
SelectByFIRST | Feature selector using FIRST
Functions
TotalSobolKNN | Estimating Total Sobol’ Indices from Data
FIRST | Factor Importance Ranking and Selection using Total (Sobol’) indices
Package Contents
- pyfirst.TotalSobolKNN(X: pandas.DataFrame | numpy.ndarray, y: pandas.Series | numpy.ndarray, noise: bool, n_knn: int = None, approx_knn: bool = False, n_mc: int = None, twin_mc: bool = False, rescale: bool = True, n_jobs: int = 1, random_state: int | numpy.random.RandomState = None) -> numpy.ndarray
Estimating Total Sobol’ Indices from Data
TotalSobolKNN provides consistent estimation of total Sobol’ indices (Sobol’, 2001) directly from scattered data. When the responses are noiseless (noise=False), it implements the Nearest-Neighbor estimator in Broto et al. (2020). When the responses are noisy (noise=True), it implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2024).
- Parameters:
X (pd.DataFrame or np.ndarray) – A pd.DataFrame or np.ndarray for the factors / predictors.
y (pd.Series or np.ndarray) – A pd.Series or np.ndarray for the responses.
noise (bool) – A logical indicating whether the responses are noisy.
n_knn (int, default=None) – The number of nearest neighbors for the inner loop conditional variance estimation. n_knn=2 is recommended for regression, and n_knn=3 for binary classification.
approx_knn (bool, default=False) – A logical indicating whether to use approximate nearest-neighbor search; exact search is used otherwise. Approximate search is supported only when there are at least 10,000 data instances.
n_mc (int, default=None) – The number of Monte Carlo samples for the outer loop expectation estimation.
twin_mc (bool, default=False) – A logical indicating whether to use twinning subsamples, otherwise random subsamples are used. It is supported when the reduction ratio is at least 2.
rescale (bool, default=True) – A logical indicating whether to standardize the factors / predictors.
n_jobs (int, default=1) – The number of jobs to run in parallel. n_jobs=-1 means using all processors.
random_state (int or RandomState instance, default=None) – A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples.
- Returns:
A numeric vector for the total Sobol’ indices estimation.
- Return type:
np.ndarray
Notes
Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with approximate search (approx_knn=True) performed by an inverted file index (IVF). IVF reduces the search scope by first clustering the data into Voronoi cells. To further accelerate the computation, subsamples can be used by specifying n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available; twinning subsamples provide a better approximation of the full data.
References
Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.
Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55(1-3), 271-280.
Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H. (2024). The Faiss library. arXiv preprint arXiv:2401.08281.
Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.
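Examples
The nearest-neighbor idea described above can be illustrated with a minimal NumPy sketch for the noiseless case. This is not the pyfirst implementation: the function name total_sobol_nn is hypothetical, the search is brute force rather than Faiss, and the noise adjustment of Huang and Joseph (2024) is omitted. For each factor, the conditional variance of y given all other factors is estimated from the n_knn nearest neighbors in the space of the remaining factors (inner loop), averaged over all points (outer loop), and divided by the sample variance of y.

```python
import numpy as np


def total_sobol_nn(X, y, n_knn=2):
    """Brute-force nearest-neighbor estimate of total Sobol' indices (noiseless sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Standardize the factors so distances are comparable across columns.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    total = np.empty(p)
    for i in range(p):
        rest = np.delete(Z, i, axis=1)  # all factors except factor i
        # Squared Euclidean distances between every pair of rows (O(n^2) memory).
        sq = (rest ** 2).sum(axis=1)
        dist = sq[:, None] + sq[None, :] - 2.0 * rest @ rest.T
        np.fill_diagonal(dist, -1.0)  # make each point its own first neighbor
        # Indices of the n_knn nearest rows, the point itself included.
        nbrs = np.argpartition(dist, n_knn - 1, axis=1)[:, :n_knn]
        # Inner loop: sample variance of y among the neighbors.
        cond_var = y[nbrs].var(axis=1, ddof=1)
        # Outer loop: average over all points, then normalize by Var(y).
        total[i] = cond_var.mean() / y.var(ddof=1)
    return total


rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 3))
y = X[:, 0] + 2.0 * X[:, 1]  # the third factor is inert
indices = total_sobol_nn(X, y)
```

For this additive test function with independent uniform inputs, the analytical total indices are 0.2, 0.8 and 0, so the estimates should land close to those values.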
- pyfirst.FIRST(X: pandas.DataFrame | numpy.ndarray, y: pandas.Series | numpy.ndarray, n_knn: int = None, approx_knn: bool = False, n_mc: int = None, twin_mc: bool = False, rescale: bool = True, n_forward: int = 2, n_jobs: int = 1, random_state: int | numpy.random.RandomState = None, verbose: bool = False) -> numpy.ndarray
Factor Importance Ranking and Selection using Total (Sobol’) indices
FIRST provides factor importance ranking and selection directly from scattered data without any model fitting.
FIRST is a model-independent factor importance ranking and selection algorithm proposed in Huang and Joseph (2024). Factor importance is computed from total Sobol’ indices (Sobol’, 2001), which are connected to the approximation error introduced by excluding the factor of interest (Huang and Joseph, 2024). The estimation procedure adapts the Nearest-Neighbor estimator in Broto et al. (2020) to account for noisy data. Integrating it with forward selection and backward elimination allows for factor selection.
- Parameters:
X (pd.DataFrame or np.ndarray) – A pd.DataFrame or np.ndarray for the factors / predictors.
y (pd.Series or np.ndarray) – A pd.Series or np.ndarray for the responses.
n_knn (int, default=None) – The number of nearest neighbors for the inner loop conditional variance estimation. n_knn=2 is recommended for regression, and n_knn=3 for binary classification.
approx_knn (bool, default=False) – A logical indicating whether to use approximate nearest-neighbor search; exact search is used otherwise. Approximate search is supported only when there are at least 10,000 data instances.
n_mc (int, default=None) – The number of Monte Carlo samples for the outer loop expectation estimation.
twin_mc (bool, default=False) – A logical indicating whether to use twinning subsamples, otherwise random subsamples are used. It is supported when the reduction ratio is at least 2.
rescale (bool, default=True) – A logical indicating whether to standardize the factors / predictors.
n_forward (int, default=2) – The number of times to run the forward selection phase, trading off efficiency against accuracy. n_forward=2 is recommended. To run the complete forward selection, set n_forward to the number of factors / predictors.
n_jobs (int, default=1) – The number of jobs to run in parallel. n_jobs=-1 means using all processors.
random_state (int or RandomState instance, default=None) – A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples.
verbose (bool, default=False) – A logical indicating whether to display intermediate results, e.g., the selected factor from each iteration.
- Returns:
A numeric vector for the factor importance, with zero indicating that the factor is not important for predicting the response.
- Return type:
np.ndarray
Notes
FIRST belongs to the class of forward-backward selection algorithms with early dropping (Borboudakis and Tsamardinos, 2019). In each forward selection step, we select the candidate that maximizes the explained output variance. Candidates that cannot improve the explained variance conditional on the already selected factors are removed from the candidate set. The forward selection phase is run n_forward times to trade off accuracy against efficiency; n_forward=2 is recommended in Yu et al. (2020). To run the complete forward selection, set n_forward to the number of factors / predictors. In backward elimination, we remove one factor at a time, starting with the factor whose removal improves the explained variance the most, until no removal yields further improvement.
Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with approximate search (approx_knn=True) performed by an inverted file index (IVF). IVF reduces the search scope by first clustering the data into Voronoi cells. To further accelerate the computation, subsamples can be used by specifying n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available; twinning subsamples provide a better approximation of the full data.
For more details about FIRST, please see Huang and Joseph (2024).
References
Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.
Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55(1-3), 271-280.
Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.
Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1), 276-314.
Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., & Wu, X. (2020). Causality-based feature selection: Methods and evaluations. ACM Computing Surveys (CSUR), 53(5), 1-36.
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H. (2024). The Faiss library. arXiv preprint arXiv:2401.08281.
Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.
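Examples
The forward-backward procedure with early dropping described in the Notes can be sketched in plain NumPy as follows. This is a simplified illustration under stated assumptions, not the pyfirst implementation: the helper names unexplained_variance and forward_backward_select are hypothetical, the selection criterion is a brute-force nearest-neighbor estimate of the variance of y left unexplained by a subset of factors, and standardization, Faiss and subsampling are omitted.

```python
import numpy as np


def unexplained_variance(X, y, subset, n_knn=2):
    """Nearest-neighbor estimate of E[Var(y | X_subset)]; Var(y) for an empty subset."""
    if len(subset) == 0:
        return y.var(ddof=1)
    Z = X[:, sorted(subset)]
    sq = (Z ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    np.fill_diagonal(dist, -1.0)  # make each point its own first neighbor
    nbrs = np.argpartition(dist, n_knn - 1, axis=1)[:, :n_knn]
    return y[nbrs].var(axis=1, ddof=1).mean()


def forward_backward_select(X, y, n_forward=2, n_knn=2):
    """Return the sorted indices of the selected factors."""
    p = X.shape[1]
    selected = []
    best = unexplained_variance(X, y, selected, n_knn)
    for _ in range(n_forward):
        # Each forward run restarts from all factors not yet selected.
        candidates = [j for j in range(p) if j not in selected]
        while candidates:
            scores = {j: unexplained_variance(X, y, selected + [j], n_knn)
                      for j in candidates}
            # Early dropping: discard candidates that do not improve the criterion.
            candidates = [j for j in candidates if scores[j] < best]
            if not candidates:
                break
            winner = min(candidates, key=scores.get)
            selected.append(winner)
            best = scores[winner]
            candidates.remove(winner)
    # Backward elimination: drop the factor whose removal helps most, if any.
    while selected:
        scores = {j: unexplained_variance(X, y, [k for k in selected if k != j], n_knn)
                  for j in selected}
        loser = min(selected, key=scores.get)
        if scores[loser] >= best:
            break
        selected.remove(loser)
        best = scores[loser]
    return sorted(selected)


rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.05, size=1000)
selected = forward_backward_select(X, y)
```

In this toy problem only the first two columns drive the response. Adding an inert column spreads the nearest neighbors further apart in the relevant directions, which raises the estimated unexplained variance, so inert candidates are dropped.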
- class pyfirst.SelectByFIRST(regression: bool = True, n_knn: int = None, approx_knn: bool = False, rescale: bool = True, n_forward: int = 2, n_jobs: int = 1, random_state: int | numpy.random.RandomState = None, verbose: bool = False)
Bases: sklearn.feature_selection._base.SelectorMixin, sklearn.base.BaseEstimator
Feature selector using FIRST
This implements the feature selector class for FIRST (Huang and Joseph, 2024), a model-independent feature selection algorithm based on total Sobol’ indices (Sobol’, 2001).
- Parameters:
regression (bool, default=True) – A logical indicating whether the feature selector is for a regression or classification problem.
n_knn (int, default=None) – The number of nearest neighbors for the inner loop conditional variance estimation. n_knn=2 is recommended for regression, and n_knn=3 for binary classification.
approx_knn (bool, default=False) – A logical indicating whether to use approximate nearest-neighbor search; exact search is used otherwise. Approximate search is supported only when there are at least 10,000 data instances.
rescale (bool, default=True) – A logical indicating whether to standardize the factors / predictors.
n_forward (int, default=2) – The number of times to run the forward selection phase, trading off efficiency against accuracy. n_forward=2 is recommended.
n_jobs (int, default=1) – The number of jobs to run in parallel. n_jobs=-1 means using all processors.
random_state (int or RandomState instance, default=None) – A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples.
verbose (bool, default=False) – A logical indicating whether to display intermediate results, e.g., the selected factor from each iteration.
- importance_
Factor importance, with zero indicating that the factor is not important for predicting the response.
- Type:
np.ndarray
Notes
FIRST belongs to the class of forward-backward selection algorithms with early dropping (Borboudakis and Tsamardinos, 2019). In each forward selection step, we select the candidate that maximizes the explained output variance. Candidates that cannot improve the explained variance conditional on the already selected factors are removed from the candidate set. The forward selection phase is run n_forward times to trade off accuracy against efficiency; n_forward=2 is recommended in Yu et al. (2020). In backward elimination, we remove one factor at a time, starting with the factor whose removal improves the explained variance the most, until no removal yields further improvement.
The importance is estimated by an adaptation of the Nearest-Neighbor estimator of Broto et al. (2020) for total Sobol’ indices. Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with approximate search (approx_knn=True) performed by an inverted file index (IVF). IVF reduces the search scope by first clustering the data into Voronoi cells.
For more details about FIRST, please see Huang and Joseph (2024).
References
Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.
Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55(1-3), 271-280.
Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.
Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1), 276-314.
Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., & Wu, X. (2020). Causality-based feature selection: Methods and evaluations. ACM Computing Surveys (CSUR), 53(5), 1-36.
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H. (2024). The Faiss library. arXiv preprint arXiv:2401.08281.
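Examples
A minimal usage sketch, assuming pyfirst is installed. The constructor arguments and the importance_ attribute follow the signature documented above. The transform call is assumed to be inherited from scikit-learn's SelectorMixin. The helpers make_toy_data and select_with_first, and the toy data themselves, are hypothetical and serve only as illustration.

```python
import numpy as np


def make_toy_data(n=500, p=6, seed=0):
    """Toy regression data: only the first two columns drive the response."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))
    y = np.sin(np.pi * X[:, 0]) + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
    return X, y


def select_with_first(X, y):
    """Fit SelectByFIRST and return (importance, reduced X). Requires pyfirst."""
    # Imported lazily so the data helper above can be used without pyfirst.
    from pyfirst import SelectByFIRST

    selector = SelectByFIRST(regression=True, n_knn=2, random_state=0)
    selector.fit(X, y)
    # transform is assumed to come from scikit-learn's SelectorMixin and to
    # keep the columns that FIRST selected.
    return selector.importance_, selector.transform(X)


X, y = make_toy_data()
```

Calling select_with_first(X, y) returns the importance vector together with the reduced design matrix. Because the class derives from BaseEstimator and SelectorMixin, it should also be usable as a step in a scikit-learn Pipeline ahead of a downstream estimator.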
- n_knn
- approx_knn
- regression
- rescale
- n_forward
- n_jobs
- random_state
- verbose
- fit(X: pandas.DataFrame | numpy.ndarray, y: pandas.Series | numpy.ndarray, n_mc: int = None, twin_mc: bool = False)
Compute the factor importance from data
- Parameters:
X (pd.DataFrame or np.ndarray) – A pd.DataFrame or np.ndarray for the factors / predictors.
y (pd.Series or np.ndarray) – A pd.Series or np.ndarray for the responses.
n_mc (int, default=None) – The number of Monte Carlo samples for the outer loop expectation estimation.
twin_mc (bool, default=False) – A logical indicating whether to use twinning subsamples, otherwise random subsamples are used. It is supported when the reduction ratio is at least 2.
- Returns:
Returns the instance itself.
- Return type:
object
Notes
To further accelerate the importance computation, subsamples can be used by specifying n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available; twinning subsamples provide a better approximation of the full data.
References
Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.
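Examples
The effect of n_mc can be illustrated with a small NumPy sketch of random subsampling in the outer loop. It assumes that the conditional variance is evaluated only at n_mc randomly chosen query points while the neighbors are still searched in the full data. This is an interpretation of the parameter descriptions above, not a statement about pyfirst internals. The function name cond_var_subsample is hypothetical, and twinning subsamples are not covered.

```python
import numpy as np


def cond_var_subsample(X_rest, y, n_knn=2, n_mc=None, seed=0):
    """Estimate E[Var(y | X_rest)], averaging over n_mc random query points."""
    n = X_rest.shape[0]
    rng = np.random.default_rng(seed)
    # Outer loop: query points are a random subsample (all rows if n_mc is None).
    query = np.arange(n) if n_mc is None else rng.choice(n, size=n_mc, replace=False)
    Q = X_rest[query]
    # Squared distances from each query point to every row of the full data.
    dist = ((Q ** 2).sum(axis=1)[:, None]
            + (X_rest ** 2).sum(axis=1)[None, :]
            - 2.0 * Q @ X_rest.T)
    dist[np.arange(len(query)), query] = -1.0  # each query point is its own first neighbor
    # Inner loop: variance of y among the n_knn nearest rows of the full data.
    nbrs = np.argpartition(dist, n_knn - 1, axis=1)[:, :n_knn]
    return y[nbrs].var(axis=1, ddof=1).mean()


rng = np.random.default_rng(1)
X = rng.uniform(size=(2000, 2))
y = X[:, 0] + 2.0 * X[:, 1]
# Conditioning on the second factor only leaves Var(x1) = 1/12 unexplained.
full = cond_var_subsample(X[:, [1]], y)
sub = cond_var_subsample(X[:, [1]], y, n_mc=500)
```

With a quarter of the query points, the distance computation shrinks proportionally, at the cost of a noisier average. Both estimates should stay near the analytical value 1/12.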