pyfirst

Classes

SelectByFIRST

Feature selector using FIRST

Functions

`TotalSobolKNN`(→ numpy.ndarray)	Estimating Total Sobol' Indices from Data
`FIRST`(→ numpy.ndarray)	Factor Importance Ranking and Selection using Total (Sobol') indices

Package Contents

pyfirst.TotalSobolKNN(X: pandas.DataFrame | numpy.ndarray, y: pandas.Series | numpy.ndarray, noise: bool, n_knn: int = None, approx_knn: bool = False, n_mc: int = None, twin_mc: bool = False, rescale: bool = True, n_jobs: int = 1, random_state: int | numpy.random.RandomState = None) → numpy.ndarray

Estimating Total Sobol’ Indices from Data

TotalSobolKNN provides consistent estimation of total Sobol’ indices (Sobol’, 2001) directly from scattered data. When the responses are noiseless (noise=False), it implements the Nearest-Neighbor estimator in Broto et al. (2020). When the responses are noisy (noise=True), it implements the Noise-Adjusted Nearest-Neighbor estimator in Huang and Joseph (2024).

Parameters:

X (pd.DataFrame or np.ndarray) – A pd.DataFrame or np.ndarray for the factors / predictors.
y (pd.Series or np.ndarray) – A pd.Series or np.ndarray for the responses.
noise (bool) – A logical indicating whether the responses are noisy.
n_knn (int, default=None) – The number of nearest neighbors for the inner-loop conditional variance estimation. n_knn=2 is recommended for regression, and n_knn=3 for binary classification.
approx_knn (bool, default=False) – A logical indicating whether to use approximate nearest-neighbor search; otherwise exact search is used. It is supported when there are at least 10,000 data instances.
n_mc (int, default=None) – The number of Monte Carlo samples for the outer-loop expectation estimation.
twin_mc (bool, default=False) – A logical indicating whether to use twinning subsamples; otherwise random subsamples are used. It is supported when the reduction ratio is at least 2.
rescale (bool, default=True) – A logical indicating whether to standardize the factors / predictors.
n_jobs (int, default=1) – The number of jobs to run in parallel. n_jobs=-1 means using all processors.
random_state (int or RandomState instance, default=None) – A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples.

Returns:

A numeric vector of the estimated total Sobol’ indices, one per factor.

Return type:

np.ndarray
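The noiseless nearest-neighbor idea described above can be illustrated with a small standard-library sketch. It is not the library's implementation: it assumes the estimator averages, over all data points, the sample variance of y among the n_knn nearest points (the point itself included) in the space of all factors except the one of interest, and then divides by the sample variance of y. The function name total_sobol_nn and the toy data are hypothetical, and rescaling, tie-breaking, and the noise adjustment are left out.

```python
import random
from statistics import variance


def total_sobol_nn(X, y, n_knn=2):
    """Brute-force nearest-neighbor estimate of each factor's total Sobol' index.

    X is a list of rows and y a list of noiseless responses (illustration only).
    """
    n, p = len(X), len(X[0])
    var_y = variance(y)
    indices = []
    for i in range(p):
        keep = [j for j in range(p) if j != i]  # every factor except factor i
        total = 0.0
        for l in range(n):
            # Rank all points by squared distance to point l in the space of
            # the kept factors; point l itself comes first (distance 0).
            ranked = sorted(
                range(n),
                key=lambda m: sum((X[l][j] - X[m][j]) ** 2 for j in keep),
            )
            # Inner loop: variance of y among the n_knn nearest points.
            total += variance([y[m] for m in ranked[:n_knn]])
        # Outer loop: average over all points, then normalize by Var(y).
        indices.append(total / n / var_y)
    return indices


rng = random.Random(0)
X = [[rng.random() for _ in range(3)] for _ in range(300)]
y = [x[0] + 2.0 * x[1] for x in X]  # the third factor is inactive
indices = total_sobol_nn(X, y)
print([round(t, 2) for t in indices])
```

For this toy function with independent uniform factors, the exact total indices are 0.2, 0.8, and 0, so the estimates should rank the second factor first and leave the third close to zero.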

Notes

Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with the approximate search (approx_knn=True) performed by the inverted file index (IVF). IVF reduces the search scope by first clustering the data into Voronoi cells. To accelerate the computation further, subsamples can be used by specifying n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples provide a better approximation of the full data.
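The inverted-file idea can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use the Faiss API: the helper names build_ivf and ivf_search, the n_probe argument, and the use of randomly chosen data points as cell centers (an actual IVF index trains its cell centers by clustering) are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def build_ivf(points, n_cells, rng):
    """Assign every point to the Voronoi cell of its nearest centroid."""
    # Crude centroids: random data points stand in for trained cluster centers.
    centroids = rng.sample(points, n_cells)
    cells = [[] for _ in range(n_cells)]
    for idx, pt in enumerate(points):
        nearest = min(range(n_cells), key=lambda c: sq_dist(pt, centroids[c]))
        cells[nearest].append(idx)
    return centroids, cells


def ivf_search(query, points, centroids, cells, n_probe=1):
    """Search only the n_probe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [idx for c in order[:n_probe] for idx in cells[c]]
    return min(candidates, key=lambda idx: sq_dist(query, points[idx]))


rng = random.Random(0)
points = [[rng.random(), rng.random()] for _ in range(500)]
centroids, cells = build_ivf(points, n_cells=10, rng=rng)
query = [0.3, 0.7]
approx = ivf_search(query, points, centroids, cells, n_probe=2)
exact = min(range(len(points)), key=lambda i: sq_dist(query, points[i]))
print(approx, exact)
```

Probing fewer cells scans fewer points but may miss the true nearest neighbor; probing every cell reduces to exact search.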

References

Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.

Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280.

Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.

Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.

pyfirst.FIRST(X: pandas.DataFrame | numpy.ndarray, y: pandas.Series | numpy.ndarray, n_knn: int = None, approx_knn: bool = False, n_mc: int = None, twin_mc: bool = False, rescale: bool = True, n_forward: int = 2, n_jobs: int = 1, random_state: int | numpy.random.RandomState = None, verbose: bool = False) → numpy.ndarray

Factor Importance Ranking and Selection using Total (Sobol’) indices

FIRST provides factor importance ranking and selection directly from scattered data without any model fitting.

FIRST is a model-independent factor importance ranking and selection algorithm proposed in Huang and Joseph (2024). Factor importance is computed from total Sobol’ indices (Sobol’, 2001), which are connected to the approximation error introduced by excluding the factor of interest (Huang and Joseph, 2024). The estimation procedure adapts the Nearest-Neighbor estimator in Broto et al. (2020) to account for noisy data. Integrating it with forward selection and backward elimination allows for factor selection.

Parameters:

X (pd.DataFrame or np.ndarray) – A pd.DataFrame or np.ndarray for the factors / predictors.
y (pd.Series or np.ndarray) – A pd.Series or np.ndarray for the responses.
n_knn (int, default=None) – The number of nearest neighbors for the inner-loop conditional variance estimation. n_knn=2 is recommended for regression, and n_knn=3 for binary classification.
approx_knn (bool, default=False) – A logical indicating whether to use approximate nearest-neighbor search; otherwise exact search is used. It is supported when there are at least 10,000 data instances.
n_mc (int, default=None) – The number of Monte Carlo samples for the outer-loop expectation estimation.
twin_mc (bool, default=False) – A logical indicating whether to use twinning subsamples; otherwise random subsamples are used. It is supported when the reduction ratio is at least 2.
rescale (bool, default=True) – A logical indicating whether to standardize the factors / predictors.
n_forward (int, default=2) – The number of times to run the forward selection phase, to trade off efficiency against accuracy. n_forward=2 is recommended. To run the complete forward selection, set n_forward to the number of factors / predictors.
n_jobs (int, default=1) – The number of jobs to run in parallel. n_jobs=-1 means using all processors.
random_state (int or RandomState instance, default=None) – A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples.
verbose (bool, default=False) – A logical indicating whether to display intermediate results, e.g., the selected factor from each iteration.

Returns:

A numeric vector for the factor importance, with zero indicating that the factor is not important for predicting the response.

Return type:

np.ndarray
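Because a zero entry marks an unimportant factor, the returned vector can be used for both selection and ranking. The short sketch below works on a hypothetical importance vector (the numbers are made up for illustration, not produced by FIRST) and uses plain Python, which also works when iterating over an np.ndarray.

```python
# Hypothetical importance values, standing in for the output of FIRST.
importance = [0.17, 0.0, 0.42, 0.0, 0.05]

# Selection: keep the factors whose importance is nonzero.
selected = [j for j, v in enumerate(importance) if v != 0]

# Ranking: order the selected factors from most to least important.
ranking = sorted(selected, key=lambda j: importance[j], reverse=True)

print(selected, ranking)
```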

Notes

FIRST belongs to the class of forward-backward selection algorithms with early dropping (Borboudakis and Tsamardinos, 2019). In each forward selection step, we select the candidate that maximizes the explained output variance. Candidates that cannot improve the variance explained conditional on the selected factors are removed from the candidate set. The forward selection phase is run n_forward times to trade off accuracy against efficiency; n_forward=2 is recommended in Yu et al. (2020). To run the complete forward selection, set n_forward to the number of factors / predictors. In backward elimination, we remove one factor at a time, starting with the factor whose removal improves the explained variance the most, until no removal yields further improvement.
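The control flow of forward-backward selection with early dropping can be sketched as follows. This is a generic illustration under stated assumptions rather than the pyfirst implementation: score is a hypothetical callable returning the explained variance of a factor subset (FIRST estimates this quantity from data with nearest neighbors), toy_score is a made-up example with a small per-factor penalty, and tol is an assumed threshold for what counts as an improvement.

```python
def forward_backward(score, n_factors, n_forward=2, tol=1e-12):
    """Forward-backward selection with early dropping over factor indices."""
    selected = []
    best = score(selected)
    for _ in range(n_forward):
        # Each forward run restarts with every factor that is not yet selected.
        candidates = [j for j in range(n_factors) if j not in selected]
        while candidates:
            gains = {j: score(selected + [j]) - best for j in candidates}
            # Early dropping: discard candidates that bring no improvement.
            candidates = [j for j in candidates if gains[j] > tol]
            if not candidates:
                break
            winner = max(candidates, key=gains.get)
            selected.append(winner)
            candidates.remove(winner)
            best = score(selected)
    # Backward elimination: remove factors while a removal improves the score.
    while selected:
        gains = {j: score([k for k in selected if k != j]) - best for j in selected}
        worst = max(selected, key=gains.get)
        if gains[worst] <= tol:
            break
        selected.remove(worst)
        best = score(selected)
    return selected


def toy_score(subset):
    """Made-up explained variance: factor 3 only helps once factor 1 is present."""
    s = set(subset)
    value = 0.5 * (0 in s) + 0.3 * (1 in s) + 0.1 * (1 in s and 3 in s)
    return value - 0.01 * len(s)  # small penalty per included factor


one_pass = forward_backward(toy_score, n_factors=4, n_forward=1)
two_pass = forward_backward(toy_score, n_factors=4, n_forward=2)
print(one_pass, two_pass)
```

In this toy setting, factor 3 is dropped early in the first forward run because it brings nothing on its own, and it is recovered in the second run once factor 1 has been selected, which is the motivation for running the forward phase more than once.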

Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with the approximate search (approx_knn=True) performed by the inverted file index (IVF). IVF reduces the search scope by first clustering the data into Voronoi cells. To accelerate the computation further, subsamples can be used by specifying n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples provide a better approximation of the full data.

For more details about FIRST, please see Huang and Joseph (2024).

References

Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.

Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280.

Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.

Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1), 276-314.

Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., & Wu, X. (2020). Causality-based feature selection: Methods and evaluations. ACM Computing Surveys (CSUR), 53(5), 1-36.

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.

Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.

class pyfirst.SelectByFIRST(regression: bool = True, n_knn: int = None, approx_knn: bool = False, rescale: bool = True, n_forward: int = 2, n_jobs: int = 1, random_state: int | numpy.random.RandomState = None, verbose: bool = False)

Bases: sklearn.feature_selection._base.SelectorMixin, sklearn.base.BaseEstimator

Feature selector using FIRST

This implements the feature selector class for FIRST (Huang and Joseph, 2024), a model-independent feature selection algorithm based on total Sobol’ indices (Sobol’, 2001).

Parameters:

regression (bool, default=True) – A logical indicating whether the feature selector is for a regression or classification problem.
n_knn (int, default=None) – The number of nearest neighbors for the inner-loop conditional variance estimation. n_knn=2 is recommended for regression, and n_knn=3 for binary classification.
approx_knn (bool, default=False) – A logical indicating whether to use approximate nearest-neighbor search; otherwise exact search is used. It is supported when there are at least 10,000 data instances.
rescale (bool, default=True) – A logical indicating whether to standardize the factors / predictors.
n_forward (int, default=2) – The number of times to run the forward selection phase, to trade off efficiency against accuracy. n_forward=2 is recommended.
n_jobs (int, default=1) – The number of jobs to run in parallel. n_jobs=-1 means using all processors.
random_state (int or RandomState instance, default=None) – A seed for controlling the randomness in breaking ties in nearest-neighbor search and finding random subsamples.
verbose (bool, default=False) – A logical indicating whether to display intermediate results, e.g., the selected factor from each iteration.

importance_

Factor importance, with zero indicating that the factor is not important for predicting the response.

Type: np.ndarray

Notes

FIRST belongs to the class of forward-backward selection algorithms with early dropping (Borboudakis and Tsamardinos, 2019). In each forward selection step, we select the candidate that maximizes the explained output variance. Candidates that cannot improve the variance explained conditional on the selected factors are removed from the candidate set. The forward selection phase is run n_forward times to trade off accuracy against efficiency; n_forward=2 is recommended in Yu et al. (2020). In backward elimination, we remove one factor at a time, starting with the factor whose removal improves the explained variance the most, until no removal yields further improvement.

The importance is estimated via an adaptation of the Nearest-Neighbor estimator of Broto et al. (2020) for the total Sobol’ indices. Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with the approximate search (approx_knn=True) performed by the inverted file index (IVF). IVF reduces the search scope by first clustering the data into Voronoi cells.

For more details about FIRST, please see Huang and Joseph (2024).

References

Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.

Sobol’, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and computers in simulation, 55(1-3), 271-280.

Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.

Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1), 276-314.

Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., & Wu, X. (2020). Causality-based feature selection: Methods and evaluations. ACM Computing Surveys (CSUR), 53(5), 1-36.

Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.

n_knn

approx_knn

regression

rescale

n_forward

n_jobs

random_state

verbose

fit(X: pandas.DataFrame | numpy.ndarray, y: pandas.Series | numpy.ndarray, n_mc: int = None, twin_mc: bool = False)

Compute the factor importance from data

Parameters:

X (pd.DataFrame or np.ndarray) – A pd.DataFrame or np.ndarray for the factors / predictors.
y (pd.Series or np.ndarray) – A pd.Series or np.ndarray for the responses.
n_mc (int, default=None) – The number of Monte Carlo samples for the outer loop expectation estimation.
twin_mc (bool, default=False) – A logical indicating whether to use twinning subsamples; otherwise random subsamples are used. It is supported when the reduction ratio is at least 2.

Returns:

Returns the instance itself.

Return type:

object

Notes

To further accelerate the importance computation, subsamples can be used by specifying n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples provide a better approximation of the full data.
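The effect of random subsampling on the outer loop can be illustrated with a small standard-library sketch. It assumes that the subsample only determines which points the outer average runs over, while neighbors are still searched in the full data; that reading, the helper name mean_nn_variance, and its drop and seed arguments are assumptions made for the example. Twinning subsamples are not shown.

```python
import random
from statistics import variance


def mean_nn_variance(X, y, drop, n_mc=None, seed=0):
    """Average 2-nearest-neighbor variance of y with factor `drop` left out.

    The outer average runs over all points, or over n_mc random ones.
    """
    n = len(X)
    keep = [j for j in range(len(X[0])) if j != drop]
    if n_mc is None or n_mc >= n:
        outer = list(range(n))
    else:
        outer = random.Random(seed).sample(range(n), n_mc)
    total = 0.0
    for l in outer:
        # The nearest other point is searched in the full data set.
        nearest = min(
            (m for m in range(n) if m != l),
            key=lambda m: sum((X[l][j] - X[m][j]) ** 2 for j in keep),
        )
        total += variance([y[l], y[nearest]])
    return total / len(outer)


rng = random.Random(1)
X = [[rng.random() for _ in range(3)] for _ in range(400)]
y = [x[0] + 2.0 * x[1] for x in X]
full = mean_nn_variance(X, y, drop=1)            # outer loop over all 400 points
sub = mean_nn_variance(X, y, drop=1, n_mc=100)   # outer loop over 100 random points
print(round(full, 3), round(sub, 3))
```

For this toy function the target quantity is Var(2·x1) = 1/3; the subsampled estimate needs a quarter of the neighbor searches at the cost of a noisier average.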

References

Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.

get_feature_importance()

Get the feature importance

Returns: A numeric vector for the factor importance, with zero indicating that the factor is not important for predicting the response.
Return type: np.ndarray
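A typical workflow fits the selector and then reads the importance back. The helper below is a sketch that relies only on the two methods documented here, fit and get_feature_importance; the name summarize_selection is hypothetical, and StubSelector is a made-up stand-in with fixed values so that the example runs without pyfirst installed.

```python
def summarize_selection(selector, X, y, names):
    """Fit a FIRST-style selector and map each kept feature name to its importance."""
    selector.fit(X, y)
    importance = selector.get_feature_importance()
    # Zero importance means the feature is not needed to predict the response.
    return {name: float(v) for name, v in zip(names, importance) if v != 0}


class StubSelector:
    """Hypothetical stand-in exposing the same two methods (fixed, made-up values)."""

    def fit(self, X, y):
        self._importance = [0.3, 0.0, 0.1]
        return self

    def get_feature_importance(self):
        return self._importance


summary = summarize_selection(StubSelector(), [[0.0, 0.0, 0.0]], [0.0], ["a", "b", "c"])
print(summary)
```

With pyfirst installed, the stub would be replaced by an actual selector, for example `summarize_selection(SelectByFIRST(regression=True, random_state=0), X, y, list(X.columns))` for a pandas DataFrame X.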

_get_support_mask()

Get the boolean mask indicating which features are selected

Returns: A boolean vector with True indicating the feature is selected.
Return type: np.ndarray
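How a boolean mask reduces the columns of a data set can be shown with plain lists. The sketch assumes that the mask is True exactly where the importance is nonzero, which is inferred from the descriptions above rather than confirmed from the source code; all values are made up.

```python
# Hypothetical importance values, standing in for the fitted importance_ attribute.
importance = [0.17, 0.0, 0.42]

# Assumed rule: a feature is selected when its importance is nonzero.
mask = [v != 0 for v in importance]

# Applying the mask keeps only the selected columns of each row.
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
reduced = [[x for x, keep in zip(row, mask) if keep] for row in rows]

print(mask, reduced)
```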