FIRST
We now demonstrate how to use FIRST for factor importance ranking and selection. If you have not installed pyfirst, please uncomment and run %pip install pyfirst below before proceeding.
# %pip install pyfirst
Imports
import numpy as np
from pyfirst import FIRST
from sklearn.datasets import make_friedman1
Simulate Data
We simulate noisy data from the Friedman function
$$ y = f(X) + \epsilon = 10\sin(\pi X_{1}X_{2}) + 20(X_{3}-0.5)^2 + 10X_{4} + 5X_{5} + \epsilon, $$
where the input \(X\) are independent features uniformly distributed on unit hypercube and \(\epsilon\sim\mathcal{N}(0,1)\) is independent of input \(X\). Here only the first 5 features are used, and the remaining are independent of \(y\).
X, y = make_friedman1(n_samples=10000, n_features=10, noise=1.0, random_state=43)
Run FIRST
FIRST(X, y)
array([0.2648618 , 0.2387856 , 0.07689935, 0.32580823, 0.0700073 ,
0. , 0. , 0. , 0. , 0. ])
As expected, FIRST identifies the first 5 factors as important, and the remaining are not taking part in the prediction of the response.
Choice for Forward Selection
FIRST belongs to the class of forward-backward selection with early dropping algorithm (Borboudakis and Tsamardinos, 2019). In forward selection, each time we find the candidate that maximizes the output variance that can be explained. For candidates that cannot improve the variance explained conditional on the selected factors, they are removed from the candidate set. This forward selection step is run n_forward times to trade off between accuracy and efficiency. n_forward = 2 is recommended in Yu et al. (2020) and it is the default choice. To run the complete forward selection (see code below), please set n_forward to the number of factors / predictors. In backward elimination, we again remove one factor at a time, starting with the factor that can improve the explained variance most, till no factor can further improve.
FIRST(X, y, n_forward=X.shape[1], verbose=True)
y has more than two unique values, setting it to regression problem with suggested n_knn = 2.
Starting forward selection...
Phase-1 Forward Selection...
current selection:
current variance explained: 0.000
candidate to add: 0(5.257) 1(5.041) 2(1.962) 3(8.580) 4(2.028) 5(0.454) 6(0.044) 7(0.379) 8(0.000) 9(0.000)
add candidate 3(8.580).
current selection: 3
current variance explained: 8.580
candidate to add: 0(13.584) 1(13.179) 2(11.062) 4(11.474) 5(8.594) 6(8.312) 7(8.760)
add candidate 0(13.584).
current selection: 3 0
current variance explained: 13.584
candidate to add: 1(20.105) 2(16.034) 4(15.731) 5(13.345) 7(13.543)
add candidate 1(20.105).
current selection: 3 0 1
current variance explained: 20.105
candidate to add: 2(22.122) 4(21.959)
add candidate 2(22.122).
current selection: 3 0 1 2
current variance explained: 22.122
candidate to add: 4(23.788)
add candidate 4(23.788).
Phase-2 Forward Selection...
current selection: 3 0 1 2 4
current variance explained: 23.788
candidate to add: 5(23.224) 6(23.214) 7(23.196) 8(23.198) 9(23.220)
early termination since none of the candidates can be added in this phase.
Starting backward elimination...
current selection: 0 1 2 3 4
current variance explained: 23.788
candidate to remove: 0(17.487) 1(18.108) 2(21.959) 3(16.038) 4(22.122)
array([0.2648618 , 0.2387856 , 0.07689935, 0.32580823, 0.0700073 ,
0. , 0. , 0. , 0. , 0. ])
Speeding Up FIRST
Parallel Computation
If multiple processors are available, FIRST is supported to run in parallel for acceleration via the argument n_jobs.
FIRST(X, y, n_jobs=4)
array([0.2648618 , 0.2387856 , 0.07689935, 0.32580823, 0.0700073 ,
0. , 0. , 0. , 0. , 0. ])
Approximate Nearest-Neighbor Search
FIRST requires many nearest-neighbor searches. Faiss (Douze et al., 2024) is used for efficient nearest-neighbor search, with approximate search (approx_knn=True) by the inverted file index (IVF) is also supported in the implementation. IVF reduces the search scope through first clustering data into Voronoi cells.
FIRST(X, y, approx_knn=True)
array([0.2648618 , 0.23877103, 0.07689935, 0.32580823, 0.0700073 ,
0. , 0. , 0. , 0. , 0. ])
Using Subsamples
The use of subsamples to accelerate computation of the outer loop expectation is available via the argument n_mc. Both random and twinning (Vakayil and Joseph, 2022) subsamples are available, where twinning subsamples could provide better approximation for the full data at a higher computational cost.
Random Subsamples
FIRST(X, y, n_mc=1000, random_state=43)
array([0.26198021, 0.23528031, 0.07956943, 0.34468169, 0.06488056,
0. , 0. , 0. , 0. , 0. ])
Twinning Subsamples
FIRST(X, y, n_mc=1000, twin_mc=True, random_state=43)
array([0.26132392, 0.22754917, 0.06887766, 0.31153919, 0.06237415,
0. , 0. , 0. , 0. , 0. ])
All the Tricks Together
Using all the speed-up tricks, we can easily run FIRST on dataset with a million instances.
X, y = make_friedman1(n_samples=1000000, n_features=10, noise=1.0, random_state=43)
FIRST(X, y, n_mc=1000, approx_knn=True, n_jobs=4, random_state=43)
array([0.27360209, 0.30479748, 0.08815202, 0.34269827, 0.08091292,
0. , 0. , 0. , 0. , 0. ])
For more details about FIRST, please Huang and Joseph (2024).
References
Huang, C., & Joseph, V. R. (2024). Factor Importance Ranking and Selection using Total Indices. arXiv preprint arXiv:2401.00800.
Borboudakis, G., & Tsamardinos, I. (2019). Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1), 276-314.
Yu, K., Guo, X., Liu, L., Li, J., Wang, H., Ling, Z., & Wu, X. (2020). Causality-based feature selection: Methods and evaluations. ACM Computing Surveys (CSUR), 53(5), 1-36.
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., & Jégou, H., (2024). The Faiss library. arXiv preprint arXiv:2401.08281.
Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.