SelectByFIRST
We now demonstrate how to use FIRST for factor selection within a sklearn.pipeline.Pipeline via the SelectByFIRST class. If you have not installed pyfirst, uncomment and run %pip install pyfirst below before proceeding.
# %pip install pyfirst
Imports
import numpy as np
from pyfirst import SelectByFIRST
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing, load_breast_cancer
Regression
Fetch Data
housing = fetch_california_housing()
X = housing.data
y = np.log(housing.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
Build Pipeline and Train
pipe = Pipeline([
    ('selector', SelectByFIRST(regression=True, approx_knn=True, random_state=43)),
    ('estimator', RandomForestRegressor(random_state=43))
]).fit(X_train, y_train)
Test \(R^2\) vs Full Model
pipe.score(X_test, y_test)
0.8536872726940326
full = RandomForestRegressor(random_state=43).fit(X_train, y_train)
full.score(X_test, y_test)
0.8459557819565909
pipe['selector'].get_feature_importance()
array([0.00734335, 0. , 0. , 0. , 0. ,
0.01010999, 0.13941309, 0.15473792])
The random forest fitted on the 4 factors identified by FIRST achieves a test \(R^2\) of 0.854, comparable to (here slightly higher than) the 0.846 obtained by the random forest fitted on all 8 factors.
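To see which factors were kept, the importance array printed above can be mapped back to feature names. The sketch below is only illustrative: it assumes that a factor counts as selected exactly when its FIRST importance is strictly positive (which is consistent with the count of 4 stated above), and it hardcodes both the printed scores and the column names reported by housing.feature_names so that it runs without pyfirst or a data download.

```python
import numpy as np

# Importance scores copied from the get_feature_importance() output above.
importance = np.array([0.00734335, 0.0, 0.0, 0.0, 0.0,
                       0.01010999, 0.13941309, 0.15473792])

# Column order of fetch_california_housing().feature_names.
feature_names = np.array(["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                          "Population", "AveOccup", "Latitude", "Longitude"])

# Assumption: a factor is "selected" when its importance is strictly positive.
selected_mask = importance > 0
selected = feature_names[selected_mask]

# Rank the selected factors from most to least important.
order = np.argsort(importance[selected_mask])[::-1]
ranked = list(zip(selected[order], importance[selected_mask][order]))
for name, score in ranked:
    print(f"{name:>10s}  {score:.4f}")
```

Under that assumption, the retained factors are the median income, the average occupancy, and the two location coordinates, with the coordinates carrying most of the importance.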
Binary Classification
Fetch Data
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
Build Pipeline and Train
pipe = Pipeline([
    ('selector', SelectByFIRST(regression=False, random_state=43)),
    ('estimator', RandomForestClassifier(random_state=43))
]).fit(X_train, y_train)
Test Accuracy vs Full Model
pipe.score(X_test, y_test)
0.9736842105263158
full = RandomForestClassifier(random_state=43).fit(X_train, y_train)
full.score(X_test, y_test)
0.9912280701754386
pipe['selector'].get_feature_importance()
array([0. , 0.04416144, 0.01358814, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.00339703, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.00339703, 0.0101911 , 0.14267543,
0. , 0. , 0. , 0. , 0. ])
The random forest fitted on the 6 factors identified by FIRST achieves a test accuracy of 0.974, comparable to the 0.991 obtained by the random forest fitted on all 30 factors.
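To put the accuracy gap in concrete terms, the two scores above can be converted into counts of correctly classified test samples. This is a rough back-of-the-envelope sketch: it assumes the breast cancer dataset has 569 samples and that train_test_split rounds the test set size up when test_size=0.2, and it takes the two accuracies verbatim from the outputs above.

```python
import math

n_samples = 569                       # size of the breast cancer dataset
n_test = math.ceil(0.2 * n_samples)   # assumed: the test set size is rounded up

pipe_acc = 0.9736842105263158         # pipeline with SelectByFIRST (from above)
full_acc = 0.9912280701754386         # random forest on all factors (from above)

# Convert accuracies back to counts of correctly classified test samples.
pipe_correct = round(pipe_acc * n_test)
full_correct = round(full_acc * n_test)

print(n_test, pipe_correct, full_correct, full_correct - pipe_correct)
```

With a test set of this size, the difference between the two models amounts to only a couple of samples, which is why the accuracies are described as comparable.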