SelectByFIRST

We now demonstrate how to use FIRST for factor selection inside an sklearn.pipeline.Pipeline via the SelectByFIRST class. If you have not installed pyfirst, uncomment and run the %pip install pyfirst line below before proceeding.

# %pip install pyfirst

Imports

import numpy as np
from pyfirst import SelectByFIRST
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing, load_breast_cancer

Regression

Fetch Data

housing = fetch_california_housing()
X = housing.data
y = np.log(housing.target)  # model the log of the median house value
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

Build Pipeline and Train

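pipe = Pipeline([
    # Step 1: keep only the factors selected by FIRST
    ('selector', SelectByFIRST(regression=True, approx_knn=True, random_state=43)),
    # Step 2: fit a random forest on the selected factors
    ('estimator', RandomForestRegressor(random_state=43))
]).fit(X_train, y_train)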

Test \(R^2\) vs Full Model

pipe.score(X_test, y_test)

0.8536872726940326

full = RandomForestRegressor(random_state=43).fit(X_train, y_train)
full.score(X_test, y_test)

0.8459557819565909

pipe['selector'].get_feature_importance()

array([0.00734335, 0.        , 0.        , 0.        , 0.        ,
       0.01010999, 0.13941309, 0.15473792])

The random forest fitted on the 4 factors selected by FIRST achieves a test \(R^2\) (0.854) comparable to that of the random forest fitted on all 8 factors (0.846).
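
To see which factors these are, the importance array can be mapped back to column names. The sketch below is illustrative only: it hard-codes the scores printed above and the column order of fetch_california_housing().feature_names, and it assumes that a strictly positive score marks a factor kept by FIRST (the names importance, feature_names, selected_idx and selected_names are introduced here for illustration).

```python
import numpy as np

# Importance scores copied from the get_feature_importance() output above.
importance = np.array([0.00734335, 0.0, 0.0, 0.0, 0.0,
                       0.01010999, 0.13941309, 0.15473792])

# Column names in the order of fetch_california_housing().feature_names.
feature_names = np.array(["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                          "Population", "AveOccup", "Latitude", "Longitude"])

# Assumption: a strictly positive score marks a factor kept by FIRST.
selected_idx = np.flatnonzero(importance > 0)
selected_names = feature_names[selected_idx].tolist()

print(selected_idx.tolist())  # [0, 5, 6, 7]
print(selected_names)         # ['MedInc', 'AveOccup', 'Latitude', 'Longitude']
```

In a live session the same lookup would presumably use housing.feature_names and the array returned by pipe['selector'].get_feature_importance() directly instead of the copied values.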

Binary Classification

Fetch Data

cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

Build Pipeline and Train

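pipe = Pipeline([
    # Step 1: keep only the factors selected by FIRST (classification mode)
    ('selector', SelectByFIRST(regression=False, random_state=43)),
    # Step 2: fit a random forest on the selected factors
    ('estimator', RandomForestClassifier(random_state=43))
]).fit(X_train, y_train)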

Test Accuracy vs Full Model

pipe.score(X_test, y_test)

0.9736842105263158

full = RandomForestClassifier(random_state=43).fit(X_train, y_train)
full.score(X_test, y_test)

0.9912280701754386

pipe['selector'].get_feature_importance()

array([0.        , 0.04416144, 0.01358814, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00339703, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00339703, 0.0101911 , 0.14267543,
       0.        , 0.        , 0.        , 0.        , 0.        ])
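
With 30 columns the non-zero entries are harder to spot by eye, so it can help to rank them. The following is a minimal sketch that rebuilds the array printed above by hand and sorts the column indices by score. It assumes, as before, that a non-zero score marks a selected factor; the variable names are illustrative.

```python
import numpy as np

# Rebuild the 30-entry importance array from the non-zero scores printed above.
importance = np.zeros(30)
nonzero_idx = [1, 2, 12, 22, 23, 24]
importance[nonzero_idx] = [0.04416144, 0.01358814, 0.00339703,
                           0.00339703, 0.0101911, 0.14267543]

# Assumption: a non-zero score marks a factor kept by FIRST.
n_selected = int(np.count_nonzero(importance))

# Sort column indices by descending score; a stable sort keeps tied
# columns (here 12 and 22) in their original order.
order = np.argsort(-importance, kind="stable")
ranked = order[:n_selected].tolist()

print(n_selected)  # 6
print(ranked)      # [24, 1, 2, 23, 12, 22]
```

The indices refer to the columns of cancer.data, so cancer.feature_names[ranked] would give the corresponding names in a live session.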

The random forest fitted on the 6 factors selected by FIRST reaches a test accuracy (0.974) close to that of the random forest fitted on all 30 factors (0.991).
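
To put the accuracy gap in concrete terms, the two scores can be converted back into counts of correctly classified test samples. This is a back-of-the-envelope sketch: it relies on the breast cancer dataset having 569 rows and on train_test_split rounding a fractional test share up, and it copies the two accuracies from the outputs above.

```python
import math

n_samples = 569                      # rows in load_breast_cancer()
n_test = math.ceil(0.2 * n_samples)  # test_size=0.2, rounded up by train_test_split

acc_selected = 0.9736842105263158    # pipeline with SelectByFIRST
acc_full = 0.9912280701754386        # random forest on all factors

# Accuracy is correct / n_test, so multiply back to recover the counts.
correct_selected = round(acc_selected * n_test)
correct_full = round(acc_full * n_test)

print(n_test, correct_selected, correct_full)  # 114 111 113
print(correct_full - correct_selected)         # 2
```

Under these assumptions the difference between the two models comes down to 2 of the 114 test samples.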