{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SelectByFIRST\n", "\n", "We now demonstrate how to use FIRST for factor selection within a `sklearn.pipeline.Pipeline` via the `SelectByFIRST` class. If you have not installed `pyfirst`, uncomment and run `%pip install pyfirst` below before proceeding." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# %pip install pyfirst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from pyfirst import SelectByFIRST\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import fetch_california_housing, load_breast_cancer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fetch Data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "housing = fetch_california_housing()\n", "X = housing.data\n", "y = np.log(housing.target)  # log-transform the median house value\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build Pipeline and Train" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Select factors with FIRST, then fit a random forest on the selected factors\n", "pipe = Pipeline([\n", "    ('selector', SelectByFIRST(regression=True, approx_knn=True, random_state=43)),\n", "    ('estimator', RandomForestRegressor(random_state=43))\n", "]).fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test $R^2$ vs Full Model" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8536755908380326" ] }, "execution_count": 5, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.845913188558511" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Baseline: the same random forest fitted on all factors\n", "full = RandomForestRegressor(random_state=43).fit(X_train, y_train)\n", "full.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.00734335, 0. , 0. , 0. , 0. ,\n", " 0.01010999, 0.13941309, 0.15473792])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe['selector'].get_feature_importance()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The random forest fitted on the 4 factors identified by FIRST (the nonzero entries in the importance array above) attains a test $R^2$ of about 0.854, comparable to the 0.846 of the random forest fitted on all 8 factors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binary Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fetch Data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "cancer = load_breast_cancer()\n", "X = cancer.data\n", "y = cancer.target\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build Pipeline and Train" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Select factors with FIRST, then fit a random forest on the selected factors\n", "pipe = Pipeline([\n", "    ('selector', SelectByFIRST(regression=False, random_state=43)),\n", "    ('estimator', RandomForestClassifier(random_state=43))\n", "]).fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test Accuracy vs Full Model" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9736842105263158" ] }, 
"execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9912280701754386" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Baseline: the same random forest fitted on all factors\n", "full = RandomForestClassifier(random_state=43).fit(X_train, y_train)\n", "full.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0. , 0.04416144, 0.01358814, 0. , 0. ,\n", " 0. , 0. , 0. , 0. , 0. ,\n", " 0. , 0. , 0.00339703, 0. , 0. ,\n", " 0. , 0. , 0. , 0. , 0. ,\n", " 0. , 0. , 0.00339703, 0.0101911 , 0.14267543,\n", " 0. , 0. , 0. , 0. , 0. ])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe['selector'].get_feature_importance()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The random forest fitted on the 6 factors identified by FIRST (the nonzero entries in the importance array above) attains a test accuracy of about 0.974, comparable to the 0.991 of the random forest fitted on all 30 factors." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 2 }