Feature selection is a crucial step in machine learning, as it helps to reduce the dimensionality of the dataset, improve model performance, and increase interpretability.
XGBoost, a powerful gradient boosting library, provides built-in feature importance scores that can be used for feature selection.
This example demonstrates how to leverage XGBoost’s feature importance scores to select the most relevant features and train a model using only those features with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
# Generate a random dataset with 100 features
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90, random_state=42)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Train the model
model.fit(X_train, y_train)
# Get the feature importance scores
importance_scores = model.feature_importances_
# Select the top 10 most important features
selected_features = importance_scores.argsort()[-10:]
# Create a new XGBClassifier with the selected features
selected_model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Train the new model using only the selected features
selected_model.fit(X_train[:, selected_features], y_train)
# Evaluate the original model
y_pred = model.predict(X_test)
original_accuracy = accuracy_score(y_test, y_pred)
# Evaluate the model with selected features
y_pred_selected = selected_model.predict(X_test[:, selected_features])
selected_accuracy = accuracy_score(y_test, y_pred_selected)
print(f"Original Model Accuracy: {original_accuracy:.4f}")
print(f"Selected Features Model Accuracy: {selected_accuracy:.4f}")
# Using scikit-learn's SelectFromModel for feature selection
# By default, SelectFromModel keeps features whose importance is above the mean importance
selector = SelectFromModel(model, prefit=True)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
# Train a new model using the selected features
selected_model_sfm = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
selected_model_sfm.fit(X_train_selected, y_train)
# Evaluate the model trained on the SelectFromModel-selected features
y_pred_sfm = selected_model_sfm.predict(X_test_selected)
selected_accuracy_sfm = accuracy_score(y_test, y_pred_sfm)
print(f"Selected Features Model Accuracy (SelectFromModel): {selected_accuracy_sfm:.4f}")
In this example, we generate a random dataset with 100 features using scikit-learn’s make_classification function, where only 10 features are informative and the remaining 90 are redundant. We then train an XGBoost classifier on the dataset and retrieve the feature importance scores via the feature_importances_ attribute. Based on these scores, we select the top 10 most important features using argsort().
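Note that feature_importances_ exposes a single importance type. If you want to compare alternative definitions of importance, one option (shown below as a sketch, not part of the original example) is to query the trained booster directly; get_score supports importance types such as "gain", "weight", and "cover", and returns a dict keyed by feature name (here "f0" through "f99"), omitting features that were never used in a split.
# Sketch: inspect alternative importance types from the underlying booster
booster = model.get_booster()
gain_importance = booster.get_score(importance_type="gain")
top_by_gain = sorted(gain_importance, key=gain_importance.get, reverse=True)[:10]
print("Top 10 features by gain:", top_by_gain)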
Next, we create a new XGBoost classifier and train it using only the selected features. We evaluate both the original model and the model trained with selected features using the accuracy_score metric.
Additionally, we demonstrate how to use scikit-learn’s SelectFromModel class to streamline the feature selection process. We create a SelectFromModel object from the pre-trained XGBoost model (prefit=True) and use it to transform the train and test sets, keeping only the features whose importance exceeds the default threshold (the mean importance). We then train a new XGBoost classifier on the selected features and evaluate its performance.
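If you prefer the selection step and the final estimator bundled into a single object, one possible variation (a sketch, reusing the hyperparameters from the example above) is to wrap SelectFromModel and XGBClassifier in a scikit-learn Pipeline, so that feature scoring, selection, and the final fit all happen inside one fit call:
from sklearn.pipeline import Pipeline
# Sketch: SelectFromModel fits its own XGBClassifier to score features,
# then the final XGBClassifier is trained on the selected columns
pipe = Pipeline([
    ("select", SelectFromModel(XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42))),
    ("clf", XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)),
])
pipe.fit(X_train, y_train)
print(f"Pipeline Accuracy: {accuracy_score(y_test, pipe.predict(X_test)):.4f}")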
By comparing the accuracy scores, we can assess the impact of feature selection on the model’s performance. In many cases, using only the most important features can lead to improved accuracy and reduced complexity.
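To explore this trade-off more systematically, one rough approach (a sketch; the subset sizes are arbitrary) is to sweep SelectFromModel’s threshold over the sorted importance scores and watch how accuracy changes with the number of retained features:
import numpy as np
# Sketch: retrain on progressively larger feature subsets and compare accuracy
sorted_importances = np.sort(model.feature_importances_)[::-1]
for n in (5, 10, 20, 50):
    # Keep (roughly) the n features whose importance is at least the n-th largest score
    selection = SelectFromModel(model, threshold=sorted_importances[n - 1], prefit=True)
    X_tr, X_te = selection.transform(X_train), selection.transform(X_test)
    sub_model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
    sub_model.fit(X_tr, y_train)
    acc = accuracy_score(y_test, sub_model.predict(X_te))
    print(f"n_features={X_tr.shape[1]}, accuracy={acc:.4f}")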
This example showcases how XGBoost’s feature importance scores can be leveraged for effective feature selection, allowing data scientists and machine learning engineers to focus on the most relevant features and potentially enhance model performance.