XGBoost Feature Selection with RFE

Recursive Feature Elimination (RFE) selects features by repeatedly fitting a model, ranking the features by importance, and dropping the weakest ones until the requested number remains. Narrowing the inputs to the most relevant features can help improve model performance and reduce training time.

This example demonstrates how to use RFE with XGBoost and compares the performance of models trained with and without feature selection.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
import time

# Generate a synthetic dataset with 100 features, 10 of which are informative
X, y = make_classification(n_samples=5000, n_features=100, n_informative=10,
                           n_redundant=30, n_repeated=10, random_state=42)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize an XGBoost model and an RFE object
model = XGBClassifier(random_state=42)
rfe = RFE(estimator=model, n_features_to_select=20)

# Fit the RFE object with the XGBoost model and the training data
rfe.fit(X_train, y_train)

# Train an XGBoost model with all features
start_time = time.perf_counter()
model_all = XGBClassifier(random_state=42)
model_all.fit(X_train, y_train)
all_features_time = time.perf_counter() - start_time

# Train an XGBoost model with the selected features from RFE
start_time = time.perf_counter()
model_selected = XGBClassifier(random_state=42)
model_selected.fit(X_train[:, rfe.support_], y_train)
selected_features_time = time.perf_counter() - start_time

# Make predictions on the test set with both models
y_pred_all = model_all.predict(X_test)
y_pred_selected = model_selected.predict(X_test[:, rfe.support_])

# Compare the performance of the models
accuracy_all = accuracy_score(y_test, y_pred_all)
accuracy_selected = accuracy_score(y_test, y_pred_selected)

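print(f"Accuracy with all features: {accuracy_all:.4f}")
print(f"Accuracy with selected features: {accuracy_selected:.4f}")
print(f"Training time with all features: {all_features_time:.2f} seconds")
print(f"Training time with selected features: {selected_features_time:.2f} seconds")

# Print the selected feature indices and their importance scores in the reduced model
for idx, score in zip(rfe.get_support(indices=True), model_selected.feature_importances_):
    print(f"Feature {idx}: importance {score:.4f}")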

This example:

Generates a synthetic dataset with 100 features, 10 of which are informative, 30 are redundant, and 10 are repeated.
Splits the data into train and test sets.
Initializes an XGBoost model and an RFE object set to select the 20 most important features.
Fits the RFE object with the XGBoost model and the training data.
Trains two XGBoost models: one with all features and one with the selected features from RFE.
Makes predictions on the test set with both models and compares their accuracy and training time.
Prints the selected features and their importance scores.

On a typical run, the model trained on the selected features reaches accuracy close to that of the model trained on all features while taking less time to fit. This suggests that RFE kept the features that matter most to the XGBoost model. Exact numbers will vary with hardware and library versions.

See Also