XGBoost Remove Least Important Features

XGBoost implicitly favors the most informative features when it chooses splits during training, but manually removing the least important features can still be beneficial in some cases, for example to reduce training time.

This example investigates the effect of removing the least important features on XGBoost model training time and performance by comparing results with all features versus a subset of features.

Here’s an example that demonstrates this concept:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import xgboost as xgb
import time

# Generate a synthetic dataset with 20 features, 10 of which are informative
X, y = make_classification(n_samples=1000000, n_features=20, n_informative=10,
                           n_redundant=5, n_repeated=5, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an XGBoost model with all features
start_time = time.perf_counter()
model_all = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_all.fit(X_train, y_train)
train_time_all = time.perf_counter() - start_time

# Evaluate the model's performance
y_pred_all = model_all.predict(X_test)
accuracy_all = accuracy_score(y_test, y_pred_all)
f1_all = f1_score(y_test, y_pred_all)

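# Train an XGBoost model with a subset of features (least important features removed)
n_features_to_remove = 10
feature_importances = model_all.feature_importances_
# argsort is ascending, so the first entries index the least important features
ranked_features = feature_importances.argsort()
least_important_features = ranked_features[:n_features_to_remove]

# Keep the remaining columns in their original order.
# Note: ~ on an integer index array gives negative indices, not the complement.
features_to_keep = sorted(ranked_features[n_features_to_remove:])
X_train_subset = X_train[:, features_to_keep]
X_test_subset = X_test[:, features_to_keep]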

start_time = time.perf_counter()
model_subset = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_subset.fit(X_train_subset, y_train)
train_time_subset = time.perf_counter() - start_time

# Evaluate the subset model's performance
y_pred_subset = model_subset.predict(X_test_subset)
accuracy_subset = accuracy_score(y_test, y_pred_subset)
f1_subset = f1_score(y_test, y_pred_subset)

# Print the results
print(f"Training time with all features: {train_time_all:.2f} seconds")
print(f"Training time with subset of features: {train_time_subset:.2f} seconds")
print(f"\nAccuracy with all features: {accuracy_all:.4f}")
print(f"Accuracy with subset of features: {accuracy_subset:.4f}")
print(f"\nF1-score with all features: {f1_all:.4f}")
print(f"F1-score with subset of features: {f1_subset:.4f}")

The results may look like the following (exact numbers vary with hardware and library versions):

Training time with all features: 3.31 seconds
Training time with subset of features: 2.64 seconds

Accuracy with all features: 0.9567
Accuracy with subset of features: 0.9367

F1-score with all features: 0.9570
F1-score with subset of features: 0.9369

In this example, we generate a synthetic dataset with 20 features: 10 informative, 5 redundant, and 5 repeated.
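To make the feature mix concrete, here is a small sketch showing that the repeated features are exact copies of other columns. It assumes `shuffle=False` (not used in the example above) so that the columns stay in the order informative, redundant, repeated. The smaller sample size and the `duplicate_of` name are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same feature mix as the main example, but smaller and unshuffled so that
# columns 0-14 are informative/redundant and columns 15-19 are repeated.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_repeated=5, shuffle=False,
                           random_state=42)

n_source = 15  # informative + redundant columns come first
duplicate_of = {}  # maps repeated column index -> index of the column it copies
for j in range(n_source, X.shape[1]):
    for i in range(n_source):
        if np.array_equal(X[:, j], X[:, i]):
            duplicate_of[j] = i
            break

print(duplicate_of)
```

Because a repeated column carries no information beyond the column it copies, such features are natural candidates for removal.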

We train two XGBoost models: one with all features and another with a subset of features (least important features removed). We measure and compare the training times for both models and evaluate their performance using accuracy and F1-score.

In the sample output above, removing 10 of the 20 features reduced training time from 3.31 to 2.64 seconds (about 20%), while accuracy dropped from 0.9567 to 0.9367 and F1-score from 0.9570 to 0.9369. In some cases, removing the least important features speeds up training without significantly affecting model performance. The size of the effect depends on the dataset and the number of features removed.

Removing the least important features can help in certain situations, but it should be done with caution: removing too many features, or features the model relies on, can hurt model performance. Experiment with different subsets of features and evaluate their effect on both training time and model performance to find the right balance for your specific use case.
