XGBoost offers multiple methods to calculate feature importance, including the “total_gain” method, which sums the gain (the improvement in the training objective) from every split in the model that uses a given feature.
This example demonstrates how to configure XGBoost to use the “total_gain” method and retrieve the feature importance scores using XGBClassifier, XGBoost's scikit-learn-compatible estimator.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBClassifier with the importance_type parameter set to "total_gain"
model = XGBClassifier(n_estimators=100, learning_rate=0.1, importance_type="total_gain", random_state=42)
# Train the model
model.fit(X_train, y_train)
# Retrieve the "total_gain" feature importance scores
importance_scores = model.feature_importances_
# Print the feature importance scores along with feature names
for feature, score in zip(data.feature_names, importance_scores):
    print(f"{feature}: {score}")
In this example, we load the Breast Cancer Wisconsin dataset and split it into train and test sets. We then create an XGBClassifier, XGBoost's scikit-learn-compatible estimator, with the importance_type parameter set to "total_gain". This configures the model to report feature importance as the total gain of each feature across all splits in the model.
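After training the model on the training data, we retrieve the “total_gain” feature importance scores using the feature_importances_ attribute of the trained model. This attribute returns an array with one score per feature, in the same order as the dataset's columns. The scikit-learn wrapper normalizes the scores so that they sum to 1, so each value is a feature's share of the model's total gain rather than the raw gain itself.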
Finally, we print the feature importance scores along with their corresponding feature names using the feature_names attribute of the loaded dataset.
The “total_gain” feature importance method is particularly useful when you want to understand the overall contribution of each feature to the model’s performance, because it reflects both how often a feature is used to split and how large the gain of those splits is. Setting the importance_type parameter to "total_gain" when creating the XGBClassifier is all that is needed to compute and retrieve these scores, which indicate the relative significance of each feature in the trained model.