XGBoost offers multiple ways to calculate feature importance, one of which is the “gain” method.
This method scores a feature by its average gain, that is, the average reduction in the training loss across all the splits in which the feature is used. In other words, it quantifies how much each feature contributes to improving the model’s fit.
This example demonstrates how to configure XGBoost to use the “gain” method and retrieve the feature importance scores using XGBClassifier, XGBoost’s scikit-learn-compatible classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBClassifier with the importance_type parameter set to "gain"
model = XGBClassifier(n_estimators=100, learning_rate=0.1, importance_type="gain", random_state=42)
# Train the model
model.fit(X_train, y_train)
# Retrieve the "gain" feature importance scores
importance_scores = model.feature_importances_
# Print the feature importance scores along with feature names
for feature, score in zip(data.feature_names, importance_scores):
    print(f"{feature}: {score}")
To begin, we load the Breast Cancer Wisconsin dataset and split it into train and test sets.
Next, we create an instance of XGBClassifier, XGBoost’s scikit-learn-compatible classifier, with the importance_type parameter set to "gain". This configures the model to report feature importance as the average gain of each feature across the splits that use it.
We then train the model on the training data using the fit() method.
After training, we retrieve the “gain” feature importance scores using the feature_importances_ attribute of the trained model. This attribute returns an array with one score per feature, in the same order as the columns of the training data. The scores are normalized so that they sum to 1.
Finally, we print the feature importance scores along with their corresponding feature names using the feature_names attribute of the loaded dataset.
By setting the importance_type parameter to "gain" when creating an XGBoost model through the scikit-learn API, you can easily configure the model to calculate feature importance based on the average gain of each feature. The feature_importances_ attribute allows you to retrieve these scores after training, providing insight into the relative contribution of each feature to the model’s performance.