XGBoost provides several ways to calculate feature importance, including the “weight” method, which is based on the number of times a feature is used to split the data across all trees.
This example demonstrates how to configure XGBoost to use the “weight” method and retrieve the feature importance scores through XGBoost’s scikit-learn API (the XGBClassifier class).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBClassifier with the importance_type parameter set to "weight"
model = XGBClassifier(n_estimators=100, learning_rate=0.1, importance_type="weight", random_state=42)
# Train the model
model.fit(X_train, y_train)
# Retrieve the "weight" feature importance scores
importance_scores = model.feature_importances_
# Print the feature importance scores along with feature names
for feature, score in zip(data.feature_names, importance_scores):
    print(f"{feature}: {score}")
In this example, we load the Breast Cancer Wisconsin dataset and split it into train and test sets. We then create an instance of XGBoost’s scikit-learn-compatible XGBClassifier with the importance_type parameter set to "weight". This configures XGBoost to calculate feature importance based on the number of times a feature is used to split the data across all trees.
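The "weight" metric is only one of the importance types XGBoost accepts; "gain", "cover", "total_gain", and "total_cover" are also supported. As a minimal sketch, reusing the same train/test split as above (the gain_model name is just illustrative), an otherwise identical model can report gain-based importance instead:

from xgboost import XGBClassifier

# Same configuration as above, but the importance scores will reflect the average
# gain (loss reduction) contributed by each feature's splits rather than split counts
gain_model = XGBClassifier(
    n_estimators=100, learning_rate=0.1, importance_type="gain", random_state=42
)
gain_model.fit(X_train, y_train)
print(gain_model.feature_importances_)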
After training the model on the training data, we retrieve the “weight” feature importance scores using the feature_importances_ attribute of the trained model. This attribute returns an array of importance scores, where each score corresponds to a feature in the dataset.
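If you want the raw split counts rather than the values exposed by feature_importances_, the underlying Booster object provides get_score. The sketch below, continuing from the fitted model above, retrieves the unprocessed "weight" scores; note that XGBoost keys them by generated names (f0, f1, ...) when the training data is a plain NumPy array.

# Access the underlying Booster and request the raw "weight" scores,
# i.e. the number of times each feature appears in a split
booster = model.get_booster()
raw_counts = booster.get_score(importance_type="weight")
# Dictionary keyed by feature name (f0, f1, ...); only features that were
# actually used in at least one split appear in the result
print(raw_counts)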
Finally, we print the feature importance scores along with their corresponding feature names using the feature_names attribute of the loaded dataset.
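The loop above prints features in dataset order, which makes the most influential ones hard to spot. A small variation, assuming NumPy is available, sorts them by score before printing:

import numpy as np

# Sort features from most to least frequently used in splits
order = np.argsort(importance_scores)[::-1]
for idx in order:
    print(f"{data.feature_names[idx]}: {importance_scores[idx]:.4f}")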
By setting the importance_type parameter to "weight" when creating an XGBoost model through the scikit-learn API, you can easily configure the model to calculate feature importance based on the number of times a feature is used to split the data. The feature_importances_ attribute lets you retrieve these scores after training, providing insight into the relative importance of each feature in the model’s decision-making process.
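For a quick visual check, XGBoost also ships a plot_importance helper that works directly on the fitted model. A minimal sketch, assuming matplotlib is installed and continuing from the model trained above, might look like this:

import matplotlib.pyplot as plt
from xgboost import plot_importance

# Plot the "weight" importance scores as a horizontal bar chart,
# keeping only the ten highest-ranked features
plot_importance(model, importance_type="weight", max_num_features=10)
plt.tight_layout()
plt.show()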