
XGBoost Feature Importance with get_score()

The get_score() method, available on XGBoost's trained Booster object, lets you programmatically access the feature importance scores of your model.

By utilizing this method, you can gain insights into which features have the most significant impact on your model’s predictions.

In this example, we’ll demonstrate how to use get_score() with a real-world dataset.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix objects
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
}

# Train the XGBoost model
model = xgb.train(params, dtrain, num_boost_round=100, evals=[(dtest, 'test')])

# Get feature importance scores
importance_scores = model.get_score(importance_type='weight')
print("Feature Importance Scores:")
for feature, score in importance_scores.items():
    print(f"{feature}: {score}")

In this example, we load the Breast Cancer Wisconsin dataset and split it into train and test sets. We then create DMatrix objects for XGBoost and set the model parameters. After training the model using xgb.train(), we use the get_score() method of the trained model object to obtain the feature importance scores.

By default, get_score() returns the feature importance scores as a dictionary, where the keys are the feature names and the values are the corresponding importance scores. Because the DMatrix objects here were built from plain NumPy arrays, the features receive default names of the form f0, f1, ..., corresponding to their column indices. The importance_type parameter is set to 'weight', which means each score counts the number of times a feature is used to split the data across all trees; features that are never used in a split do not appear in the dictionary.

The output of get_score() will look something like this:

Feature Importance Scores:
f0: 10
f1: 3
f2: 8
...
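If you want human-readable names instead of the default f0, f1, ... keys, one option is to map each key back to the dataset's feature_names array. This is a small sketch reusing the data and importance_scores objects from the example above:

# Map default feature keys (f0, f1, ...) back to the dataset's column names.
# Assumes `data` and `importance_scores` from the example above are in scope.
named_scores = {
    data.feature_names[int(key[1:])]: score
    for key, score in importance_scores.items()
}
print("Feature Importance Scores (named):")
for feature, score in named_scores.items():
    print(f"{feature}: {score}")

Alternatively, you can pass feature_names=list(data.feature_names) when constructing the DMatrix objects so that get_score() returns the original column names directly.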

To make the output more interpretable, you can sort the feature importance scores in descending order:

sorted_scores = sorted(importance_scores.items(), key=lambda x: x[1], reverse=True)
print("Feature Importance Scores (Sorted):")
for feature, score in sorted_scores:
    print(f"{feature}: {score}")

This will display the features in order of their importance, with the most important feature at the top.

The importance scores obtained from get_score() provide valuable information about the relative importance of each feature in the model. Features with higher scores have a greater influence on the model’s predictions. By examining these scores, you can gain insights into which features are most relevant to the problem at hand and potentially use this information for feature selection or further analysis.
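As a simple illustration of using the scores for feature selection, you could keep only the highest-scoring features and rebuild the training matrices from those columns. This is a sketch, not part of the original example; the cutoff of 10 features is arbitrary, and it reuses sorted_scores, X_train, X_test, y_train, and y_test from above:

# Keep the 10 highest-scoring features (default keys look like "f0", "f1", ...).
# The cutoff of 10 is an arbitrary, illustrative choice.
top_k = 10
top_features = [feature for feature, _ in sorted_scores[:top_k]]
top_indices = [int(feature[1:]) for feature in top_features]

# Rebuild DMatrix objects using only the selected columns
dtrain_top = xgb.DMatrix(X_train[:, top_indices], label=y_train)
dtest_top = xgb.DMatrix(X_test[:, top_indices], label=y_test)

You could then retrain the model on dtrain_top and compare its test performance against the full-feature model to judge whether the discarded features were contributing anything useful.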

Keep in mind that the interpretation of the importance scores may vary depending on the importance_type parameter. XGBoost offers several importance types, such as 'weight', 'gain', and 'cover', each providing a different perspective on feature importance.
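As a quick follow-up sketch reusing the trained model from above, the loop below prints the top five features under each importance type that get_score() accepts ('weight', 'gain', 'cover', 'total_gain', and 'total_cover'):

# Compare the scores returned by different importance types
for imp_type in ['weight', 'gain', 'cover', 'total_gain', 'total_cover']:
    scores = model.get_score(importance_type=imp_type)
    print(f"\nImportance type: {imp_type}")
    # Show the five highest-scoring features for this importance type
    for feature, score in sorted(scores.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"{feature}: {score}")

Note that the rankings can differ between types: a feature used in many shallow splits may rank highly by 'weight' but lower by 'gain', which averages the loss improvement of the splits that use it.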

By leveraging the get_score() method, you can easily access and utilize the feature importance information programmatically, enabling you to make data-driven decisions and improve your understanding of your XGBoost models.



See Also