The feature_importances_ property on XGBoost models provides a straightforward way to access feature importance scores after training your model.
With this property, you can quickly see which features have the greatest impact on your model’s predictions, with no separate importance calculation required.
In this example, we’ll demonstrate how to use feature_importances_ with a real-world dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
# Load the Breast Cancer Wisconsin dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Train the model
model.fit(X_train, y_train)
# Get predictions on the test set
y_pred = model.predict(X_test)
# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.2f}")
# Access feature importance scores
importance_scores = model.feature_importances_
# Print the scores along with feature names
for feature, score in zip(data.feature_names, importance_scores):
    print(f"{feature}: {score}")
In this example, we load the Breast Cancer Wisconsin dataset and split it into train and test sets.
We then create an instance of XGBClassifier, XGBoost’s scikit-learn-compatible classifier, and train it on the training data. After training, we use the model to make predictions on the test set and calculate the accuracy to confirm the model is performing well.
To access the feature importance scores, we simply use the feature_importances_ property of the trained model. This property returns an array of importance scores, where each score corresponds to a feature in the dataset.
We then print the importance scores along with their corresponding feature names using the feature_names attribute of the loaded dataset.
The scores represent the relative importance of each feature, with higher scores indicating a greater influence on the model’s predictions. What exactly each score measures depends on the model’s importance_type parameter.
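To make "relative" concrete, the sketch below shows how raw per-feature scores could be turned into values like those in feature_importances_. It assumes that the scikit-learn wrapper divides each raw score by the total so that the array sums to 1, and that features the model never split on receive 0. The function name normalize_importances and the raw scores are hypothetical, chosen only for illustration.

```python
def normalize_importances(feature_names, raw_scores):
    """Scale raw per-feature scores so they sum to 1.

    feature_names: all feature names, in column order.
    raw_scores: dict mapping a feature name to its raw score; features
        the model never split on are simply absent from the dict.
    """
    # Features missing from raw_scores get a score of 0.
    values = [float(raw_scores.get(name, 0.0)) for name in feature_names]
    total = sum(values)
    if total == 0:
        # No feature was used at all; avoid dividing by zero.
        return values
    return [value / total for value in values]


# Hypothetical raw scores, for illustration only (not real model output).
names = ["f0", "f1", "f2", "f3"]
raw = {"f0": 6.0, "f1": 3.0, "f3": 1.0}
normalized = normalize_importances(names, raw)
print(normalized)
```

With a trained model, raw scores of this shape should be available from model.get_booster().get_score(importance_type="gain"). When the model is trained on plain NumPy arrays, the keys are typically generic names such as f0 and f1 rather than the dataset's feature names.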
To visualize the feature importances, you can create a bar plot using a library like Matplotlib:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.bar(data.feature_names, importance_scores)
plt.xticks(rotation=90)
plt.xlabel("Features")
plt.ylabel("Importance Score")
plt.title("XGBoost Feature Importances")
plt.show()
This will display a bar plot with the feature names on the x-axis and their corresponding importance scores on the y-axis, providing a clear visual representation of the relative importances.
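With 30 features in this dataset, an unsorted bar plot can be hard to read. One option is to order the features by score first and plot only the top few. The helper below is a minimal sketch of that idea. The name top_features is hypothetical, and the scores are placeholder values, not output from the model above.

```python
def top_features(feature_names, scores, k=10):
    """Return the k highest-scoring (name, score) pairs, best first."""
    pairs = [(str(name), float(score)) for name, score in zip(feature_names, scores)]
    # Sort by score, descending; list.sort is stable, so ties keep input order.
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# Placeholder values, for illustration only.
example_names = ["mean radius", "mean texture", "worst area", "worst concavity"]
example_scores = [0.10, 0.05, 0.60, 0.25]

ranked = top_features(example_names, example_scores, k=3)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In the example above, you would call top_features(data.feature_names, importance_scores) and pass the resulting names and scores to plt.bar, or to plt.barh if horizontal bars suit long feature names better.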
The feature_importances_ property is available on both the XGBClassifier and XGBRegressor classes.
Below is an equivalent example of retrieving feature importances from a regression model:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
import numpy as np
# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBRegressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Train the model
model.fit(X_train, y_train)
# Get predictions on the test set
y_pred = model.predict(X_test)
# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Test RMSE: {rmse:.2f}")
# Access feature importance scores
importance_scores = model.feature_importances_
# Print the scores along with feature names
for feature, score in zip(data.feature_names, importance_scores):
    print(f"{feature}: {score}")
With the feature_importances_ property on XGBoost models, you can access feature importance information without writing any additional code. This information can be valuable for feature selection, model interpretation, and understanding the key drivers of your model’s predictions.
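As one illustration of feature selection, the sketch below keeps only the columns whose importance meets a threshold. The function name select_columns, the threshold value, and the tiny dataset are all hypothetical. A suitable cutoff depends on your data and should be validated, for example by comparing test scores before and after dropping features.

```python
def select_columns(rows, scores, threshold):
    """Keep only the columns whose importance score is >= threshold.

    rows: list of rows, each a list of feature values.
    scores: one importance score per column.
    Returns (kept_indices, reduced_rows).
    """
    kept = [i for i, score in enumerate(scores) if score >= threshold]
    reduced = [[row[i] for i in kept] for row in rows]
    return kept, reduced


# Placeholder data and scores, for illustration only.
rows = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
]
scores = [0.70, 0.05, 0.25]

kept, reduced = select_columns(rows, scores, threshold=0.10)
print(kept)     # column indices that survive the cutoff
print(reduced)  # rows restricted to those columns
```

With NumPy arrays, the same selection can be written as X_train[:, importance_scores >= threshold]. scikit-learn's SelectFromModel offers a similar threshold-based selection for estimators that expose feature_importances_.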