XGBoost for the Covertype Dataset

The Covertype dataset is a well-known benchmark for multiclass classification.

It contains data on forest cover types from four wilderness areas in the Roosevelt National Forest of northern Colorado.

In this example, we’ll load the Covertype dataset using scikit-learn, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions.

from sklearn.datasets import fetch_covtype
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier
import numpy as np
from collections import Counter

# Load the Covertype dataset
covtype = fetch_covtype()
X, y = covtype.data, covtype.target

# Ensure class numbers start at 0
y = y - 1

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Classes: {np.unique(y)}")
print(f"Class Distributions: {Counter(y)}")

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBClassifier; n_jobs=1 avoids oversubscribing CPU cores,
# since GridSearchCV parallelizes across candidates and folds with n_jobs=-1
model = XGBClassifier(objective='multi:softmax', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_covtype.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_covtype.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print accuracy score
accuracy = loaded_model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.3f}")

Running the example, you will see results similar to the following:

Dataset shape: (581012, 54)
Classes: [0 1 2 3 4 5 6]
Class Distributions: Counter({1: 283301, 0: 211840, 2: 35754, 6: 20510, 5: 17367, 4: 9493, 3: 2747})
Best score: 0.821
Best parameters: {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
Accuracy: 0.821

In this example, we first load the Covertype dataset using fetch_covtype() from scikit-learn and print the dataset shape, class labels, and class distribution. Because XGBoost expects class labels to be consecutive integers starting at 0, we shift the original labels (1 through 7) down by one.

Next, we split the data into train and test sets using train_test_split(), stratifying on the class label so that the heavily imbalanced class distribution is preserved in both splits. We then define a parameter grid with common XGBoost hyperparameters: max_depth, learning_rate, n_estimators, subsample, and colsample_bytree.
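Note that this grid contains 3 × 3 × 3 × 2 × 2 = 108 combinations, so the 3-fold grid search in this example performs 324 model fits (plus a final refit) on roughly 465,000 training rows, which can take a long time on modest hardware. If that is too slow, a randomized search over the same grid is a common cheaper alternative; here is a minimal sketch, where the n_iter=20 budget is an arbitrary choice for illustration:

from sklearn.model_selection import RandomizedSearchCV

# Sample 20 random combinations from the same parameter grid
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                                   n_iter=20, cv=3, n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)
print(f"Best score: {random_search.best_score_:.3f}")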

We create an instance of XGBClassifier with the objective set to ‘multi:softmax’ for multiclass classification and perform a grid search using GridSearchCV with 3-fold cross-validation. After fitting the grid search object, we print the best score and corresponding best parameters.
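Beyond the single best combination, GridSearchCV stores the mean cross-validated score of every combination in its cv_results_ attribute. A minimal sketch of ranking them, assuming pandas is available:

import pandas as pd

# Rank all parameter combinations by mean cross-validated accuracy
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())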

We access the best model using best_estimator_ and save it to a file named ‘best_model_covtype.ubj’ using the save_model() method. To demonstrate loading the saved model, we create a new XGBClassifier instance and load the saved model using load_model().
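As an optional sanity check, you can confirm that the loaded model reproduces the original model's predictions exactly:

import numpy as np

# The loaded model should make identical predictions to the saved one
assert np.array_equal(best_model.predict(X_test), loaded_model.predict(X_test))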

Finally, we use the loaded model to make predictions on the test set and print the accuracy score.
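Because the class distribution printed above is heavily imbalanced, overall accuracy can hide weak performance on the rare cover types. A per-class breakdown with scikit-learn's classification_report gives a fuller picture:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, predictions))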

By following this approach, you can apply XGBoost to the Covertype dataset, perform hyperparameter tuning, save the best model, and use it for making predictions on new data.


