XGBoost Evaluate Model using Train-Test Split With Native API


Splitting your data into training and testing sets is a fundamental technique for evaluating a model’s performance on unseen data. scikit-learn provides a convenient function, train_test_split(), for dividing your data into these subsets. You can then wrap each subset in a DMatrix to train and evaluate a model with XGBoost’s native API.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import xgboost as xgb
import numpy as np

# Load the diabetes dataset
X, y = load_diabetes(return_X_y=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix objects for training and testing sets
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Specify the XGBoost parameters
params = {
    'objective': 'reg:squarederror',
    'learning_rate': 0.1,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}

# Train the XGBoost model
num_rounds = 100
model = xgb.train(params, dtrain, num_rounds, evals=[(dtest, 'test')], early_stopping_rounds=10)

# Make predictions on the test set
y_pred = model.predict(dtest)

# Evaluate the model's performance
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(f"Test RMSE: {rmse:.2f}")

Here’s what’s happening:

We load the diabetes dataset and use train_test_split() to split the data into training and testing sets, specifying the test set size (20%) and a random seed for reproducibility.
We create DMatrix objects for the training and testing sets, which is the data structure used by XGBoost’s native API.
We specify the XGBoost parameters in a dictionary, including the objective function, learning rate, max depth, subsample, colsample_bytree, and random seed.
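We train the XGBoost model using xgb.train() on the training set, specifying the parameters, training data, number of boosting rounds, and the evaluation set (test set). We also set early_stopping_rounds=10, which stops training once the evaluation metric (RMSE by default for this objective) has not improved on the evaluation set for 10 consecutive rounds, limiting overfitting.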
We make predictions on the test set using the trained model.
Finally, we evaluate the model’s performance on the test set using the root mean squared error (RMSE) metric.

Using XGBoost’s native API for model training gives you direct access to the library’s built-in features, such as the DMatrix data structure, evaluation sets, and early stopping, for efficient model development and evaluation.

See Also