When working with regression problems, the target variable is often a continuous value.
However, there are cases where you might want to predict integer values instead. This is particularly useful when dealing with count data or discrete outcomes.
XGBoost, although built for continuous targets, can also handle integer prediction by treating it as a standard regression problem and rounding the model's output.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
# Load a regression dataset
X, y = load_diabetes(return_X_y=True)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an XGBRegressor model
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Convert target values to integers
y_train = y_train.astype(int)
y_test = y_test.astype(int)
# Fit the model on the training data
model.fit(X_train, y_train)
# Predict on the test set and round to nearest integer
predictions = model.predict(X_test)
integer_predictions = predictions.round().astype(int)
print("Predicted integers:\n", integer_predictions[:5]) # Print the first 5 samples
In this example, we start by loading a regression dataset using load_diabetes() from scikit-learn and splitting the data into training and test sets. Next, we create an instance of the XGBRegressor model, specifying the number of estimators and the learning rate.
Before fitting the model, we convert the target values (y_train and y_test) to integers using astype(int). Note that this does not change how XGBoost trains: the model still solves an ordinary regression problem and outputs floats. The conversion simply ensures the labels are whole numbers, consistent with the integer outcomes we want to predict.
We then fit the model on the training data using the fit() method.
To generate predictions, we call the predict() method on the test set. Since the model outputs floats, we round the predictions to the nearest integer with round() and cast them to integers with astype(int).
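Because a plain regressor can overshoot below zero, rounding may occasionally yield negative integers, which makes no sense for count data. One way to guard against this is to clip before casting. A small sketch with hypothetical raw outputs:

```python
import numpy as np

# Hypothetical raw model outputs, including a slight negative overshoot
raw_predictions = np.array([2.7, -0.4, 10.2, 0.49])

# Round to the nearest integer, then clip to keep counts non-negative
integer_predictions = np.clip(raw_predictions.round(), 0, None).astype(int)
print(integer_predictions)  # -> [ 3  0 10  0]
```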
It’s important to note that rounding discards the fractional part of each prediction, which introduces quantization error and, for count data, can even produce negative integers when the regressor overshoots below zero. An alternative is XGBoost’s built-in Poisson loss (objective="count:poisson"), which is specifically designed for non-negative count targets; it only requires a different model configuration, not a different library.