When tackling multi-class classification problems with XGBoost, we must properly configure the num_class parameter, especially when using the multi:softmax or multi:softprob objective function.
This parameter specifies the number of classes in your target variable, enabling XGBoost to structure its output accordingly.
All class label integer values must be in [0, num_class), e.g., 0, 1, or 2 for a 3-class problem.
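If your raw labels are strings or non-contiguous integers, remap them into this range first. One common approach is scikit-learn's LabelEncoder; the raw_labels array below is a made-up illustration:
import numpy as np
from sklearn.preprocessing import LabelEncoder
# Hypothetical string labels that violate the [0, num_class) requirement
raw_labels = np.array(['bird', 'cat', 'dog', 'cat', 'bird'])
encoder = LabelEncoder()
# Maps the sorted unique labels to consecutive integers 0, 1, 2
y_encoded = encoder.fit_transform(raw_labels)
print(y_encoded)  # [0 1 2 1 0]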
Here’s an example of setting num_class when using XGBClassifier from the scikit-learn API:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# Generate a synthetic multi-class dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=3, n_redundant=1, random_state=42)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the XGBoost classifier with "multi:softmax" objective and num_class set to 3
model = XGBClassifier(objective='multi:softmax', num_class=3, eval_metric='mlogloss')
# Fit the model
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
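As a quick sanity check on the fit, the predicted labels can be compared against the held-out test labels, for example with scikit-learn's accuracy_score:
from sklearn.metrics import accuracy_score
# Compare predicted class labels against the true test labels
print(f"Accuracy: {accuracy_score(y_test, predictions):.3f}")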
Note that num_class does not appear to be required or used when specified in the XGBClassifier class constructor. For example, setting a value of 0, or a value greater or less than the actual number of classes, does not raise an error or change the output of the model.
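If you want to confirm this behavior on your installed version, one rough check is to fit two classifiers that differ only in a deliberately wrong num_class value and compare their predictions (the values 100 and 3 here are arbitrary):
import numpy as np
# Two models identical except for a deliberately incorrect num_class value
model_a = XGBClassifier(objective='multi:softmax', num_class=100, eval_metric='mlogloss')
model_b = XGBClassifier(objective='multi:softmax', num_class=3, eval_metric='mlogloss')
model_a.fit(X_train, y_train)
model_b.fit(X_train, y_train)
# If num_class were honored, a value of 100 should break or change the model;
# in practice both models produce identical predictions
print(np.array_equal(model_a.predict(X_test), model_b.predict(X_test)))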
The num_class parameter is required when using the native XGBoost API with either the multi:softmax or multi:softprob objective function. Here’s the same example of setting num_class when using the native XGBoost API:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb
# Generate a synthetic multi-class dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=3, n_redundant=1, random_state=42)
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create DMatrix objects for the data
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Set the parameters including "multi:softmax" objective and num_class
params = {
    'objective': 'multi:softmax',
    'num_class': 3,
    'eval_metric': 'mlogloss'
}
# Train the model
model = xgb.train(params, dtrain)
# Make predictions
predictions = model.predict(dtest)
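For comparison, switching the objective to multi:softprob changes what predict returns: instead of a single label per row, you get one probability per class. A minimal variation on the training call above:
# Same data and num_class, but with "multi:softprob" for per-class probabilities
softprob_params = {
    'objective': 'multi:softprob',
    'num_class': 3,
    'eval_metric': 'mlogloss'
}
softprob_model = xgb.train(softprob_params, dtrain)
# predict() now returns an array of shape (n_samples, num_class),
# where each row sums to 1
probabilities = softprob_model.predict(dtest)
print(probabilities.shape)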
The num_class parameter tells XGBoost how many distinct classes the model must score; internally, XGBoost trains one tree per class in each boosting round and produces one output per class. It’s essential to set this value correctly to match the number of unique classes in your target variable. Keep in mind that num_class is not necessary for binary classification tasks and is only required for multi-class problems.
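For instance, a binary problem uses a binary objective such as binary:logistic and omits num_class entirely; a minimal sketch:
from sklearn.datasets import make_classification
import xgboost as xgb
# Binary labels (0 and 1) require no num_class parameter
X_bin, y_bin = make_classification(n_samples=100, n_classes=2, random_state=42)
dtrain_bin = xgb.DMatrix(X_bin, label=y_bin)
binary_params = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}
binary_model = xgb.train(binary_params, dtrain_bin)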