
Random Forest for Classification With XGBoost

Random forest is an ensemble learning method that constructs many decision trees on randomized subsets of the data and combines their predictions, typically by majority vote or probability averaging, to improve classification accuracy and reduce overfitting.

XGBoost’s XGBRFClassifier class implements the random forest algorithm for binary and multi-class classification tasks, leveraging the power and efficiency of the XGBoost library.
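For example, because the class follows the scikit-learn estimator interface, switching to a multi-class problem requires no extra configuration. Here is a minimal, illustrative sketch (the three-class dataset below is not part of the main example):

from sklearn.datasets import make_classification
import xgboost as xgb

# Illustrative three-class dataset (not the dataset used in the main example)
X, y = make_classification(n_samples=500, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# XGBRFClassifier infers the number of classes from the labels in y
clf = xgb.XGBRFClassifier(n_estimators=50, random_state=42)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted class labels for the first five rows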

This example demonstrates how to fit a random forest classifier using XGBRFClassifier on a synthetic binary classification dataset. We’ll generate the dataset, split it into train and test sets, define the model parameters, train the classifier, and evaluate its performance.

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost as xgb

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10, n_informative=5, random_state=42)

# Split data into train and test sets
train_size = int(0.8 * len(X))
X_train, y_train = X[:train_size], y[:train_size]
X_test, y_test = X[train_size:], y[train_size:]

# Define XGBRFClassifier parameters
params = {
    'n_estimators': 100,
    'subsample': 0.8,
    'colsample_bynode': 0.8,
    'max_depth': 3,
    'random_state': 42
}

# Instantiate XGBRFClassifier with the defined parameters
model = xgb.XGBRFClassifier(**params)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"Confusion Matrix:\n{confusion}")

In this example, we start by generating a synthetic binary classification dataset using sklearn.datasets.make_classification(). We then split the data into training and test sets.
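The split above is a simple positional slice, which is safe here because make_classification shuffles its output by default. An equivalent and more common approach is scikit-learn's train_test_split, sketched below as an optional alternative:

from sklearn.model_selection import train_test_split

# Shuffled 80/20 split; stratify=y preserves the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)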

Next, we define the XGBRFClassifier parameters in a dictionary. The 'n_estimators' parameter sets the number of trees in the forest, while 'subsample' and 'colsample_bynode' introduce randomness by sampling training rows for each tree and features at each tree node, respectively. The 'max_depth' parameter limits the depth of each tree.
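Note that recent XGBoost versions already default XGBRFClassifier to subsample=0.8 and colsample_bynode=0.8, so the dictionary above mostly makes those choices explicit. If in doubt, you can inspect the effective settings with the standard scikit-learn get_params() method; a quick, illustrative check:

# Print the full set of hyperparameters the model will use
print(xgb.XGBRFClassifier(**params).get_params())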

We create an instance of the XGBRFClassifier with the defined parameters and train the model using the fit() method on the training data. After training, we make predictions on the test set using the predict() method.
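If you need class probabilities instead of hard labels, for example to apply a custom decision threshold, XGBRFClassifier also exposes the scikit-learn predict_proba() method; a short sketch:

# Probability of each class for the first five test rows
proba = model.predict_proba(X_test[:5])
print(proba)  # shape (5, 2): columns are P(class 0) and P(class 1)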

Finally, we evaluate the model’s performance using the accuracy score and confusion matrix from sklearn.metrics. These metrics provide insights into the model’s effectiveness in classifying the binary targets.
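For a more detailed per-class view (precision, recall, and F1), sklearn.metrics.classification_report can be printed alongside these metrics; a small sketch:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support on the test set
print(classification_report(y_test, y_pred))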

By following this example, you can quickly fit an XGBoost random forest classifier using the XGBRFClassifier class, while controlling the model’s hyperparameters and evaluating its performance on a binary classification task.
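Since XGBRFClassifier is a drop-in scikit-learn estimator, it also works with sklearn's model-selection utilities. The sketch below uses 5-fold cross-validation on the full dataset as an illustrative robustness check, reusing the params dictionary from the example:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for the same model configuration
scores = cross_val_score(xgb.XGBRFClassifier(**params), X, y, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")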


