
Removing Outliers from Training Data For XGBoost

XGBoost is generally robust to outliers.

Nevertheless, outliers in the training data can still degrade the performance and generalization of machine learning models, including those built with XGBoost.

These anomalous data points, which deviate substantially from the majority of the data, can skew the model’s learned parameters and lead to suboptimal results. Removing outliers from the training data is a simple yet effective technique to mitigate this issue and improve XGBoost model performance.

This example demonstrates how to identify and remove outliers from a dataset using the Isolation Forest algorithm from the scikit-learn library in Python, followed by training an XGBoost model on the cleaned data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Generate a synthetic dataset with outliers
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, n_redundant=0,
                          n_repeated=0, n_classes=2, n_clusters_per_class=1,
                          class_sep=1.0, random_state=42)

# Add outliers to the dataset
np.random.seed(42)
outlier_indices = np.random.choice(len(X), size=50, replace=False)
X[outlier_indices] += np.random.normal(loc=0, scale=5, size=(50, 10))

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify and remove outliers using Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
inlier_mask = iso_forest.fit_predict(X_train) == 1  # fit_predict returns 1 for inliers, -1 for outliers
X_train, y_train = X_train[inlier_mask], y_train[inlier_mask]

# Train an XGBoost model on the cleaned data
model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model's performance on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.4f}")

The code snippet above first prepares a synthetic dataset with outliers and splits it into train and test sets. The Isolation Forest algorithm is then applied to the training data to identify outliers. The contamination parameter is set to 0.05, meaning that approximately 5% of the data points will be classified as outliers. The outliers are removed from the training data using boolean indexing.
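As a side note, fit_predict labels inliers with 1 and outliers with -1, so comparing against 1 yields a boolean mask of rows to keep. The short sketch below (illustrative only, not part of the original example) restates that step and reports how many rows would be dropped; it assumes it is run before X_train is overwritten with the filtered data:

labels = iso_forest.fit_predict(X_train)   # 1 = inlier, -1 = outlier
inlier_mask = labels == 1                  # True for rows to keep
print(f"Dropping {(labels == -1).sum()} of {len(labels)} training rows as outliers")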

An XGBoost classifier is then instantiated and trained on the cleaned training data. Finally, the model’s performance is evaluated on the test set using the accuracy metric.
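To check whether removing outliers actually helps on a given dataset, one option is to also fit a baseline model on the uncleaned training split and compare test accuracy. Below is a minimal sketch of that comparison; the variables X_train_raw and y_train_raw are hypothetical copies of the split saved before outlier removal and do not appear in the original snippet:

# Hypothetical baseline trained on the uncleaned split
# (requires saving X_train_raw, y_train_raw before filtering)
baseline = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
baseline.fit(X_train_raw, y_train_raw)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"Accuracy with outliers kept:    {baseline_acc:.4f}")
print(f"Accuracy with outliers removed: {accuracy:.4f}")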

The Isolation Forest algorithm works by randomly selecting a feature and a split value between the minimum and maximum values of the selected feature. This process is recursively repeated to build a tree-like structure where anomalous data points are more likely to be isolated in shorter branches. The algorithm constructs an ensemble of such isolation trees and assigns an anomaly score to each data point based on the average path length required to isolate it across all trees.
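scikit-learn exposes these scores directly: decision_function returns a shifted score where negative values indicate likely outliers, and score_samples returns the underlying (negated) anomaly score. A minimal sketch, reusing the iso_forest fitted above and scoring the test set purely for illustration:

scores = iso_forest.decision_function(X_test)   # negative values indicate likely outliers
print(f"Fraction of test rows scored as anomalous: {(scores < 0).mean():.2%}")
print(f"Score range: {scores.min():.3f} to {scores.max():.3f}")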

While the Isolation Forest is a popular choice for outlier detection, other algorithms like Local Outlier Factor (LOF) and One-Class SVM can also be used depending on the dataset and problem at hand. LOF is density-based and can handle non-globular outliers, while One-Class SVM is particularly useful when the majority of the data belongs to a single class.
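As an illustrative sketch (not part of the original example), either detector can be swapped in for the Isolation Forest step, since both follow scikit-learn's convention of labelling inliers as 1 and outliers as -1; the masks below would be applied to the raw, unfiltered training split in the same way as before:

from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Local Outlier Factor: density-based, flags points whose local density is unusually low
lof_mask = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X_train) == 1

# One-Class SVM: nu roughly bounds the fraction of points treated as outliers
ocsvm_mask = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit_predict(X_train) == 1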

It is crucial to note that outliers should only be removed from the training data and not from the test set or real-world data on which the model will be applied. The test set should represent the original data distribution to ensure a realistic evaluation of the model’s performance.

By removing outliers from the training data, XGBoost models can learn more robust and generalizable patterns, which can improve performance on unseen data. This simple preprocessing step is worth evaluating as part of the standard data preparation pipeline for XGBoost and other machine learning algorithms.


