Outliers in training data can negatively impact the performance and generalization of XGBoost models.
These anomalous data points, which significantly deviate from the majority of the data, can skew the model’s learned parameters and lead to suboptimal results.
The Isolation Forest algorithm is an unsupervised outlier-detection method that isolates anomalies through random recursive partitioning of the feature space: points that can be separated from the rest of the data in only a few splits (short average path lengths) are scored as anomalous.
This example demonstrates how to use the Isolation Forest algorithm to detect and remove outliers from a dataset, followed by training two XGBoost models—one on the original data (with outliers) and another on the cleaned data (outliers removed).
By comparing the performance of these models, we can observe the impact of outliers on the model’s accuracy and generalization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import IsolationForest
import numpy as np
import xgboost as xgb
# Generate synthetic dataset with outliers
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42)
np.random.seed(42)  # fix the seed so the injected outliers are reproducible
outlier_indices = np.random.choice(len(X), size=50, replace=False)
X[outlier_indices] += np.random.normal(loc=0, scale=5, size=(50, 10))
# Use Isolation Forest to identify and remove outliers
iso_forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
outlier_labels = iso_forest.fit_predict(X)
outlier_mask = outlier_labels != -1
X_cleaned, y_cleaned = X[outlier_mask], y[outlier_mask]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)
# Train XGBoost models on original and cleaned data
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)
model_cleaned = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_cleaned.fit(X_train_cleaned, y_train_cleaned)
# Evaluate models on test sets
y_pred_original = model_original.predict(X_test)
y_pred_cleaned = model_cleaned.predict(X_test_cleaned)
accuracy_original = accuracy_score(y_test, y_pred_original)
accuracy_cleaned = accuracy_score(y_test_cleaned, y_pred_cleaned)
print(f"Test accuracy (with outliers): {accuracy_original:.4f}")
print(f"Test accuracy (outliers removed): {accuracy_cleaned:.4f}")
The code snippet first generates a synthetic dataset with scikit-learn’s make_classification function and injects outliers by adding large-scale Gaussian noise (scale=5) to 50 randomly chosen rows. An Isolation Forest is then instantiated and fitted to the dataset; it labels outliers as -1 and inliers as 1, and the flagged rows are dropped with a boolean mask.
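If the hard -1/1 labels from fit_predict are too coarse, the fitted Isolation Forest also exposes continuous anomaly scores. The sketch below reuses the iso_forest and X variables from the example above and shows how decision_function scores can be inspected and thresholded manually; the 5th-percentile cutoff is only an illustrative assumption chosen to mirror contamination=0.05, not a required value.
# Continuous anomaly scores: lower (more negative) means more anomalous
scores = iso_forest.decision_function(X)
print(f"Most anomalous score: {scores.min():.4f}, least anomalous: {scores.max():.4f}")
# predict() flags points with negative scores as outliers by default;
# a manual percentile cutoff (here 5%, a hypothetical choice) gives finer control
threshold = np.percentile(scores, 5)
manual_mask = scores > threshold
X_manual, y_manual = X[manual_mask], y[manual_mask]
print(f"Points kept with manual threshold: {manual_mask.sum()} of {len(X)}")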
Next, the original dataset (with outliers) and the cleaned dataset (outliers removed) are split into train and test sets. Two XGBoost classifiers are instantiated and trained on the respective training sets. Finally, the models’ performance is evaluated on the corresponding test sets using the accuracy metric, and the results are printed for comparison.
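One caveat with this comparison is that the two models are scored on different test sets, since the cleaned test set is drawn from the already-filtered data. If you want both models judged on identical data, a minimal sketch like the following (assuming the variables and imports from the example above are in scope) splits the original data first, removes outliers only from the training portion, and evaluates both models on the same untouched test set.
# Hold out a common test set from the original (outlier-containing) data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit Isolation Forest on the training portion only and drop its outliers
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
train_mask = iso.fit_predict(X_tr) != -1
model_a = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_a.fit(X_tr, y_tr)
model_b = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_b.fit(X_tr[train_mask], y_tr[train_mask])
# Both models are scored on the identical, untouched test set
print(f"Same test set, trained with outliers:    {accuracy_score(y_te, model_a.predict(X_te)):.4f}")
print(f"Same test set, trained without outliers: {accuracy_score(y_te, model_b.predict(X_te)):.4f}")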
By removing outliers from the training data using the Isolation Forest algorithm, the XGBoost model can learn more robust and generalizable patterns, potentially leading to improved performance on unseen data. This unsupervised approach to outlier detection and removal offers an alternative to statistical methods such as the Z-score, which rely on assumptions about the data distribution. However, the impact of outliers on model performance varies with the dataset and the problem at hand, so it is essential to consider the nature of the outliers and the requirements of the application before deciding on an outlier removal strategy.
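For reference, a Z-score based filter is the kind of assumption-heavier alternative mentioned above. The short sketch below (reusing X and y from the example) drops any row whose value on some feature lies more than 3 standard deviations from that feature's mean; the threshold of 3 is a common convention rather than a fixed rule, and the method implicitly assumes roughly Gaussian features.
# Z-score based outlier removal for comparison (assumes roughly Gaussian features)
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
# Keep a row only if every feature is within 3 standard deviations of its mean
z_mask = (z_scores < 3).all(axis=1)
X_z, y_z = X[z_mask], y[z_mask]
print(f"Points kept by Z-score filter: {z_mask.sum()} of {len(X)}")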