Outliers in training data can significantly impact the performance of XGBoost models, leading to suboptimal results and poor generalization.
The Elliptic Envelope method is a robust technique for detecting outliers based on the assumption that the majority of the data follows an elliptical distribution.
This example demonstrates how to use the Elliptic Envelope method to identify and remove outliers from the training data, then trains two XGBoost models: one on the original training set (with outliers) and another on the cleaned training set (outliers removed).
Evaluating both models on the same held-out test set lets us observe the impact of training-set outliers on the model's accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.covariance import EllipticEnvelope
import xgboost as xgb
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, n_redundant=2, random_state=42)
# Add outliers to the dataset (seeded so the run is reproducible)
np.random.seed(42)
outlier_indices = np.random.choice(len(X), size=50, replace=False)
X[outlier_indices] += np.random.normal(loc=0, scale=5, size=(50, 10))
# Split data into train and test sets; both models will share the same test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Identify outliers in the training set using the Elliptic Envelope method
# (fitting on the training set only avoids leaking test data into the detector)
ee = EllipticEnvelope(contamination=0.05, random_state=42)
outlier_labels = ee.fit_predict(X_train)
# Remove training points flagged as outliers (fit_predict returns -1 for outliers, 1 for inliers)
inlier_mask = outlier_labels != -1
X_train_cleaned, y_train_cleaned = X_train[inlier_mask], y_train[inlier_mask]
# Train XGBoost models on the original and cleaned data
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)
model_cleaned = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_cleaned.fit(X_train_cleaned, y_train_cleaned)
# Evaluate both models on the same held-out test set
y_pred_original = model_original.predict(X_test)
y_pred_cleaned = model_cleaned.predict(X_test)
accuracy_original = accuracy_score(y_test, y_pred_original)
accuracy_cleaned = accuracy_score(y_test, y_pred_cleaned)
print(f"Test accuracy (with outliers): {accuracy_original:.4f}")
print(f"Test accuracy (outliers removed): {accuracy_cleaned:.4f}")
The code snippet first generates a synthetic dataset using scikit-learn's make_classification function and injects outliers by adding normally distributed noise with a large scale parameter to a random subset of points. After the data is split into train and test sets, the Elliptic Envelope method is fit on the training set to flag points that fall outside the estimated elliptical distribution. Note that fit_predict returns labels rather than scores: inliers are labeled 1 and outliers -1, and points labeled -1 are removed from the training data only, so the detector never sees the test set.
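If a hard keep-or-drop rule feels too coarse, the fitted envelope also exposes continuous scores that can be inspected before discarding anything. Below is a minimal sketch, reusing the ee estimator and training split from the listing above; decision_function and mahalanobis are standard methods on the fitted scikit-learn estimator.
# Continuous scores: decision_function is negative outside the envelope, positive inside
scores = ee.decision_function(X_train)
# Indices of the ten most extreme training points (lowest scores)
most_extreme = np.argsort(scores)[:10]
print("Most extreme training points:", most_extreme)
print("Their scores:", scores[most_extreme])
# Squared Mahalanobis distances under the robust covariance estimate
print("Mean squared Mahalanobis distance:", ee.mahalanobis(X_train).mean())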
Next, two XGBoost classifiers are instantiated and trained: one on the original training set (with outliers) and one on the cleaned training set (outliers removed). Both models are then evaluated on the same held-out test set using the accuracy metric, so the comparison isolates the effect of cleaning the training data, and the results are printed for comparison.
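Before reading too much into the accuracy difference, it can help to confirm what the cleaning step actually removed. A short check, again assuming the variables defined in the listing above:
# How many training points were dropped, and from which class?
n_removed = len(X_train) - len(X_train_cleaned)
print(f"Removed {n_removed} of {len(X_train)} training points")
print("Class counts before:", np.bincount(y_train))
print("Class counts after: ", np.bincount(y_train_cleaned))
If removal is heavily skewed toward one class, an accuracy change may reflect the shifted class balance rather than cleaner decision boundaries.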
By removing outliers from the training data using the Elliptic Envelope method, the XGBoost model can learn more robust and generalizable patterns, potentially leading to improved performance on unseen data. However, the impact of outliers on model performance may vary depending on the dataset and the problem at hand. It is essential to carefully consider the nature of the outliers and the specific requirements of the application before deciding on an outlier detection and removal strategy.
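If the elliptical-distribution assumption is a poor fit for a given dataset, the same workflow accepts other scikit-learn detectors without further changes. Here is a sketch swapping in IsolationForest, which makes no distributional assumption, using the same variables as the listing above:
from sklearn.ensemble import IsolationForest
# Drop-in replacement for EllipticEnvelope in the workflow above
iso = IsolationForest(contamination=0.05, random_state=42)
inlier_mask_iso = iso.fit_predict(X_train) != -1
X_train_iso, y_train_iso = X_train[inlier_mask_iso], y_train[inlier_mask_iso]
model_iso = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_iso.fit(X_train_iso, y_train_iso)
y_pred_iso = model_iso.predict(X_test)
print(f"Test accuracy (IsolationForest cleaning): {accuracy_score(y_test, y_pred_iso):.4f}")
Because both detectors plug into the same mask-and-retrain pattern, it is cheap to compare several of them on a validation set before committing to one.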