XGboost Remove Outliers With Local Outlier Factor

Outliers in training data can negatively impact the performance and generalization of XGBoost models.

These anomalous data points, which significantly deviate from the majority of the data, can skew the model’s learned parameters and lead to suboptimal results.

The Local Outlier Factor (LOF) algorithm is an effective method for detecting outliers based on the density of each data point’s neighborhood.

This example demonstrates how to use LOF to identify and remove outliers from a dataset, followed by training two XGBoost models—one on the original data (with outliers) and another on the cleaned data (outliers removed). By comparing the performance of these models, we can observe the impact of outliers on the model’s accuracy and generalization.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import LocalOutlierFactor
import xgboost as xgb

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, n_redundant=2, random_state=42)

# Add outliers to the dataset
outlier_indices = np.random.choice(len(X), size=50, replace=False)
X[outlier_indices] += np.random.normal(loc=0, scale=5, size=(50, 10))

# Identify outliers using the Local Outlier Factor (LOF) algorithm
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
outlier_scores = lof.fit_predict(X)

# Remove data points identified as outliers
outlier_mask = outlier_scores != -1
X_cleaned, y_cleaned = X[outlier_mask], y[outlier_mask]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)

# Train XGBoost models on the original and cleaned data
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)

model_cleaned = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_cleaned.fit(X_train_cleaned, y_train_cleaned)

# Evaluate the models' performance on the test set
y_pred_original = model_original.predict(X_test)
y_pred_cleaned = model_cleaned.predict(X_test_cleaned)

accuracy_original = accuracy_score(y_test, y_pred_original)
accuracy_cleaned = accuracy_score(y_test_cleaned, y_pred_cleaned)

print(f"Test accuracy (with outliers): {accuracy_original:.4f}")
print(f"Test accuracy (outliers removed): {accuracy_cleaned:.4f}")

The code snippet first generates a synthetic dataset using scikit-learn’s make_classification function and adds outliers to the dataset by sampling from a normal distribution with a larger scale parameter. The Local Outlier Factor (LOF) algorithm is then used to identify outliers based on the density of each data point’s neighborhood. Data points with LOF scores of -1 are considered outliers and removed from the dataset.

Next, the original dataset (with outliers) and the cleaned dataset (outliers removed) are split into train and test sets. Two XGBoost classifiers are instantiated and trained on the respective training sets. Finally, the models’ performance is evaluated on the corresponding test sets using the accuracy metric, and the results are printed for comparison.

By removing outliers from the training data using the LOF algorithm, the XGBoost model can learn more robust and generalizable patterns, potentially leading to improved performance on unseen data. However, the impact of outliers on model performance may vary depending on the dataset and the problem at hand. It is essential to carefully consider the nature of the outliers and the specific requirements of the application before deciding on an outlier detection and removal strategy.

See Also