XGBoost Remove Outliers With One-Class SVM

Outliers in training data can negatively impact the performance and generalization of XGBoost models. These anomalous data points, which significantly deviate from the majority of the data, can skew the model’s learned parameters and lead to suboptimal results.

One-Class SVM is an effective method for detecting outliers based on the concept of support vectors. This algorithm learns a decision boundary that encompasses the majority of the data points, while identifying points that fall outside this boundary as potential outliers.
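As a minimal sketch of that idea (with arbitrary toy values), the snippet below fits a One-Class SVM to a tight two-dimensional cluster and checks how it labels a point far outside it:

import numpy as np
from sklearn.svm import OneClassSVM

# A tight cluster of inliers plus one point far away
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(loc=0, scale=0.5, size=(100, 2)),
                   [[8.0, 8.0]]])  # obvious outlier

# nu upper-bounds the fraction of training points treated as outliers
clf = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.5)
labels = clf.fit_predict(X_toy)  # +1 inside the boundary, -1 outside

print(labels[-1])  # expected: -1, the distant point falls outside the boundary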

This example demonstrates how to use One-Class SVM to identify and remove outliers from a dataset, followed by training two XGBoost models—one on the original data (with outliers) and another on the cleaned data (outliers removed). By comparing the performance of these models, we can observe the impact of outliers on the model’s accuracy and generalization.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import OneClassSVM
import xgboost as xgb

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8, n_redundant=2, random_state=42)

# Add outliers to the dataset (seeded for reproducibility)
np.random.seed(42)
outlier_indices = np.random.choice(len(X), size=50, replace=False)
X[outlier_indices] += np.random.normal(loc=0, scale=5, size=(50, 10))

# Identify outliers using One-Class SVM (fit_predict returns +1 for inliers, -1 for outliers)
svm = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1)
outlier_labels = svm.fit_predict(X)

# Remove data points identified as outliers
inlier_mask = outlier_labels != -1
X_cleaned, y_cleaned = X[inlier_mask], y[inlier_mask]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)

# Train XGBoost models on the original and cleaned data
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)

model_cleaned = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_cleaned.fit(X_train_cleaned, y_train_cleaned)

# Evaluate each model on its own test set
y_pred_original = model_original.predict(X_test)
y_pred_cleaned = model_cleaned.predict(X_test_cleaned)

accuracy_original = accuracy_score(y_test, y_pred_original)
accuracy_cleaned = accuracy_score(y_test_cleaned, y_pred_cleaned)

print(f"Test accuracy (with outliers): {accuracy_original:.4f}")
print(f"Test accuracy (outliers removed): {accuracy_cleaned:.4f}")

The code snippet first generates a synthetic dataset using scikit-learn’s make_classification function and injects outliers by adding noise drawn from a normal distribution with a large scale parameter to 50 randomly chosen rows. One-Class SVM then learns a boundary around the bulk of the data: fit_predict returns +1 for points inside the boundary and -1 for points outside it, and the points labeled -1 are removed from the dataset.
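If hard +1/-1 labels are too coarse, the fitted OneClassSVM also exposes continuous anomaly scores through its decision_function method, where negative values lie outside the learned boundary; these can be used to rank points by how anomalous they are rather than dropping them outright. A short sketch, reusing svm and X from the example above:

# Signed distance to the learned boundary: negative values are outside it
scores = svm.decision_function(X)

# Indices of the five most anomalous points (lowest scores)
most_anomalous = np.argsort(scores)[:5]
print(most_anomalous, scores[most_anomalous])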

Next, the original dataset (with outliers) and the cleaned dataset (outliers removed) are each split into train and test sets. Two XGBoost classifiers are instantiated and trained on their respective training sets. Finally, each model is evaluated on its own test set using the accuracy metric, and the results are printed for comparison. Because the two test sets are not identical, the accuracies are indicative rather than strictly comparable.
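For a stricter comparison, one option (a sketch under the same settings as above) is to split first, fit the One-Class SVM on the training rows only, and score both models on the same untouched test set:

# Split first so both models share an identical test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the outlier detector on the training rows only (no test leakage)
train_mask = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1).fit_predict(X_train) != -1

# Train on the cleaned training rows, evaluate on the shared test set
model_cleaned = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_cleaned.fit(X_train[train_mask], y_train[train_mask])
print(f"Shared-test accuracy (outliers removed): {accuracy_score(y_test, model_cleaned.predict(X_test)):.4f}")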

By removing outliers from the training data using the One-Class SVM algorithm, the XGBoost model can learn more robust and generalizable patterns, potentially leading to improved performance on unseen data. However, the impact of outliers on model performance may vary depending on the dataset and the problem at hand. It is essential to carefully consider the nature of the outliers and the specific requirements of the application before deciding on an outlier detection and removal strategy.
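As one way to probe that sensitivity, the sketch below (with arbitrary candidate values) sweeps the nu parameter, which upper-bounds the fraction of training points treated as outliers, and reports a cross-validated accuracy at each cleaning level; note that each accuracy is measured on the cleaned subset itself, so this is a rough guide rather than a strict comparison:

from sklearn.model_selection import cross_val_score

# Sweep the contamination level and measure downstream XGBoost accuracy
for nu in (0.01, 0.05, 0.10, 0.20):
    mask = OneClassSVM(nu=nu, kernel="rbf", gamma=0.1).fit_predict(X) != -1
    model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
    scores = cross_val_score(model, X[mask], y[mask], cv=5, scoring="accuracy")
    print(f"nu={nu:.2f}: kept {mask.sum()} rows, CV accuracy {scores.mean():.4f}")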
