XGBoost Remove Outliers With Z-Score Statistical Method

Outliers in training data can negatively impact the performance and generalization of XGBoost models.

These anomalous data points, which significantly deviate from the majority of the data, can skew the model’s learned parameters and lead to suboptimal results. One common approach to identify and remove outliers is using the Z-score method, which measures how many standard deviations a data point is from the mean.
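
As a quick illustration of the definition, the Z-score of a value is `(value - mean) / standard deviation`. The sketch below applies it to a small made-up array (the numbers are illustrative assumptions, not part of the dataset used later) and relies on NumPy's default population standard deviation.

```python
import numpy as np

# Illustrative values only; they are not taken from the dataset used below.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()  # 5.0
std = values.std()    # 2.0 (population standard deviation, NumPy's default)

# Z-score: distance from the mean, measured in standard deviations
z_scores = (values - mean) / std
print(z_scores)  # -1.5, -0.5, -0.5, -0.5, 0, 0, 1, 2
```

The largest value, 9.0, sits two standard deviations above the mean, so it would be kept under a threshold of 3 but flagged under a stricter threshold of 2.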

This example demonstrates how to detect and remove outliers from a dataset using the Z-score method, then trains two XGBoost models: one on the original data (with outliers) and another on the cleaned data (outliers removed).

By comparing the performance of these models, we can observe the impact of outliers on the model’s accuracy and generalization.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Generate a synthetic dataset with outliers
np.random.seed(42)
X = np.random.normal(loc=0, scale=1, size=(1000, 10))
y = np.random.binomial(n=1, p=0.5, size=1000)

# Add outliers to the dataset
outlier_indices = np.random.choice(len(X), size=50, replace=False)
X[outlier_indices] += np.random.normal(loc=0, scale=5, size=(50, 10))

# Calculate the absolute Z-score of every value, using per-feature (column) statistics
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))

# Keep only the rows in which every feature has a Z-score below the threshold (e.g., 3)
threshold = 3
inlier_mask = (z_scores < threshold).all(axis=1)
X_cleaned, y_cleaned = X[inlier_mask], y[inlier_mask]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)

# Train XGBoost models on the original and cleaned data
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)

model_cleaned = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_cleaned.fit(X_train_cleaned, y_train_cleaned)

# Evaluate the models' performance on the test set
y_pred_original = model_original.predict(X_test)
y_pred_cleaned = model_cleaned.predict(X_test_cleaned)

accuracy_original = accuracy_score(y_test, y_pred_original)
accuracy_cleaned = accuracy_score(y_test_cleaned, y_pred_cleaned)

print(f"Test accuracy (with outliers): {accuracy_original:.4f}")
print(f"Test accuracy (outliers removed): {accuracy_cleaned:.4f}")

The code snippet first generates a synthetic dataset using NumPy’s random number generators, then turns 50 randomly chosen rows into outliers by adding noise drawn from a normal distribution with a larger scale (5 instead of 1). Absolute Z-scores are then calculated for every value using the mean and standard deviation of its feature column. Any row with at least one Z-score at or above the specified threshold (in this case, 3) is considered an outlier and removed from the dataset.
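
The masking step described above can be wrapped in a small reusable helper. This is a minimal sketch: the function name `zscore_inlier_mask` and the zero-variance guard are assumptions added for illustration, not part of the original example.

```python
import numpy as np


def zscore_inlier_mask(X, threshold=3.0):
    """Return a boolean mask that is True for rows in which every
    feature has an absolute Z-score below `threshold`."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    safe_std = np.where(std == 0, 1.0, std)
    z_scores = np.abs((X - mean) / safe_std)
    return (z_scores < threshold).all(axis=1)


# Synthetic data in the spirit of the example: the first 50 rows are inflated
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[:50] += rng.normal(scale=5, size=(50, 10))

mask = zscore_inlier_mask(X, threshold=3.0)
print(f"Rows kept: {mask.sum()} of {len(X)}")
print(f"Inflated rows kept: {mask[:50].sum()} of 50")
```

Because a row is dropped as soon as any single feature crosses the threshold, the share of removed rows tends to grow with the number of features.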

Next, the original dataset (with outliers) and the cleaned dataset (outliers removed) are each split into train and test sets. Two XGBoost classifiers with identical hyperparameters are instantiated and trained on the respective training sets. Finally, each model is evaluated on its own test set using the accuracy metric, and the results are printed for comparison. Because the two test sets contain different rows, the scores are indicative rather than strictly comparable.
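
One way to make the comparison stricter is to hold a single test set fixed and compute the Z-score statistics from the training rows only. The sketch below is an assumption-laden variant of the workflow, not the original example's method: `compare_on_shared_test` and `make_model` are hypothetical names, and `make_model` is any zero-argument callable that returns an unfitted estimator with `fit` and `predict`.

```python
import numpy as np


def compare_on_shared_test(make_model, X_train, y_train, X_test, y_test, threshold=3.0):
    """Fit one model on all training rows and one on Z-score-filtered rows,
    then score both on the same untouched test set."""
    # Statistics come from the training rows only, so the test set
    # plays no part in deciding which rows count as outliers.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero
    inlier_mask = (np.abs((X_train - mean) / std) < threshold).all(axis=1)

    model_original = make_model()
    model_original.fit(X_train, y_train)

    model_cleaned = make_model()
    model_cleaned.fit(X_train[inlier_mask], y_train[inlier_mask])

    return {
        "with_outliers": float(np.mean(model_original.predict(X_test) == y_test)),
        "outliers_removed": float(np.mean(model_cleaned.predict(X_test) == y_test)),
        "rows_removed": int((~inlier_mask).sum()),
    }


# Example usage with the classifier settings from this article
# (assumes xgboost is installed and the data has already been split):
# import xgboost as xgb
# scores = compare_on_shared_test(
#     lambda: xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
#     X_train, y_train, X_test, y_test,
# )
```

With this layout the test set still contains any outliers it happened to receive, which is usually closer to what a deployed model would face.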

Removing outliers from the training data can help an XGBoost model learn more robust and generalizable patterns, potentially leading to improved performance on unseen data. However, the impact of outliers on model performance varies with the dataset and the problem at hand. In this synthetic example the labels are generated independently of the features, so both accuracies should land near 0.5; the example illustrates the workflow rather than a real performance gain. It is essential to consider the nature of the outliers and the specific requirements of the application before deciding to remove them.
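
Before committing to a threshold, it can help to see how much data each candidate value would discard. The following sketch rebuilds a synthetic dataset similar to the one above (the generator seed and the list of thresholds are arbitrary assumptions) and counts the flagged rows.

```python
import numpy as np

# Synthetic data similar to the example: 1000 rows, 50 of them inflated
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
outlier_rows = rng.choice(len(X), size=50, replace=False)
X[outlier_rows] += rng.normal(scale=5, size=(50, 10))

z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))

# Count the rows that each candidate threshold would remove
removed_by_threshold = {}
for threshold in (2.0, 2.5, 3.0, 3.5):
    flagged = (z_scores >= threshold).any(axis=1)
    removed_by_threshold[threshold] = int(flagged.sum())
    print(f"threshold={threshold}: {flagged.sum()} rows flagged "
          f"({flagged.mean():.1%} of the data)")

# Number of flagged values in each feature at the threshold used above
per_feature = (z_scores >= 3.0).sum(axis=0)
print("Flagged values per feature at threshold 3:", per_feature)
```

A lower threshold always flags at least as many rows as a higher one, so the counts give a quick sense of how aggressive each setting would be on a given dataset.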
