XGBoost Remove Outliers With IQR Statistical Method

Outliers in training data can negatively impact the performance and generalization of XGBoost models.

These anomalous data points, which significantly deviate from the majority of the data, can skew the model’s learned parameters and lead to suboptimal results.

One common approach to identifying and removing outliers is the interquartile range (IQR) method. The IQR, defined as Q3 - Q1 where Q1 and Q3 are the first and third quartiles, measures the spread of the middle half of the data. Points falling below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers.
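To make the rule concrete, here is a minimal worked example on a single feature. The sample values are hypothetical and chosen only so the arithmetic is easy to follow: Q1 is 3 and Q3 is 7, so the IQR is 4 and the fences sit at -3 and 13, which flags only the value 40.

```python
import numpy as np

# Hypothetical 1-D sample; 40.0 is the obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 40.0])

# First and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (values < lower_fence) | (values > upper_fence)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence})")
print("outliers:", values[is_outlier])
```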

This example demonstrates how to detect and remove outliers from a dataset using the IQR method, then trains two XGBoost models: one on the original data (with outliers) and another on the cleaned data (outliers removed). Comparing the two models shows how outliers affect accuracy and generalization.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

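# Generate a synthetic dataset with outliers
# Labels depend on the first two features so there is a real signal to learn;
# labels drawn independently of X would leave both models at chance accuracy
np.random.seed(42)
X = np.random.normal(loc=0, scale=1, size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)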

# Add outliers to the dataset
outlier_indices = np.random.choice(len(X), size=50, replace=False)
X[outlier_indices] += np.random.normal(loc=0, scale=5, size=(50, 10))

# Calculate Q1, Q3, and IQR for each feature
Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1

# Keep only rows whose every feature lies within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
inlier_mask = ((X >= lower_bound) & (X <= upper_bound)).all(axis=1)
X_cleaned, y_cleaned = X[inlier_mask], y[inlier_mask]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)

# Train XGBoost models on the original and cleaned data
model_original = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_original.fit(X_train, y_train)

model_cleaned = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model_cleaned.fit(X_train_cleaned, y_train_cleaned)

# Evaluate each model on its own held-out test set
y_pred_original = model_original.predict(X_test)
y_pred_cleaned = model_cleaned.predict(X_test_cleaned)

accuracy_original = accuracy_score(y_test, y_pred_original)
accuracy_cleaned = accuracy_score(y_test_cleaned, y_pred_cleaned)

print(f"Test accuracy (with outliers): {accuracy_original:.4f}")
print(f"Test accuracy (outliers removed): {accuracy_cleaned:.4f}")

The code first generates a synthetic dataset with NumPy’s random number generators, then turns 50 randomly chosen rows into outliers by adding noise drawn from a normal distribution with a larger scale parameter. The first quartile (Q1), third quartile (Q3), and interquartile range (IQR) are calculated for each feature using NumPy’s percentile function. A row is treated as an outlier and removed if any of its features falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR for that feature.

Next, the original dataset (with outliers) and the cleaned dataset (outliers removed) are each split into train and test sets. Two XGBoost classifiers are instantiated and trained on the respective training sets. Finally, each model is evaluated on its own test set using the accuracy metric, and the results are printed for comparison. Because the two test sets contain different rows, and the cleaned test set has already had its outliers filtered out, the two accuracy figures are not strictly comparable.
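In the example above, the IQR bounds are computed on the full dataset and each model is scored on a different test set. A stricter variant, sketched here with NumPy only, splits first, derives the bounds from the training rows alone, and leaves a single shared test set untouched. The helper names `fit_iqr_bounds` and `inlier_rows` are hypothetical and not part of any library, and the manual index-based split is an assumption standing in for `train_test_split`.

```python
import numpy as np

def fit_iqr_bounds(X_train, k=1.5):
    """Compute per-feature IQR fences from the training rows only."""
    q1 = np.percentile(X_train, 25, axis=0)
    q3 = np.percentile(X_train, 75, axis=0)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def inlier_rows(X, lower, upper):
    """Return True for rows whose every feature lies within [lower, upper]."""
    return ((X >= lower) & (X <= upper)).all(axis=1)

# Synthetic data with 50 noisy rows (illustrative values only)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)
outlier_idx = rng.choice(len(X), size=50, replace=False)
X[outlier_idx] += rng.normal(scale=5, size=(50, 10))

# Simple 80/20 split via shuffled indices
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:800], perm[800:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Fit the fences on the training rows and filter the training rows only
lower, upper = fit_iqr_bounds(X_train)
keep = inlier_rows(X_train, lower, upper)
X_train_cleaned, y_train_cleaned = X_train[keep], y_train[keep]

print(f"Training rows kept: {keep.sum()} of {len(keep)}")
print(f"Shared test rows: {len(X_test)}")
```

With this setup, one classifier can be fit on `(X_train, y_train)` and the other on `(X_train_cleaned, y_train_cleaned)`, and both can then be scored on the same `(X_test, y_test)`, so any difference in accuracy reflects the training data rather than the test rows.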

By removing outliers from the training data, the XGBoost model can learn more robust and generalizable patterns, potentially leading to improved performance on unseen data. However, the impact of outliers on model performance may vary depending on the dataset and the problem at hand. It is essential to carefully consider the nature of the outliers and the specific requirements of the application before deciding to remove them.
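Before dropping anything, it can help to see how much data the rule would actually discard. The sketch below, which uses illustrative synthetic data rather than the dataset from the example, counts flagged values per feature and the number of rows that would be removed. Because a row is dropped when any one feature is out of range, the row-level loss can be noticeably larger than any single feature's count suggests.

```python
import numpy as np

# Illustrative data: 1000 rows, 10 features, first 50 rows perturbed
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[:50] += rng.normal(scale=5, size=(50, 10))

# Per-feature fences using the 1.5 * IQR rule
q1 = np.percentile(X, 25, axis=0)
q3 = np.percentile(X, 75, axis=0)
iqr = q3 - q1
flagged = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)

per_feature = flagged.sum(axis=0)        # flagged values in each column
rows_flagged = flagged.any(axis=1).sum() # rows with at least one flagged value

for j, count in enumerate(per_feature):
    print(f"feature {j}: {count} values flagged")
print(f"rows that would be dropped: {rows_flagged} of {len(X)} "
      f"({rows_flagged / len(X):.1%})")
```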
