
XGBoost Drop Non-Predictive Input Features

When working with real-world datasets, it’s common to encounter columns that carry no predictive signal, such as ID fields or other bookkeeping metadata. Leaving these non-predictive columns in place adds computational overhead and can hurt model performance by giving the model spurious patterns to fit. In this example, we’ll demonstrate how to identify and remove non-predictive columns from a Pandas DataFrame before training an XGBoost model.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Generate a synthetic dataset with predictive and non-predictive columns
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=2, n_classes=2, random_state=42)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
df["id"] = range(1000)  # Adding a non-predictive ID column
df["target"] = y

# Identify the non-predictive columns
non_predictive_columns = ["id"]

# Remove the non-predictive columns using Pandas DataFrame's drop() function
df_cleaned = df.drop(columns=non_predictive_columns)

# Split the data into train and test sets
X = df_cleaned.drop(columns=["target"])
y = df_cleaned["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an XGBoost model on the cleaned dataset
model = XGBClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model's performance on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.4f}")

Here’s a step-by-step breakdown:

  1. We generate a synthetic dataset using scikit-learn’s make_classification function, creating a mix of predictive and non-predictive features.

  2. We convert the dataset into a Pandas DataFrame and add a non-predictive “id” column to simulate a common scenario in real-world datasets.

  3. We identify the non-predictive columns, in this case the “id” column. (A heuristic for flagging such columns automatically on unfamiliar data is sketched after this list.)

  4. Using the Pandas DataFrame’s drop() function, we remove the non-predictive columns, creating a cleaned copy of the dataset (drop() returns a new DataFrame by default, so the original is left intact).

  5. We split the cleaned dataset into train and test sets, separating the target variable from the features.

  6. We train an XGBoost classifier on the cleaned training set.

  7. We evaluate the model’s performance by making predictions on the test set and calculating the accuracy score.

  8. Finally, we print the test accuracy to assess how well the model performs after removing the non-predictive columns.
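
In this walkthrough we knew up front that “id” carried no signal. On unfamiliar data, a simple heuristic can surface candidates for review: constant columns carry no information, and non-float columns with one distinct value per row are usually identifiers. The sketch below implements that heuristic; the helper name find_non_predictive_columns is purely illustrative (it is not part of pandas or XGBoost), and any flagged columns should be confirmed manually before dropping.

import pandas as pd

def find_non_predictive_columns(df, target_col):
    """Heuristic sketch: flag constant columns and ID-like columns."""
    candidates = []
    for col in df.columns:
        if col == target_col:
            continue
        n_unique = df[col].nunique()
        if n_unique <= 1:
            # A constant column is identical in every row and carries no signal
            candidates.append(col)
        elif n_unique == len(df) and not pd.api.types.is_float_dtype(df[col]):
            # One distinct value per row in a non-continuous column is ID-like
            candidates.append(col)
    return candidates

# On the DataFrame built above, this flags only the "id" column
print(find_non_predictive_columns(df, "target"))  # ['id']

Note that the float-dtype guard matters here: the continuous features from make_classification also have one distinct value per row, so without it every feature column would be flagged.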

By removing the non-predictive columns before training the XGBoost model, we ensure that the model focuses on the informative features and isn’t unnecessarily burdened by irrelevant data. This approach can lead to improved model performance and faster training times, especially when dealing with large datasets containing many non-predictive columns.
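
As a quick sanity check on this claim, you can train a second model with the “id” column left in and compare test accuracy. The sketch below assumes the variables from the listing above (df, y, accuracy, and the imports) are still in scope; on a clean synthetic dataset the gap may be small or nil, but an ID-like column gives the model extra opportunity to split on row identity rather than on real structure.

# Comparison model trained with the non-predictive "id" column retained
X_raw = df.drop(columns=["target"])
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, test_size=0.2, random_state=42)

model_raw = XGBClassifier(random_state=42)
model_raw.fit(X_tr, y_tr)

raw_accuracy = accuracy_score(y_te, model_raw.predict(X_te))
print(f"Test accuracy with 'id' kept:    {raw_accuracy:.4f}")
print(f"Test accuracy with 'id' dropped: {accuracy:.4f}")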


