XGBoost is remarkably robust when working with datasets that have far more features than examples (P ≫ N, where P is the number of features and N is the number of examples). This is largely because each tree split greedily selects from the most informative features, while built-in regularization limits overfitting.
This property makes XGBoost a go-to choice for many real-world applications where high-dimensional data is common, such as text classification, genomics, and recommender systems.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
# Generate a synthetic dataset with 1000 features and 100 examples
X, y = make_classification(n_samples=100, n_features=1000, random_state=42)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the XGBoost classifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.2f}")
In this example:
- We generate a synthetic dataset with 1000 features and only 100 examples using scikit-learn's `make_classification` function.
- The data is split into training and testing sets using `train_test_split`.
- An `XGBClassifier` is initialized with 100 estimators and a learning rate of 0.1. These hyperparameters are chosen for illustration purposes and may need tuning for real-world datasets.
- The model is trained on the high-dimensional training data and then used to make predictions on the test set.
- Finally, we evaluate the model's accuracy using `accuracy_score` from scikit-learn.
Despite the challenging scenario of having 10 times more features than examples, XGBoost learns a model that achieves high accuracy on the held-out test set. Because trees only split on features that improve the objective, and regularization penalizes overly complex trees, the model effectively performs feature selection as it trains. This robustness to high dimensionality is one of the key strengths of XGBoost, contributing to its popularity among data scientists and machine learning practitioners.
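When you need to squeeze out more performance in P ≫ N settings, XGBoost also exposes column subsampling and explicit regularization parameters that are commonly tuned for high-dimensional data. The sketch below shows one such configuration; the specific values are illustrative assumptions, not recommendations derived from this example, and should be tuned (e.g., via cross-validation) on your own data:

# Illustrative high-dimensional configuration (parameter values are assumptions to tune)
model_hd = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    colsample_bytree=0.3,  # each tree sees a random 30% of the columns
    reg_alpha=1.0,         # L1 regularization on leaf weights
    reg_lambda=1.0,        # L2 regularization on leaf weights
    random_state=42,
)
model_hd.fit(X_train, y_train)
y_pred_hd = model_hd.predict(X_test)
print(f"Test Accuracy (regularized): {accuracy_score(y_test, y_pred_hd):.2f}")

Column subsampling forces each tree to consider a different random subset of features, which both speeds up training and reduces the chance of fitting noise in any single column.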