
XGBoost Robust to More Features Than Examples (P>>N)

XGBoost is remarkably robust on datasets that have far more features than examples (P >> N): greedy split selection acts as built-in feature selection, and regularization keeps the trees from overfitting the many uninformative columns.

This property makes XGBoost a go-to choice for real-world applications where high-dimensional data is common, such as text classification, genomics, and recommender systems.

The example below demonstrates this on a synthetic dataset with 10 times more features than examples.

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Generate a synthetic dataset with 1000 features and 100 examples
X, y = make_classification(n_samples=100, n_features=1000, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the XGBoost classifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.2f}")

In this example:

  1. We generate a synthetic dataset with 1000 features and only 100 examples using scikit-learn’s make_classification function.

  2. The data is split into training and testing sets using train_test_split.

  3. An XGBClassifier is initialized with 100 estimators and a learning rate of 0.1. These hyperparameters are chosen for illustration purposes and may need tuning for real-world datasets; a minimal tuning sketch follows this list.

  4. The model is trained on the high-dimensional training data and then used to make predictions on the test set.

  5. Finally, we evaluate the model’s accuracy using accuracy_score from scikit-learn.
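
As a minimal sketch of what the tuning mentioned in step 3 might look like, the snippet below runs a small cross-validated grid search over the model from the example above. The parameter grid and fold count are illustrative assumptions, not recommended defaults.

from sklearn.model_selection import GridSearchCV

# Illustrative (assumed) search space; real datasets may need a wider grid
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid,
    cv=3,  # few folds: only 80 training examples are available
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.2f}")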

Despite the challenging scenario of having 10 times more features than examples, XGBoost is able to learn a model that achieves high accuracy on the held-out test set. This robustness to high dimensionality is one of the key strengths of XGBoost, contributing to its popularity among data scientists and machine learning practitioners.
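
When P >> N still pushes a model toward overfitting, XGBoost exposes hyperparameters that directly limit how many features each tree sees and how complex it can grow. The configuration below is a sketch of how those knobs might be set for high-dimensional data; the specific values are assumptions for illustration, not tuned recommendations.

# A sketch of dimensionality-aware settings (values are illustrative assumptions)
regularized_model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    colsample_bytree=0.3,  # each tree samples 30% of the 1000 features
    subsample=0.8,         # each tree trains on 80% of the rows
    max_depth=3,           # shallow trees limit feature interactions
    reg_alpha=1.0,         # L1 penalty encourages sparse leaf weights
    reg_lambda=1.0,        # L2 penalty shrinks leaf weights
    random_state=42,
)
regularized_model.fit(X_train, y_train)
y_pred_reg = regularized_model.predict(X_test)
print(f"Regularized Test Accuracy: {accuracy_score(y_test, y_pred_reg):.2f}")

Column subsampling (colsample_bytree) is particularly relevant in the P >> N regime, since it forces each tree to consider a different random slice of the feature space.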


