Train XGBoost with Sparse Array

XGBoost can efficiently handle sparse datasets, which is particularly useful for high-dimensional data where most feature values are zero. By leveraging sparse data structures, you can significantly reduce memory usage and training time.

Here’s a quick example of how to train an XGBoost model using a SciPy sparse matrix in CSR format:

# XGBoosting.com
# Train an XGBoost Model with SciPy Sparse Matrices
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from scipy.sparse import csr_matrix

# Generate a synthetic dataset (dense; converted to sparse format below)
X, y = make_classification(n_samples=10000, n_features=100, n_informative=10,
                           n_redundant=20, n_classes=2, random_state=42,
                           shuffle=False)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBClassifier (default settings; XGBoost's sparsity-aware
# algorithm handles sparse input natively, no special parameters required)
model = XGBClassifier(tree_method='auto', enable_categorical=False)

# Convert the dense NumPy arrays to CSR sparse matrices, which XGBoost accepts directly
X_train_csr = csr_matrix(X_train)
X_test_csr = csr_matrix(X_test)

# Train the model
model.fit(X_train_csr, y_train)

# Make predictions on the sparse test set
predictions = model.predict(X_test_csr)
print(predictions[:5])

In this example:

  1. We generate a synthetic dataset using scikit-learn’s make_classification function (its output is a dense NumPy array).
  2. We initialize an XGBClassifier with its default settings; XGBoost’s sparsity-aware algorithm needs no sparse-specific configuration.
  3. We convert the dense arrays to scipy.sparse.csr_matrix format for efficient sparse storage.
  4. We train the model using the sparse training data.
  5. Finally, we make predictions on the sparse test set and print the first five predictions.
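
Note that make_classification produces a dense array, so converting it to CSR demonstrates the API but saves little memory. Sparse input pays off when the data is genuinely sparse. As a minimal sketch (the corpus and labels below are made-up for illustration), text features are a natural fit, since scikit-learn’s CountVectorizer already returns a scipy.sparse CSR matrix that XGBClassifier accepts directly:

# XGBoosting.com
# Sketch: XGBoost on genuinely sparse text features (illustrative only)
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

# Tiny made-up corpus with made-up binary labels (1 = spam, 0 = not spam)
docs = ["cheap meds online now", "meeting moved to friday",
        "win cash now", "project update attached"]
labels = [1, 0, 1, 0]

# CountVectorizer returns a scipy.sparse CSR matrix that is mostly zeros
vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(docs)

# The sparse matrix is passed to fit() directly; no densification needed
model = XGBClassifier(n_estimators=10)
model.fit(X_sparse, labels)
print(model.predict(X_sparse))

One caveat: XGBoost treats the non-stored entries of a sparse matrix as missing values rather than literal zeros, and its sparsity-aware split finding learns a default direction for them. This is usually desirable for sparse features, but it means results can differ slightly from training on the densified array.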

By using sparse data structures, XGBoost can handle high-dimensional datasets with a large number of zero values more efficiently, reducing memory usage and training time compared to dense representations.
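
To make the memory savings concrete, you can measure them. Below is a minimal sketch (the 10,000 x 1,000 shape and 1% density are arbitrary assumptions chosen for illustration) comparing the footprint of a dense array with its CSR equivalent:

# XGBoosting.com
# Sketch: compare dense vs. CSR memory footprint (illustrative only)
from scipy.sparse import random as sparse_random

# Random 10,000 x 1,000 matrix with roughly 1% non-zero entries (assumed density)
X_csr = sparse_random(10000, 1000, density=0.01, format='csr', random_state=42)
X_dense = X_csr.toarray()

# Dense storage holds every cell; CSR stores only the non-zeros plus index arrays
dense_mb = X_dense.nbytes / 1e6
csr_mb = (X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes) / 1e6

print(f"Dense: {dense_mb:.1f} MB")  # ~80 MB
print(f"CSR:   {csr_mb:.1f} MB")    # roughly 1-2 MB, depending on index dtype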


