XGBoost Convert DMatrix to NumPy Array

While XGBoost’s DMatrix is an optimized data structure for efficient computation and memory usage, there might be scenarios where you need to convert it back to NumPy arrays.

For example, you might want to perform custom preprocessing or postprocessing on the data, or integrate XGBoost with other libraries that work with NumPy arrays.

Here’s how you can convert a DMatrix to NumPy arrays:

import numpy as np
from xgboost import DMatrix

# Generate synthetic data
X = np.random.rand(100, 5)
y = np.random.randint(2, size=100)

# Preview the first five rows of the original arrays
print(X[:5, :])
print(y[:5])

# Create DMatrix from NumPy arrays
dmatrix = DMatrix(data=X, label=y)

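# Convert DMatrix to NumPy arrays (assuming no missing values).
# get_data() returns a SciPy CSR sparse matrix; toarray() makes it dense.
# Missing entries are not stored in the sparse matrix, so they would come back as 0.
X_array = dmatrix.get_data().toarray()
# get_label() returns the labels as a NumPy array directly
y_array = dmatrix.get_label()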

# Preview the first five rows of the recovered arrays
print(X_array[:5, :])
print(y_array[:5])

In this example:

We generate a synthetic dataset using NumPy, report some values, and create a DMatrix object dmatrix from the NumPy arrays X and y.
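To convert the DMatrix back to NumPy arrays, we call .get_data(), which returns the feature matrix as a SciPy sparse (CSR) matrix, and then .toarray() on that result to obtain a dense NumPy array. The labels come from .get_label(), which returns a NumPy array directly. We store these in X_array and y_array, respectively.
We then report values again and confirm they match the original data. Because DMatrix stores values as 32-bit floats, the features match up to float32 precision and the integer labels come back as floats.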

If your DMatrix contains additional information like feature types, you can access them using the .feature_types attribute.

Converting a DMatrix to NumPy arrays provides flexibility when you need to work with the data outside of XGBoost.

However, keep in mind that DMatrix is optimized for XGBoost, so converting back and forth between DMatrix and NumPy arrays might have some overhead. Only convert when necessary and consider using DMatrix directly when possible for optimal performance.
