XGBoost Print Data in DMatrix

When working with XGBoost, you may need to inspect the data stored in a DMatrix object for debugging or to integrate with other tools.

Here’s how you can access and print the data in a DMatrix.

import numpy as np
from xgboost import DMatrix

# Generate synthetic data
n_samples, n_features = 10, 5
X = np.random.rand(n_samples, n_features)
y = np.random.randint(2, size=n_samples)

# Create DMatrix
dmatrix = DMatrix(data=X, label=y)

# Access and print data
print("Feature Matrix:")
print(dmatrix.get_data().toarray()[:5, :])  # Print first 5 rows

print(dmatrix.get_label()[:5])    # Print first 5 labels

In this example:

  1. We generate a small synthetic dataset using NumPy with 10 samples and 5 features. The features are random floats between 0 and 1, and the labels are randomly assigned 0 or 1.

  2. We create a DMatrix object dmatrix from the synthetic data X and labels y.

  3. To access the data in the DMatrix, we use:

    • dmatrix.get_data(): Returns the feature matrix as a scipy.sparse CSR matrix (available in XGBoost 1.7 and later).
    • toarray(): Called on that sparse matrix, converts it to a dense NumPy array.
    • dmatrix.get_label(): Returns the labels as a NumPy array of 32-bit floats, which is why the integer labels print as 1. and 0.

  4. We print the first 5 rows of the feature matrix and the first 5 labels to confirm the data is stored correctly in the DMatrix.

The output will look something like:

Feature Matrix:
[[0.76826936 0.3751373  0.13454452 0.95997924 0.1613448 ]
 [0.20457081 0.39761412 0.23949952 0.65726465 0.12632865]
 [0.01898127 0.8946074  0.9941333  0.25311878 0.9032138 ]
 [0.5629613  0.6073675  0.17431487 0.07749726 0.5096905 ]
 [0.5430099  0.1685548  0.89152557 0.08665203 0.33809692]]
[1. 1. 0. 1. 0.]

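The get_data() and get_label() calls give you a quick way to inspect the data in a DMatrix. Keep in mind that toarray() builds a dense copy of the entire feature matrix, so for large datasets it is better to slice the sparse matrix first, for instance dmatrix.get_data()[:5].toarray(), and to print only a subset of the data to avoid flooding your output.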

