
XGBoost Convert DMatrix to Pandas DataFrame

XGBoost's DMatrix is the preferred data structure for training and prediction with XGBoost models, but you may sometimes need to convert a DMatrix to a Pandas DataFrame.

For example, you might want to perform data exploration or integrate the data with other Pandas-based workflows.

Here’s how you can convert an XGBoost DMatrix to a Pandas DataFrame:

import numpy as np
import pandas as pd
from xgboost import DMatrix

# Generate synthetic data
X = np.random.rand(100, 5)
y = np.random.randint(2, size=100)

# Create DMatrix from NumPy arrays
dmatrix = DMatrix(data=X, label=y)

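# Convert DMatrix to Pandas DataFrame
# get_data() returns a SciPy CSR sparse matrix; toarray() makes it a dense NumPy array
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(dmatrix.get_data().toarray(), columns=feature_names)
df['label'] = dmatrix.get_label()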

# Print first few rows of original data and converted DataFrame
print("Original Data:")
print(X[:5, :])
print(y[:5])
print("\nConverted DataFrame:")
print(df.head())

In this example:

  1. We generate a synthetic dataset using NumPy and create a DMatrix object dmatrix from the NumPy arrays X and y.

  2. To convert the DMatrix to a Pandas DataFrame, we call dmatrix.get_data() (added in XGBoost 1.7), which returns the feature matrix as a SciPy CSR sparse matrix, and then call .toarray() on that matrix to get a dense NumPy array. We then create a DataFrame df from this array and assign the column names feature_0, feature_1, etc.

  3. We add the label data to the DataFrame by getting the labels from dmatrix using .get_label and assigning them to a new column 'label' in df.

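  4. Finally, we print the first few rows of the original data X and y, and the converted DataFrame df using .head(), to confirm they match. Because DMatrix stores values as 32-bit floats, the features agree only to float32 precision and the labels come back as floats (0.0 and 1.0) rather than integers.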

Converting a DMatrix to a Pandas DataFrame provides flexibility when you need to work with the data using Pandas functionality.

However, keep in mind that DMatrix is optimized for XGBoost, and each conversion between DMatrix and DataFrame copies the data, which costs time and memory on large datasets. Convert only when necessary, and use the DMatrix directly where possible for best performance.


