Train an XGBoost Model on an Excel File

XGBoost models can be fit on data stored in Excel files once the file has been loaded into a NumPy array or Pandas DataFrame.

To load an Excel .xlsx file into a pandas DataFrame, you can use the read_excel function provided by pandas.

The function accepts several Excel-specific options, such as sheet_name (which sheet to read), usecols (which columns to keep), and header (which row holds the column names).
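As a rough illustration of those options, the sketch below writes a small two-sheet workbook to a temporary directory and reads back only part of it. The file name example.xlsx and the sheet names "train" and "notes" are made up for this illustration, and the sketch assumes openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# The file name and sheet names are hypothetical.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.xlsx")

train_df = pd.DataFrame(
    {"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9], "target": [0, 1, 1]}
)
notes_df = pd.DataFrame({"note": ["not training data"]})

with pd.ExcelWriter(path) as writer:
    train_df.to_excel(writer, sheet_name="train", index=False)
    notes_df.to_excel(writer, sheet_name="notes", index=False)

# Read only the "train" sheet, and only the columns we need.
subset = pd.read_excel(path, sheet_name="train", usecols=["A", "B", "target"])
print(subset)
```

Selecting the sheet and columns at load time keeps unrelated worksheet content out of the feature matrix.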

Here’s a basic example of how to use it:

First, ensure you have the necessary libraries installed. If not already installed, you can install pandas and openpyxl (which is an optional dependency for handling .xlsx files) using pip:

pip install pandas openpyxl

Then, you can write a Python script to load the Excel file and fit the XGBoost model:

import pandas as pd
from xgboost import DMatrix, train

# Contents of the first sheet of data.xlsx, shown as comma-separated values:
# "A","B","C","target"
# 1,2,3,0
# 4,5,6,1
# 7,8,9,1

# Load data from the Excel file
data = pd.read_excel('data.xlsx')

# Separate features and target
X = data.drop('target', axis=1)
y = data['target']

# Create DMatrix from X and y
dmatrix = DMatrix(data=X, label=y)

# Set XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.1,
    'random_state': 42
}

# Train the model
model = train(params, dmatrix)

Here’s what’s happening:

We assume that our data, including both input features and the target variable, is stored in an Excel file data.xlsx which is loaded into a single Pandas DataFrame called data.
We separate the input features X and target y from the combined DataFrame using column selection. X is a DataFrame containing only the feature columns, and y is a Series containing the target variable.
We wrap X and y in a DMatrix, XGBoost's internal data structure, and define the training parameters in a dictionary.
We pass the parameters and the DMatrix to train(), which returns a trained Booster. DMatrix accepts the Pandas objects directly, so no manual conversion to NumPy is needed.

See Also