XGBoost models can be fit on data stored in Excel files once the file is loaded as a NumPy array or pandas DataFrame.
To load an Excel .xlsx file into a pandas DataFrame, you can use the read_excel function provided by pandas.
This function accepts Excel-specific options, such as which sheet to read (sheet_name), which columns to keep (usecols), and which row holds the column names (header).
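As a rough illustration of a few of those options, the sketch below builds a small two-sheet workbook in a temporary directory and reads it back with sheet_name, usecols, and nrows. The file name example.xlsx and the sheet names "train" and "notes" are hypothetical, chosen only for this sketch.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# The file name and the sheet names "train" and "notes" are hypothetical.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "example.xlsx")
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame(
        {"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9], "target": [0, 1, 1]}
    ).to_excel(writer, sheet_name="train", index=False)
    pd.DataFrame({"comment": ["toy data"]}).to_excel(
        writer, sheet_name="notes", index=False
    )

# Read one named sheet, keep a subset of columns, and limit the row count.
subset = pd.read_excel(
    path,
    sheet_name="train",            # sheet name or zero-based sheet index
    usecols=["A", "B", "target"],  # column labels to keep
    nrows=2,                       # read only the first two data rows
)

# sheet_name=None returns a dict mapping each sheet name to a DataFrame.
all_sheets = pd.read_excel(path, sheet_name=None)
```

Selecting only the needed sheet and columns at load time keeps the DataFrame small before it is handed to XGBoost.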
Here’s a basic example of how to use it:
First, ensure you have the necessary libraries installed. If they are not already installed, you can install pandas, openpyxl (an optional pandas dependency that read_excel uses to read .xlsx files), and xgboost using pip:
pip install pandas openpyxl xgboost
Then, you can write a Python script to load the Excel file and fit the XGBoost model:
import pandas as pd
from xgboost import DMatrix, train
# Contents of the first sheet of data.xlsx, shown here as comma-separated values:
# "A","B","C","target"
# 1,2,3,0
# 4,5,6,1
# 7,8,9,1
# Load data from the Excel file
data = pd.read_excel('data.xlsx')
# Separate features and target
X = data.drop('target', axis=1)
y = data['target']
# Create DMatrix from X and y
dmatrix = DMatrix(data=X, label=y)
# Set XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.1,
    'random_state': 42
}
# Train the model
model = train(params, dmatrix)
Here’s what’s happening:
We assume that our data, including both the input features and the target variable, is stored in an Excel file, data.xlsx, which is loaded into a single pandas DataFrame called data.
We separate the input features X and the target y from the combined DataFrame using column selection. X is a DataFrame containing only the feature columns, and y is a Series containing the target variable.
We wrap X and y in a DMatrix, XGBoost's own data container, and specify our desired hyperparameters in the params dictionary.
We pass params and the DMatrix to train(). DMatrix accepts the pandas objects directly, so no manual conversion to NumPy arrays is needed. With the scikit-learn interface, you would instead create an instance of XGBClassifier (or XGBRegressor for regression tasks) and pass X and y directly to its fit() method.