XGBoost models can be fit on data stored in Excel files once the file has been loaded into a NumPy array or pandas DataFrame.
To load an Excel .xlsx file into a pandas DataFrame, you can use the read_excel function provided by pandas.
This function accepts several Excel-specific options, such as which sheet to read (sheet_name), which row holds the column names (header), and which columns to keep (usecols).
Here’s a basic example of how to use it:
First, ensure you have the necessary libraries installed. If they are not already installed, you can install pandas, openpyxl (the optional pandas dependency used to read .xlsx files), and xgboost using pip:
pip install pandas openpyxl xgboost
Then, you can write a Python script to load the Excel file and fit the XGBoost model:
import pandas as pd
from xgboost import DMatrix, train
# content of data.xlsx:
# "A","B","C","target"
# 1,2,3,0
# 4,5,6,1
# 7,8,9,1
# Load data from the Excel file
data = pd.read_excel('data.xlsx')
# Separate features and target
X = data.drop('target', axis=1)
y = data['target']
# Create DMatrix from X and y
dmatrix = DMatrix(data=X, label=y)
# Set XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.1,
    'random_state': 42
}
# Train the model
model = train(params, dmatrix)
Here’s what’s happening:
We assume that our data, including both input features and the target variable, is stored in an Excel file data.xlsx, which is loaded into a single pandas DataFrame called data.
We separate the input features X and the target y from the combined DataFrame using column selection. X is a DataFrame containing only the feature columns, and y is a Series containing the target variable.
We wrap X and y in a DMatrix, XGBoost's own data structure. It accepts these pandas objects directly, without any manual conversion.
We specify our desired hyperparameters in a dictionary and pass it to train() together with the DMatrix.
The scikit-learn-style XGBClassifier (or XGBRegressor for regression tasks) works the same way: X and y can be passed directly to its fit() method.
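Real workbooks often hold more than one sheet or carry extra columns, which is where the Excel-specific options of read_excel come in. The sketch below is a minimal illustration, not part of the original example: the file example.xlsx, the sheet names "notes" and "train", and the "id" column are all hypothetical, and the workbook is written to a temporary folder first only so the snippet runs on its own.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# The sheet names and the "id" column are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "example.xlsx")
train_sheet = pd.DataFrame({
    "id": [101, 102, 103],
    "A": [1, 4, 7],
    "B": [2, 5, 8],
    "C": [3, 6, 9],
    "target": [0, 1, 1],
})
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    pd.DataFrame({"note": ["scratch"]}).to_excel(writer, sheet_name="notes", index=False)
    train_sheet.to_excel(writer, sheet_name="train", index=False)

# Read only the "train" sheet and only the columns the model needs.
data = pd.read_excel(
    path,
    sheet_name="train",
    usecols=["A", "B", "C", "target"],
)

# Separate features and target, as in the main script.
X = data.drop(columns="target")
y = data["target"]
print(X.shape, y.tolist())
```

Dropping the identifier column at load time with usecols keeps it out of the feature matrix, so X and y can go straight into DMatrix as in the script above.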