Text Input Features for XGBoost

XGBoost is a powerful machine learning library, but it does not natively accept free-text input features such as raw strings.

To use text data with XGBoost, you must first transform the text into numerical representations that the algorithm can process.

from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
X = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Binary target variable (illustrative sentiment labels; the toy texts carry no real sentiment)
y = [1, 0, 1, 0]  # 1: positive, 0: negative

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Transform text data into TF-IDF matrix
X_tfidf = vectorizer.fit_transform(X)

# Initialize and train XGBoost model
model = XGBClassifier(random_state=42)
model.fit(X_tfidf, y)

# New text data for prediction
X_new = [
    "This is a new document.",
    "Yet another document for testing.",
]

# Transform new text data into TF-IDF matrix
X_new_tfidf = vectorizer.transform(X_new)

# Make predictions
predictions = model.predict(X_new_tfidf)

print("Predictions:", predictions)

Here’s a step-by-step breakdown:

1. Initialize a text vectorizer, such as TfidfVectorizer from scikit-learn, which will transform the text into numerical features. In this example, we use TF-IDF (Term Frequency-Inverse Document Frequency), but other techniques like bag-of-words or word embeddings can also be used.
2. Fit the vectorizer on the training text data using fit_transform(). This learns the vocabulary and document frequencies from the training data and transforms the text into a matrix of TF-IDF features.
3. Initialize an XGBClassifier (or XGBRegressor for regression tasks) with the desired hyperparameters. Here, we set random_state for reproducibility.
4. Train the XGBoost model using the transformed text features X_tfidf and the target variable y.
5. When making predictions on new text data, first transform the new data using the fitted vectorizer’s transform() method. This ensures that the new data is transformed consistently with the training data.
6. Use the trained model to make predictions on the transformed new text data using predict().
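
To show what the fit and transform steps above compute, here is a minimal sketch in plain Python. It is a simplified illustration, not scikit-learn's implementation: the function names tokenize, fit, and transform are hypothetical, the tokenizer is only a rough approximation, and the weighting assumes TfidfVectorizer's default settings (smoothed IDF and L2-normalized rows).

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters
    # (a rough stand-in for the vectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())


def fit(docs):
    """Learn the vocabulary and smoothed IDF weights from training documents."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    return vocab, idf


def transform(docs, vocab, idf):
    """Map documents to L2-normalized TF-IDF rows using the fitted vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # Only vocabulary words get a column; words unseen in training are dropped.
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(value * value for value in row))
        rows.append([value / norm if norm else 0.0 for value in row])
    return rows


train_docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vocab, idf = fit(train_docs)
X_train_rows = transform(train_docs, vocab, idf)
X_new_rows = transform(["This is a new document."], vocab, idf)

print("Vocabulary:", vocab)
print("New document row:", [round(value, 3) for value in X_new_rows[0]])
```

The new document contains the word "new", which was never seen during fitting, so it gets no column. This is why new data must go through transform() on the already fitted vectorizer rather than a fresh fit_transform(): the columns have to line up with the ones the model was trained on.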

By preprocessing the text data into numerical features, you can effectively use XGBoost for tasks involving text input. Keep in mind that the choice of text vectorization technique (e.g., TF-IDF, word embeddings) and its parameters may impact the model’s performance, so experimentation and tuning are often necessary to achieve the best results.
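
As a rough sketch of how the choice of vectorizer and its parameters changes what the model sees, the snippet below fits three scikit-learn vectorizers on the same toy corpus and reports the size of the resulting feature matrix. The three candidate configurations are arbitrary picks for illustration, not recommendations. In practice you would compare them by validation score on real data, not by feature count.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Candidate vectorizers (illustrative choices only)
candidates = {
    "bag-of-words": CountVectorizer(),
    "tf-idf unigrams": TfidfVectorizer(),
    "tf-idf unigrams + bigrams": TfidfVectorizer(ngram_range=(1, 2)),
}

shapes = {}
for name, vectorizer in candidates.items():
    # Each vectorizer learns its own vocabulary from the same documents
    matrix = vectorizer.fit_transform(docs)
    shapes[name] = matrix.shape
    print(f"{name}: {matrix.shape[0]} documents x {matrix.shape[1]} features")
```

Adding bigrams enlarges the feature space, which can help capture short phrases but also gives XGBoost more sparse columns to split on, so it is worth tuning alongside the model's own hyperparameters.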
