The top_k parameter in XGBoost controls feature selection when training linear models. It specifies the number of top features to select in each boosting iteration, based on the absolute values of their coefficients. This can be particularly useful for high-dimensional datasets where many features may be irrelevant or redundant.
By default, top_k is set to 0, which means all features are considered. When top_k is set to a positive integer k, the algorithm selects only the top k features in each iteration, which can improve model performance and reduce training time on large datasets.
Using the top_k parameter requires setting updater='coord_descent' and feature_selector to 'greedy' or 'thrifty'.
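If you prefer XGBoost's native training API over the scikit-learn wrapper, the same configuration can be expressed as a parameter dictionary. The snippet below is only a minimal sketch; the small synthetic dataset and the choice of 'thrifty' with top_k=10 are placeholder assumptions, not part of the example that follows.
import numpy as np
import xgboost as xgb
# Small placeholder dataset just to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)
params = {
    "booster": "gblinear",
    "updater": "coord_descent",
    "feature_selector": "thrifty",   # or "greedy"
    "top_k": 10,                     # consider only the top 10 features per iteration
    "objective": "reg:squarederror",
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=50)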
Here’s an example demonstrating how to use the top_k parameter with XGBoost’s linear booster:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import time
# Generate a synthetic regression dataset with 10,000 samples and 100 features
X, y = make_regression(n_samples=10000, n_features=100, noise=0.1, random_state=42)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize an XGBRegressor with linear booster and top_k=5
model_top_k = XGBRegressor(booster='gblinear', updater='coord_descent', feature_selector='greedy', top_k=5)
# Initialize an XGBRegressor with linear booster and default top_k (0=all)
model_default = XGBRegressor(booster='gblinear', updater='coord_descent', feature_selector='greedy')
# Train the models and measure training time
start_time = time.time()
model_top_k.fit(X_train, y_train)
top_k_train_time = time.time() - start_time
start_time = time.time()
model_default.fit(X_train, y_train)
default_train_time = time.time() - start_time
# Evaluate the models on the test set
top_k_preds = model_top_k.predict(X_test)
default_preds = model_default.predict(X_test)
top_k_mse = mean_squared_error(y_test, top_k_preds)
default_mse = mean_squared_error(y_test, default_preds)
print(f"Model with top_k=5: MSE = {top_k_mse:.4f}, Training Time = {top_k_train_time:.2f}s")
print(f"Model with default top_k: MSE = {default_mse:.4f}, Training Time = {default_train_time:.2f}s")
In this example, we generate a high-dimensional synthetic regression dataset with 10,000 samples and 100 features using make_regression() from scikit-learn. We then split the data into training and testing sets.
Next, we initialize two XGBRegressor instances with booster='gblinear'. One model uses top_k=5 to select the top 5 features in each boosting iteration, while the other uses the default top_k value, which considers all features.
We train both models and measure their training times. Finally, we evaluate the models’ performance on the test set using mean squared error (MSE) and print the results.
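To see how the selection plays out, you can inspect the learned linear weights; with booster='gblinear', the scikit-learn wrapper exposes them via the coef_ attribute. The sketch below assumes the fitted model_top_k from the example and uses an arbitrary threshold for "non-negligible"; note that more than top_k features can end up with non-zero weights, since a different top-k set may be chosen in each boosting round.
import numpy as np
# Count how many learned weights are non-negligible in the top_k model
weights = np.asarray(model_top_k.coef_)
nonzero = np.sum(np.abs(weights) > 1e-8)   # 1e-8 is an arbitrary threshold
print(f"Features with non-negligible weight: {nonzero} of {weights.size}")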
The optimal value of top_k will depend on the specific dataset and problem. Users may need to experiment with different values to find the best balance between model performance and training time. A good starting point is a small value (e.g., 5 or 10), which can be increased based on the results.
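One rough way to run that experiment is to loop over a few candidate values and compare MSE and training time, reusing the data split and imports from the example above (the candidate list here is arbitrary):
# Sketch: compare several candidate top_k values on the same train/test split
for k in [5, 10, 25, 50, 0]:   # 0 means all features are considered
    model = XGBRegressor(booster='gblinear', updater='coord_descent',
                         feature_selector='greedy', top_k=k)
    start = time.time()
    model.fit(X_train, y_train)
    elapsed = time.time() - start
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"top_k={k}: MSE = {mse:.4f}, Training Time = {elapsed:.2f}s")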
It’s important to note that setting top_k too low may exclude important features and degrade model performance, while setting it too high may not provide much benefit in terms of performance or training time.