XGBoost for the KDDCup99 Dataset

The KDDCup99 dataset is a widely used benchmark for anomaly detection and network intrusion detection systems.

In this example, we’ll load the KDDCup99 dataset using scikit-learn’s fetch_kddcup99() function, perform hyperparameter tuning using GridSearchCV with common XGBoost parameters, save the best model, load it, and use it to make predictions.

from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
import numpy as np
from collections import Counter

# Load the KDDCup99 dataset
X, y = fetch_kddcup99(return_X_y=True)

# Print key information about the dataset
print(f"Dataset shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"Class Distributions: {Counter(y)}")

# Encode categorical variables: columns 1-3 (protocol_type, service, flag) are nominal byte strings
nominal = [1, 2, 3]
transformer = ColumnTransformer(transformers=[('ordinal', OrdinalEncoder(), nominal)], remainder='passthrough')
# Perform ordinal encoding
X = transformer.fit_transform(X)

# Encode target variable (keep the encoder so predictions can be mapped back to class names)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define parameter grid
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create XGBClassifier (single-threaded here; the grid search below parallelizes across folds)
model = XGBClassifier(objective='multi:softmax', random_state=42, n_jobs=1)

# Perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Print best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Access best model
best_model = grid_search.best_estimator_

# Save best model
best_model.save_model('best_model_kddcup99.ubj')

# Load saved model
loaded_model = XGBClassifier()
loaded_model.load_model('best_model_kddcup99.ubj')

# Use loaded model for predictions
predictions = loaded_model.predict(X_test)

# Print accuracy score and confusion matrix
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.3f}")
print(f"Confusion matrix:\n{confusion_matrix(y_test, predictions)}")

Running the example, you will see results similar to the following:

Dataset shape: (494021, 41)
Number of classes: 23
Class Distributions: Counter({b'smurf.': 280790, b'neptune.': 107201, b'normal.': 97278, b'back.': 2203, b'satan.': 1589, b'ipsweep.': 1247, b'portsweep.': 1040, b'warezclient.': 1020, b'teardrop.': 979, b'pod.': 264, b'nmap.': 231, b'guess_passwd.': 53, b'buffer_overflow.': 30, b'land.': 21, b'warezmaster.': 20, b'imap.': 12, b'rootkit.': 10, b'loadmodule.': 9, b'ftp_write.': 8, b'multihop.': 7, b'phf.': 4, b'perl.': 3, b'spy.': 2})
...

In this example, we load the KDDCup99 dataset using fetch_kddcup99() from scikit-learn and print key information about it, such as its shape, number of classes, and class distribution. Because the three nominal feature columns (protocol_type, service, flag) and the target labels are byte strings, we encode the features with an OrdinalEncoder wrapped in a ColumnTransformer and the target with a LabelEncoder.
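
By default, fetch_kddcup99() downloads only the 10 percent subset of the original data, which is what produces the 494,021-row shape above. If you want to experiment with more (or less) data, the function also accepts percent10 and subset parameters; a minimal sketch:

from sklearn.datasets import fetch_kddcup99

# Load the full dataset instead of the default 10 percent sample
# (several million rows, so expect longer download and training times)
X_full, y_full = fetch_kddcup99(return_X_y=True, percent10=False)

# Load the 'SA' subset: all normal traffic plus a small share of attacks
X_sa, y_sa = fetch_kddcup99(return_X_y=True, subset='SA')

print(X_full.shape, X_sa.shape)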

Next, we split the data into stratified train and test sets (stratification matters here because the class distribution is highly imbalanced) and define a parameter grid with common XGBoost hyperparameters. We create an instance of XGBClassifier and perform a grid search using GridSearchCV with 3-fold cross-validation. After fitting the grid search object, we print the best score and corresponding best parameters.
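
If you want to inspect how every parameter combination performed, not just the best one, GridSearchCV exposes the full results in its cv_results_ attribute. A short sketch, assuming the fitted grid_search object from the example above and that pandas is available:

import pandas as pd

# Summarize all tried combinations, best-ranked first
results = pd.DataFrame(grid_search.cv_results_)
columns = ['rank_test_score', 'mean_test_score', 'std_test_score', 'params']
print(results[columns].sort_values('rank_test_score').head(10))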

We access the best model using best_estimator_ and save it to a file named ‘best_model_kddcup99.ubj’. To demonstrate loading the saved model, we create a new XGBClassifier instance and load the saved model using load_model().
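
The .ubj extension tells XGBoost to serialize the model in its compact binary UBJSON format. If you prefer a human-readable file, save_model() and load_model() also support JSON, selected simply by the file extension; for example, assuming the best_model from above:

from xgboost import XGBClassifier

# Save in human-readable JSON; XGBoost infers the format from the extension
best_model.save_model('best_model_kddcup99.json')

# Load it back the same way
loaded_json_model = XGBClassifier()
loaded_json_model.load_model('best_model_kddcup99.json')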

Finally, we use the loaded model to make predictions on the test set and print the accuracy score and confusion matrix.
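
Because the targets were label-encoded, the model's predictions are integers rather than the original attack names. Since we kept the fitted LabelEncoder, we can map them back with inverse_transform(); a minimal sketch, assuming the label_encoder and predictions variables from the example above:

from collections import Counter

# Map integer predictions back to the original byte-string attack labels
decoded = label_encoder.inverse_transform(predictions)
print(Counter(decoded).most_common(5))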

By following this approach, you can effectively apply XGBoost to the KDDCup99 dataset for network intrusion detection, perform hyperparameter tuning, save the best model, and use it for making predictions.


