What this is

The Higgs challenge asks whether a simulated particle collision event is signal, meaning consistent with Higgs boson production, or background. The pipeline handles the original Kaggle feature format, converts the -999 sentinel values to missing values, imputes and scales numeric features, checks class imbalance, trains three classifiers, and reports hold-out plus cross-validation results.

Reproducible pipeline

All training logic lives in src/higgs_pipeline.py, with a notebook wrapper for review.

Physics-aware cleaning

The Kaggle sentinel value -999 is converted to missing data before median imputation.
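A minimal sketch of this cleaning step, assuming NumPy and a small made-up feature matrix. The function name `clean_features` is hypothetical, and the code in src/higgs_pipeline.py may be organized differently. The sketch also standardizes each column, since the pipeline scales numeric features after imputation.

```python
import numpy as np

SENTINEL = -999.0  # Kaggle's marker for a value that could not be computed


def clean_features(X):
    """Replace the sentinel with NaN, impute column medians, then standardize."""
    X = np.asarray(X, dtype=float).copy()
    X[X == SENTINEL] = np.nan
    # Medians use observed values only, so the sentinel never pulls
    # the statistic toward -999.
    medians = np.nanmedian(X, axis=0)
    X = np.where(np.isnan(X), medians, X)
    # Standardize each column to zero mean and unit variance.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (X - X.mean(axis=0)) / std, medians


# Toy rows (made-up values, not real challenge data).
raw = [[120.0, 50.0], [-999.0, 60.0], [130.0, -999.0], [110.0, 70.0]]
cleaned, medians = clean_features(raw)
print(medians)
```

In a train/test setting, the medians and scaling statistics would normally be computed on the training split and reused on held-out data.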

Reviewable outputs

Metrics, cross-validation tables, model tuning JSON, and result plots are saved under outputs/real.

Model results

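XGBoost had the highest baseline ROC-AUC (0.9063), and grid search tuning raised it to 0.9093. Random Forest had the highest precision (0.7654), while both XGBoost variants had the highest recall (about 0.826).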

Model                 Precision   Recall   F1 score   ROC-AUC
Tuned XGBoost         0.7160      0.8260   0.7671     0.9093
XGBoost baseline      0.7078      0.8257   0.7622     0.9063
Random Forest         0.7654      0.7580   0.7617     0.9049
Logistic Regression   0.5884      0.7600   0.6633     0.8133
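The F1 column can be sanity-checked from the precision and recall columns, since F1 is their harmonic mean. The snippet below is an illustrative check using the values from the results table above. It is not part of the pipeline, and the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "Tuned XGBoost": (0.7160, 0.8260, 0.7671),
    "XGBoost baseline": (0.7078, 0.8257, 0.7622),
    "Random Forest": (0.7654, 0.7580, 0.7617),
    "Logistic Regression": (0.5884, 0.7600, 0.6633),
}

for model, (p, r, f1) in reported.items():
    print(f"{model}: computed {f1_from(p, r):.4f}, reported {f1:.4f}")
```

Because the table values are rounded to four decimals, the computed and reported figures should agree only to within rounding.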

Screenshots and plots

These plots are generated directly by the pipeline and committed as static artifacts so the GitHub Pages site can show the assignment output without rerunning training.

[Figure: ROC curves comparing XGBoost, Random Forest, and Logistic Regression]

ROC curves

XGBoost and Random Forest both separate signal from background well (ROC-AUC above 0.90), with tuned XGBoost leading at 0.9093.
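As a reminder of what ROC-AUC measures, it equals the fraction of (signal, background) pairs in which the signal event receives the higher score. The sketch below computes it that way on made-up scores. The function name `pairwise_roc_auc` is hypothetical, and the pipeline presumably relies on a library routine instead.

```python
def pairwise_roc_auc(labels, scores):
    """ROC-AUC as the share of (signal, background) pairs ranked correctly.

    Ties count as half a correct pair. The double loop is quadratic, which is
    fine for illustration but too slow for a full training set.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one signal and one background event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = signal, 0 = background.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(labels, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```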

[Figure: Precision-recall curves for the classification models]

Precision-recall curves

The curves show the tradeoff between precision and recall for signal detection under class imbalance.
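To make the tradeoff concrete, the sketch below computes precision and recall at a few decision thresholds on made-up scores with three signal events among ten. The helper name `precision_recall_at` is hypothetical. Each threshold corresponds to one point on a precision-recall curve.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when events scoring >= threshold are called signal."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
    fn = sum(1 for y, p in zip(labels, predicted) if not p and y == 1)
    # Convention used here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up data: 1 = signal, 0 = background.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.5, 0.3, 0.3, 0.2, 0.1, 0.1]

for threshold in (0.85, 0.6, 0.35):
    p, r = precision_recall_at(labels, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold recovers more signal events (higher recall) at the cost of admitting more background (lower precision).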

[Figure: Bar chart of top permutation importance features]

Top features

DER_mass_MMC, DER_mass_transverse_met_lep, and DER_mass_vis had the highest permutation importance, meaning the model relied on them most.
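For readers unfamiliar with permutation importance, here is a minimal sketch of the idea, assuming NumPy, accuracy as the score, and a stand-in `predict` function in place of a fitted model. All names and the toy data are hypothetical, and the pipeline's scoring metric is not stated here. A feature's importance is the drop in score when that column is shuffled.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this column and the label.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[col] = np.mean(drops)
    return importances


# Toy data: the label depends only on column 0; column 1 is pure noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)


def predict(features):
    # Stand-in for a fitted classifier: thresholds column 0 only.
    return (features[:, 0] > 0).astype(int)


importances = permutation_importance(predict, X, y)
print(importances)
```

Column 0 should show a large drop, while column 1 shows none because the stand-in model never reads it.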

[Figure: Target class distribution showing background and signal counts]

Target distribution

The dataset has more background than signal events, so sampling was applied inside the training folds only, which keeps resampled rows out of the validation data.
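The sketch below illustrates fold-local sampling under stated assumptions: it uses NumPy, random oversampling of the minority class as a stand-in (the sampling method used by the pipeline is not named here), and plain shuffled folds rather than stratified ones for brevity. The function names are hypothetical.

```python
import numpy as np


def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until both classes match in size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


def folds_with_training_only_sampling(X, y, n_folds=3, seed=0):
    """Yield (X_train, y_train, X_val, y_val); only the training part is resampled."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    for val_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, val_idx)
        X_train, y_train = oversample_minority(X[train_idx], y[train_idx], rng)
        # Validation rows are left untouched so scores reflect the real class mix.
        yield X_train, y_train, X[val_idx], y[val_idx]


# Toy imbalanced data: 20 signal events among 60.
X = np.arange(60, dtype=float).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 40)
splits = list(folds_with_training_only_sampling(X, y))
for X_train, y_train, X_val, y_val in splits:
    print(len(y_train), int(y_train.sum()), len(y_val))
```

Resampling before splitting would let duplicated signal rows leak into validation folds and inflate the scores, which is why the sampling step sits inside the loop.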

Run locally

The real Kaggle CSV is intentionally excluded from git. Download it from Kaggle, place it at data/training.csv, then run the pipeline.

python3 -m venv .venv
.venv/bin/python -m pip install -r requirements.txt
kaggle competitions download -c higgs-boson -f training.csv.zip -p data
unzip data/training.csv.zip -d data
.venv/bin/python src/higgs_pipeline.py --data data/training.csv --output-dir outputs/real --cv-folds 3
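For orientation, the three flags in the last command could be parsed along the following lines. This is a hypothetical sketch using the standard library's argparse, not a copy of src/higgs_pipeline.py. The defaults, help strings, and the fold-count check are assumptions.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the flags shown in the run command (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Higgs classification pipeline (sketch)")
    parser.add_argument("--data", type=Path, default=Path("data/training.csv"),
                        help="path to the Kaggle training CSV")
    parser.add_argument("--output-dir", type=Path, default=Path("outputs/real"),
                        help="directory for metrics, tuning JSON, and plots")
    parser.add_argument("--cv-folds", type=int, default=3,
                        help="number of cross-validation folds")
    args = parser.parse_args(argv)
    if args.cv_folds < 2:
        parser.error("--cv-folds must be at least 2")
    return args


# Same arguments as the command above, passed as a list for illustration.
args = parse_args(["--data", "data/training.csv",
                   "--output-dir", "outputs/real",
                   "--cv-folds", "3"])
print(args.data, args.output_dir, args.cv_folds)
```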

Repository map

Notebook

Higgs_Boson_Classification_Pipeline.ipynb explains the assignment workflow and loads generated plots.

Pipeline

src/higgs_pipeline.py contains preprocessing, training, evaluation, tuning, and plot generation.

Outputs

outputs/real contains CSV metrics, tuning JSON, summary markdown, and chart images.

Pages site

index.html and assets/styles.css provide the GitHub Pages explainer.