Longitudinal Grocery Pricing Study

King Kullen Price Research

A living catalog, crawled weekly.

20,716 unique products. 423 sub-categories. Every item pulled directly from King Kullen's Freshop API — not scraped from pages. Prices recorded weekly via GitHub Actions. The goal is a multi-month time series: enough data to train a pricing model that actually generalizes.

Unique Products Tracked: 20,716
Sub-Categories Mapped: 423
Items Currently on Sale: ~15%
Average Regular Price: $7.03

Exploratory Data Analysis

What the catalog actually looks like

Three angles on the data: shelf-space allocation by category, price variance within categories, and which aisles King Kullen discounts most aggressively.

Top 20 Product Categories by Item Count

Shelf space by category

The 20 most-populated sub-categories by item count. Pantry staples and broad beverage categories dominate the catalog. This asymmetry is likely to be a major source of model bias in early training runs, because the model sees far more examples from those categories than from the long tail.

Price Distribution in Top 10 Categories

Price spread per category

Boxplots for the 10 most-populated categories. Wide interquartile ranges signal a mix of generic and premium SKUs in the same aisle. Narrow boxes signal commodity sections where price competition has compressed the range.
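
The interquartile range behind each box can be computed directly. The sketch below is a minimal illustration, assuming each record is a dict with hypothetical "category" and "regular_price" fields; the sample prices are made up for the example and are not taken from the catalog.

```python
from collections import defaultdict
from statistics import quantiles


def iqr_by_category(records):
    """Return {category: (q1, median, q3, iqr)} for categories with 2+ prices."""
    prices = defaultdict(list)
    for rec in records:
        # "category" and "regular_price" are assumed field names.
        prices[rec["category"]].append(rec["regular_price"])

    summary = {}
    for category, values in prices.items():
        if len(values) < 2:
            continue  # quantiles() needs at least two data points
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summary[category] = (q1, median, q3, q3 - q1)
    return summary


# Illustrative data only: one aisle mixing generic and premium SKUs,
# one commodity aisle with a compressed price range.
sample = [
    {"category": "Olive Oil", "regular_price": p}
    for p in (4.99, 7.49, 12.99, 24.99)
] + [
    {"category": "Canned Beans", "regular_price": p}
    for p in (0.99, 1.09, 1.19, 1.29)
]

spread = iqr_by_category(sample)
for category, (q1, median, q3, iqr) in spread.items():
    print(f"{category}: Q1={q1:.2f} median={median:.2f} Q3={q3:.2f} IQR={iqr:.2f}")
```

A wide IQR relative to the median is the pattern described above as a generic/premium mix; a narrow one matches the commodity case.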

Discount Rates by Category

Which aisles run the deepest deals

Categories with the highest proportion of items currently on sale, filtered to categories with 50 or more products. These are the aisles where promotional cadence is the strongest pricing signal.
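
A minimal sketch of the ranking described above. It assumes an item counts as "on sale" when its current price is strictly below its regular price, and it uses hypothetical field names ("category", "price", "regular_price"); the catalog's actual sale flag may be defined differently.

```python
from collections import defaultdict


def discount_rates(records, min_items=50):
    """Share of on-sale items per category, largest first.

    Categories with fewer than min_items products are dropped so that
    tiny categories cannot top the ranking by chance.
    """
    totals = defaultdict(int)
    on_sale = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        # Assumption: on sale means current price < regular price.
        if rec["price"] < rec["regular_price"]:
            on_sale[rec["category"]] += 1

    rates = {c: on_sale[c] / n for c, n in totals.items() if n >= min_items}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative data: 60 chip SKUs with 15 discounted, plus a 10-item
# category where everything is discounted (filtered out by min_items).
sample = (
    [{"category": "Chips", "price": 2.99, "regular_price": 3.99}] * 15
    + [{"category": "Chips", "price": 3.99, "regular_price": 3.99}] * 45
    + [{"category": "Caviar", "price": 40.0, "regular_price": 50.0}] * 10
)

ranking = discount_rates(sample)
print(ranking)
```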

Most Common Terms in King Kullen Products

The language of the catalog

Word cloud built from 9,834 product names. High-frequency terms like "organic," "oz," and brand names dominate. These are the raw materials for the TF-IDF feature layer in the Ridge regression model.


Price Prediction Model

Ridge regression on product name and category

Two models, trained and evaluated on the 20,716-item catalog. One winner. Ridge outperforms Gradient Boosting here, most likely because the feature space is high-dimensional and sparse: TF-IDF over product names creates 3,000+ dimensions, a regime where regularized linear models tend to do well.
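
To make the sparsity point concrete, here is a small pure-Python TF-IDF sketch. It is an illustration only: the tokenizer, the smoothed idf formula ln((1 + n) / (1 + df)) + 1, and the sample product names are all assumptions, and the project's actual vectorizer settings are not stated on this page.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Lowercase a product name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def tfidf(names):
    """Return (sparse_vectors, idf) where each vector is a {term: weight} dict."""
    docs = [Counter(tokenize(n)) for n in names]
    n_docs = len(docs)
    # Document frequency: in how many names each term appears at least once.
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return vectors, idf


# Made-up product names for illustration.
names = [
    "Organic Whole Milk 64 oz",
    "Organic Baby Spinach 5 oz",
    "Whole Wheat Bread 20 oz",
]
vectors, idf = tfidf(names)

print(f"vocabulary size: {len(idf)}")
print(f"non-zero terms in first name: {len(vectors[0])}")
print(f"idf('oz')={idf['oz']:.3f}  idf('spinach')={idf['spinach']:.3f}")
```

Each name touches only a handful of vocabulary terms, so most of the feature matrix is zero. A term that appears in every name, such as "oz", gets the lowest weight, while a rare term such as "spinach" gets a higher one.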

Winner: Ridge Regression (alpha = 1.0, 5-fold CV)
R² Score: 0.590 (vs. 0.445 on 9k items, +32%)
RMSE (root mean squared error): $3.77
MAE (mean absolute error): $2.21

Model Comparison

Ridge vs. Gradient Boosting (20k items)

Ridge at R² = 0.590 edges out GBM at 0.519. The 9k baseline was 0.445, so growing the dataset roughly 2.1x delivered a +32% relative lift in R². The expectation is that scores keep improving as weekly snapshots accumulate, though that has not been measured yet.
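
The lift is a relative change in R², not a difference in percentage points. The arithmetic, using only the scores quoted above (the helper name is illustrative):

```python
def relative_change(old, new):
    """Fractional change from old to new, e.g. 0.25 means +25%."""
    return (new - old) / old


r2_lift = relative_change(0.445, 0.590)  # 9k baseline -> 20k run (Ridge)
ridge_vs_gbm = 0.590 - 0.519             # absolute R² gap on the 20k run

print(f"R² lift from more data: {r2_lift:+.1%}")
print(f"Ridge minus GBM: {ridge_vs_gbm:.3f} R² points")
```

The lift works out to about 32.6%, and the Ridge-over-GBM margin on the 20k run is about 0.071 in absolute R².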

Predicted vs Actual Prices

Predicted vs. actual price (20k)

Each dot is a product from the 4,143-item test holdout. Points cluster tightly near the diagonal for everyday staples. The model now generalizes visibly better than the 9k run: the scatter cloud is narrower, and the diagonal is cleaner through the $5–$25 range.

Residuals Distribution

Residuals tighten with more data

Bell-shaped, centered near zero, noticeably narrower than the 9k run. RMSE dropped from $5.86 to $3.77, a reduction of about 36%, with 2.1x more training examples as the only change. This supports the earlier hypothesis: data volume was the binding constraint, not model architecture.
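
For reference, the three reported metrics can all be derived from the residuals (actual minus predicted). A minimal sketch using their standard definitions, with toy numbers rather than project data:

```python
import math


def regression_metrics(actual, predicted):
    """Compute R², RMSE and MAE from paired actual/predicted prices."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)       # unexplained variation
    rmse = math.sqrt(ss_res / n)

    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": r2, "rmse": rmse, "mae": mae}


# Toy prices in dollars, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE squares each residual before averaging, so a few badly mispriced items pull it up more than they pull up MAE. That is consistent with the reported RMSE ($3.77) sitting well above the MAE ($2.21).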


Technical Architecture

How the data gets from King Kullen to this page

01
Discovery

Category tree extraction

The crawler boots by parsing the PreloadedState JSON embedded in King Kullen's homepage and extracting 476 category IDs from the navigation tree. No brittle CSS selectors. If King Kullen restyles the nav markup, the IDs should still be available in the JSON.
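
A sketch of how that extraction might look. The marker string "window.__PRELOADED_STATE__", the "categoryId" key, and the sample HTML are assumptions made for illustration; the real page may name them differently.

```python
import json


def extract_preloaded_state(html, marker="window.__PRELOADED_STATE__"):
    """Decode the JSON object that follows the (assumed) marker in the page."""
    start = html.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found in page")
    brace = html.find("{", start)
    if brace == -1:
        raise ValueError("no JSON object found after marker")
    # raw_decode parses one JSON value starting at `brace` and ignores
    # whatever trails it (semicolon, closing script tag, and so on).
    state, _end = json.JSONDecoder().raw_decode(html, brace)
    return state


def collect_category_ids(node, key="categoryId"):
    """Walk nested dicts/lists and collect every value stored under `key`."""
    ids = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            value = current.get(key)
            if isinstance(value, (str, int)):
                ids.add(str(value))
            stack.extend(current.values())
        elif isinstance(current, list):
            stack.extend(current)
    return ids


# Hypothetical page fragment standing in for the real homepage.
sample_html = (
    '<script>window.__PRELOADED_STATE__ = {"nav": {"children": ['
    '{"categoryId": 101, "children": [{"categoryId": 102, "children": []}]},'
    '{"categoryId": 205, "children": []}]}};</script>'
)

category_ids = collect_category_ids(extract_preloaded_state(sample_html))
print(sorted(category_ids))
```

Walking the whole tree rather than a fixed path means the sketch keeps working if the nesting depth of the navigation changes, as long as the key name stays the same.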

02
API Crawl

Freshop groupby endpoint

Each category ID hits storefrontgateway.shopkingkullen.com/api/stores/23/categories/{id}/groupby with a polite 1-second delay between requests. The endpoint returns up to 100 products per page. The crawler paginates until the page is empty.
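
The pagination loop can be sketched as below. The HTTP call is left as an injected `fetch_page` callable, a hypothetical helper, because the endpoint's query parameters for paging are not documented on this page; the example runs against an in-memory stand-in instead of the network.

```python
import time


def crawl_category(category_id, fetch_page, page_size=100, delay_seconds=1.0):
    """Collect all products in a category, one page at a time.

    fetch_page(category_id, offset, limit) is a hypothetical callable that
    returns a list of product dicts, empty once the category is exhausted.
    """
    products = []
    offset = 0
    while True:
        items = fetch_page(category_id, offset, page_size)
        if not items:
            break  # an empty page marks the end of the category
        products.extend(items)
        offset += page_size
        time.sleep(delay_seconds)  # polite pause before the next request
    return products


# In-memory stand-in for the API: 250 fake products in one category.
FAKE_CATALOG = [{"upc": f"{i:012d}"} for i in range(250)]
calls = []


def fake_fetch(category_id, offset, limit):
    calls.append(offset)
    return FAKE_CATALOG[offset:offset + limit]


# delay_seconds=0 keeps the demo fast; the crawler described above uses 1.
result = crawl_category("123", fake_fetch, delay_seconds=0)
print(len(result), "products in", len(calls), "requests")
```

Stopping on the first empty page costs one extra request per category, but it avoids relying on a total-count field that the response may or may not include.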

03
Storage

Dated JSONL snapshots

Each run writes a fresh data/snapshots/YYYY-MM-DD.jsonl committed back to this repo. Every record carries UPC, name, current price, regular price, category labels, and a UTC timestamp. The git history is the database.
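
A minimal sketch of the snapshot writer, assuming one JSON object per line and hypothetical field names ("upc", "name", "price", "regular_price", "categories", "crawled_at"). The demo writes to a temporary directory and uses an arbitrary example date.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_snapshot(records, root="data/snapshots", now=None):
    """Write records to <root>/YYYY-MM-DD.jsonl, stamping each with UTC time."""
    now = now or datetime.now(timezone.utc)
    path = Path(root) / f"{now:%Y-%m-%d}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for rec in records:
            row = dict(rec, crawled_at=now.isoformat())
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


# Illustrative record; values are invented for the example.
example = {
    "upc": "000000000001",
    "name": "Organic Whole Milk 64 oz",
    "price": 4.49,
    "regular_price": 4.99,
    "categories": ["Dairy", "Milk"],
}

with tempfile.TemporaryDirectory() as tmp:
    run_time = datetime(2024, 1, 7, 6, 0, tzinfo=timezone.utc)  # example date
    out = write_snapshot([example], root=tmp, now=run_time)
    snapshot_name = out.name
    rows = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]

print(snapshot_name, rows[0]["crawled_at"])
```

Because each line is an independent JSON object, snapshots diff cleanly in git and can be streamed line by line without loading a whole file.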

04
Scheduling

GitHub Actions, every Sunday

A workflow with a schedule: cron trigger (expression 0 6 * * 0) runs the full crawl at 06:00 UTC each Sunday. The run takes roughly 11 minutes. The workflow also declares a workflow_dispatch trigger, so the crawl can be started manually whenever a mid-week check is needed.

05
Modeling

Ridge regression, retrained on accumulation

As weekly snapshots stack up, the model pipeline retrains on the combined history. Current R² sits at 0.590 on one snapshot, up from 0.445 on the 9k run. The target is above 0.60 once four or more months of price change data are available as features.
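
One way price-change features could be derived from stacked snapshots is sketched below. The feature set and the field names ("upc", "price") are assumptions for illustration; the page does not specify which features the retrained model will use.

```python
def price_change_features(snapshots):
    """Build per-UPC price-history features.

    snapshots: list of (date_string, records) tuples, oldest first.
    """
    history = {}
    for _date, records in snapshots:
        for rec in records:
            history.setdefault(rec["upc"], []).append(rec["price"])

    features = {}
    for upc, prices in history.items():
        # Week-over-week differences between consecutive observations.
        changes = [later - earlier for earlier, later in zip(prices, prices[1:])]
        features[upc] = {
            "weeks_observed": len(prices),
            "n_price_changes": sum(1 for c in changes if c != 0),
            "last_change": changes[-1] if changes else 0.0,
            "min_price": min(prices),
            "max_price": max(prices),
        }
    return features


# Three invented weekly snapshots; the second item appears only in week 3.
snapshots = [
    ("2024-01-07", [{"upc": "001", "price": 3.99}]),
    ("2024-01-14", [{"upc": "001", "price": 3.49}]),
    ("2024-01-21", [{"upc": "001", "price": 3.49}, {"upc": "002", "price": 5.99}]),
]

features = price_change_features(snapshots)
print(features["001"])
```

Items that show up in only one snapshot get zero changes rather than being dropped, which keeps newly listed products in the training set.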