A New Champion
This past weekend, something happened that we internally call a “Regime Change”: Our price detection model based on LightGBM was replaced by a CatBoost model. That sounds technical — and it is. But the effects are directly noticeable: price detection on your product pages just got more accurate.
To mark the occasion, we want to give you a peek behind the curtain at how our AI-powered price detection works, how we train our models, and why CatBoost now comes out on top.
The Problem: One Number Among Many
A typical product page contains dozens of numbers: article IDs, ratings, shipping costs, quantities, crossed-out prices, variant prices. The actual purchase price is just one of them. Our model needs to pick the right one from all these candidates — on any website, regardless of layout, language, or shop system.
How Our Pipeline Works
Price detection runs in several stages:
Stage 1: Collecting Candidates
When you add a URL for tracking, our system analyzes the complete page structure. Every element that could contain a price is identified. We use multiple sources in parallel:
- Structured data: JSON-LD and Schema.org markup that many shops provide for search engines.
- DOM analysis: Every text element is examined for price-like patterns — numbers with currency symbols, decimal separators, etc.
- JavaScript extraction: For shops with configurable products (e.g., different sizes), we extract variant prices directly from embedded JavaScript.
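To make the first two sources concrete, here is a minimal sketch using only the Python standard library. It is not our production code: the regular expression, the class name `CandidateCollector`, and the `(source, value)` tuple format are assumptions chosen for illustration, and JavaScript extraction is left out entirely.

```python
import json
import re
from html.parser import HTMLParser

# Amount with a currency symbol before or after it, e.g. "€19.99" or "4,95 €".
# (Illustrative pattern only; real price formats are far more varied.)
PRICE_RE = re.compile(
    r"[$€£]\s?(\d+(?:[.,]\d{2})?)|(\d+(?:[.,]\d{2})?)\s?[$€£]"
)


def iter_jsonld_prices(node):
    """Yield every value stored under a "price" key, at any nesting level."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "price":
                yield value
            else:
                yield from iter_jsonld_prices(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_jsonld_prices(item)


class CandidateCollector(HTMLParser):
    """Hypothetical collector that gathers (source, value) price candidates."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._script_type = None  # None while outside any <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._script_type = dict(attrs).get("type") or ""

    def handle_endtag(self, tag):
        if tag == "script":
            self._script_type = None

    def handle_data(self, data):
        if self._script_type is None:
            # Ordinary DOM text: scan for price-like patterns.
            for match in PRICE_RE.finditer(data):
                amount = match.group(1) or match.group(2)
                value = float(amount.replace(",", "."))
                self.candidates.append(("dom", value))
        elif self._script_type == "application/ld+json":
            try:
                payload = json.loads(data)
            except json.JSONDecodeError:
                return  # malformed markup is skipped, not fatal
            for raw in iter_jsonld_prices(payload):
                try:
                    self.candidates.append(("json-ld", float(raw)))
                except (TypeError, ValueError):
                    pass  # non-numeric "price" entries are ignored


def collect_candidates(html):
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    return collector.candidates
```

Calling `collect_candidates(page_html)` returns one entry per candidate, tagged with the source it came from, so later stages can use the source as a feature.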
Stage 2: Feature Extraction
For each price candidate, we compute roughly two dozen features that help the model distinguish the real price from noise:
- HTML context: Does the surrounding element contain words like “price”, “offer”, or “current”? Is the text visually emphasized (bold, large font)?
- Page position: How deeply nested is the element in the DOM? Where does it sit relative to other candidates?
- Statistical context: How does the value compare to other numbers on the page? Is it an outlier or within a typical price range?
- Shop-specific signals: How well does the model historically detect prices on this domain? Some shops are harder than others.
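A simplified sketch of what such a feature vector might look like is shown here. The `Candidate` fields, the feature names, and the keyword list are illustrative assumptions, and only a handful of features are computed instead of the roughly two dozen used in practice. Shop-specific signals are omitted because they depend on historical data.

```python
from dataclasses import dataclass
from statistics import median

# Keywords taken from the HTML-context description above.
PRICE_WORDS = ("price", "offer", "current")
# Assumed set of tags that count as visual emphasis in this sketch.
EMPHASIS_TAGS = ("b", "strong", "h1", "h2")


@dataclass
class Candidate:
    value: float   # the number found on the page
    context: str   # text and attributes of the surrounding element
    tag: str       # HTML tag name, e.g. "span"
    depth: int     # nesting depth in the DOM
    source: str    # "json-ld", "dom", or "javascript"


def extract_features(candidate, all_candidates):
    """Build a small feature dict for one candidate on one page."""
    page_median = median(c.value for c in all_candidates)
    context = candidate.context.lower()
    return {
        # HTML context
        "has_price_word": any(word in context for word in PRICE_WORDS),
        "is_emphasized": candidate.tag in EMPHASIS_TAGS,
        # Page position
        "dom_depth": candidate.depth,
        # Statistical context: 1.0 means "typical for this page"
        "ratio_to_median": candidate.value / page_median if page_median else 0.0,
        # Categorical features, kept as strings
        "html_tag": candidate.tag,
        "source": candidate.source,
    }
```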
Stage 3: Prediction
All candidates with their features are sent to our ML service. The model scores each candidate with a probability: “How confident am I that this is the actual product price?” The candidate with the highest score wins.
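In code, the selection step boils down to an argmax over the scores. The sketch below assumes a `score_fn` callable that stands in for the model; that name and the return shape are hypothetical.

```python
def pick_price(candidates, score_fn):
    """Return (best_candidate, best_score) for one page.

    score_fn is a stand-in for the model: it maps a candidate to a
    probability between 0 and 1.
    """
    if not candidates:
        raise ValueError("no price candidates found on this page")
    scored = [(score_fn(candidate), candidate) for candidate in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score
```

Returning the score along with the winner is what makes it possible to show how confident the model is in its choice.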
Training: How the Model Learns
Our model learns from real user data. Every time you confirm or correct a price, that feedback flows back as a training signal. The mapping “this number on this page is the correct price” becomes a labeled example for the next training run.
The challenge: out of all candidates on a page, typically only one is the correct price, so the ratio of correct to incorrect candidates is roughly 1:50. This imbalance must be accounted for during training; otherwise the model simply learns to classify everything as “not a price.”
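One common way to account for such an imbalance is to weight the rare class more heavily in the loss. The sketch below shows the arithmetic for a 1:50 ratio; the function name is made up for illustration, and this is a generic technique, not a description of our exact training setup.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by (size of largest class) / (size of this class)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / count for cls, count in counts.items()}


# One correct price (label 1) among fifty wrong candidates (label 0).
labels = [1] + [0] * 50
weights = balanced_class_weights(labels)
# The single positive example now counts as much as all fifty negatives.
```

With these weights, a model that labels everything as “not a price” is penalized as heavily for the one missed price as it is rewarded for the fifty correct rejections.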
We regularly train multiple model types in parallel and compare their performance on a held-out test set. The test set is strictly split by page — the model is never tested on pages it has seen during training.
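A page-level split can be sketched by hashing the page URL, so that all candidates from one page always land on the same side. The `page_url` key and the 20% test fraction are assumptions for this example.

```python
import hashlib


def assign_split(page_url, test_fraction=0.2):
    """Deterministically map a page to "train" or "test"."""
    digest = hashlib.sha256(page_url.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_by_page(rows, test_fraction=0.2):
    """Split candidate rows so that no page appears in both sets."""
    train, test = [], []
    for row in rows:
        if assign_split(row["page_url"], test_fraction) == "test":
            test.append(row)
        else:
            train.append(row)
    return train, test
```

Splitting by candidate row instead would leak information: the model could see 49 candidates from a page during training and be tested on the 50th.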
Why CatBoost Won
In our latest model comparison, CatBoost outperformed the previous LightGBM model (which had been in production since January) on the key metrics:
- Top-1 accuracy of 80%: For 4 out of 5 product pages, the model identifies the correct price on the first try.
- Top-3 accuracy of 84%: When considering the three best candidates, the correct price is among them on 84% of pages.
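For readers who want the metric spelled out: top-k accuracy is the share of pages where the correct candidate is among the k highest-scored ones. The sketch below assumes each page is given as a pair of a score list and the index of the correct candidate; the sample scores are invented purely to show the calculation.

```python
def top_k_accuracy(pages, k):
    """pages: list of (scores, correct_index) pairs, one per product page."""
    if not pages:
        raise ValueError("need at least one page to evaluate")
    hits = 0
    for scores, correct_index in pages:
        # Candidate indices, ordered from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if correct_index in ranked[:k]:
            hits += 1
    return hits / len(pages)


# Invented toy data: the correct candidate ranks first on 4 of 5 pages.
toy_pages = [
    ([0.9, 0.1, 0.2], 0),
    ([0.3, 0.6, 0.1], 1),
    ([0.2, 0.5, 0.3], 2),  # correct candidate is only ranked second
    ([0.7, 0.2, 0.1], 0),
    ([0.1, 0.8, 0.05, 0.05], 1),
]
top1 = top_k_accuracy(toy_pages, k=1)
top3 = top_k_accuracy(toy_pages, k=3)
```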
What makes CatBoost better? Two factors stand out:
Better handling of class imbalance. CatBoost has a built-in option for automatic class weight balancing that works more robustly in practice than the manual weight calibration we relied on with LightGBM.
Smarter processing of categorical features. Features like HTML tag type or candidate source (JSON-LD vs. DOM text vs. JavaScript) are processed natively by CatBoost, without us having to manually encode them as numbers. This reduces information loss.
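Both points map to documented CatBoost options: `auto_class_weights` for balancing and `cat_features` for native categorical handling. The configuration below is a hedged sketch, not our production setup; the feature names are hypothetical, and the `catboost` package is imported lazily so the settings can be inspected without it installed.

```python
# Hypothetical column names for the categorical features mentioned above.
CATEGORICAL_FEATURES = ["html_tag", "candidate_source"]

CATBOOST_PARAMS = {
    "loss_function": "Logloss",        # binary target: price / not a price
    "auto_class_weights": "Balanced",  # built-in class weight balancing
    "cat_features": CATEGORICAL_FEATURES,  # passed as strings, no manual encoding
}


def build_model():
    """Create an untrained classifier; requires the third-party catboost package."""
    from catboost import CatBoostClassifier

    return CatBoostClassifier(**CATBOOST_PARAMS)
```

With a pandas DataFrame as training input, the categorical columns can stay as plain strings such as "span" or "json-ld".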
Automatic Retraining
Our pipeline doesn’t just train models once — it does so continuously. Every day, the current best model is retrained with new data. Once a week, a full comparison of all model configurations runs — that’s how we discovered the “Regime Change” to CatBoost.
The detector service that performs real-time price detection loads new models automatically. Going from discovering a better model to running it in production takes only minutes.
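The weekly comparison can be thought of as a champion/challenger check. The sketch below is a simplification under stated assumptions: it compares on a single metric, the function name and the `min_gain` parameter are hypothetical, and the scores in the usage note are placeholders, not our measured results.

```python
def select_champion(current, results, min_gain=0.0):
    """Return the name of the model that should serve production traffic.

    results maps model name to its score on the held-out test set
    (higher is better). The current champion is replaced only if a
    challenger beats it by more than min_gain.
    """
    challenger = max(results, key=results.get)
    current_score = results.get(current, float("-inf"))
    if challenger != current and results[challenger] > current_score + min_gain:
        return challenger
    return current
```

For example, `select_champion("model_a", {"model_a": 0.5, "model_b": 0.6})` returns `"model_b"`, while a challenger that merely ties the champion leaves the champion in place.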
What This Means for You
In short: better price detection, fewer manual corrections needed. You should especially notice improvements on shops with complex page layouts, multiple price variants, or unusual presentations.
When the model is uncertain, you’ll see it in the confidence indicator during tracker creation. In those cases, you can simply confirm the price manually — and help the model learn for its next training round at the same time.
Try it out and create your next tracker — CatBoost is now handling the price detection.
