We now read every shop's terms of service, and we're failing at exactly the right step

by DealMonitor Team · 6 min read
Tags: compliance, tos, agb, llm, transparency

We built a small job that reads each shop's terms of service once and decides whether automated price retrieval for private users is permitted there. The feature went live this week. We wanted to write about it right away, but honestly, it doesn't fully work yet. Here's the story.

Why we're doing this at all

When you create a tracker for a shop on DealMonitor, a compliance check has long been running in the background: we fetch the domain's robots.txt and respect what it says. If a shop tells crawlers to stay out, we stay out. That's the machine-readable industry standard and, under German law, a fairly clear line.
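
A minimal sketch of what such a robots.txt check can look like, using Python's standard urllib.robotparser. The user agent string and the sample rules are assumptions for illustration, not our production values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; the real one is not stated in this post.
USER_AGENT = "DealMonitorBot"

def allowed_by_robots(robots_txt: str, url: str, agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Made-up rules: everything is open except the checkout area.
SAMPLE = """\
User-agent: *
Disallow: /checkout/
"""

if __name__ == "__main__":
    print(allowed_by_robots(SAMPLE, "https://example.de/product/123"))
    print(allowed_by_robots(SAMPLE, "https://example.de/checkout/cart"))
```

In a real run the robots.txt text would be fetched from the shop's domain first; parsing it from a string keeps the sketch testable offline.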

What robots.txt doesn't cover: the fine print in a shop's terms of service about "automated retrieval of content." Some shops explicitly forbid scraping, bots, crawlers, or even "automated duplication of contents." Others say nothing about it. Still others explicitly allow it for non-commercial use.

So far that has been a blind spot for us. We wanted to close it, not because a lawyer knocked, but because it's the more honest way to deal with shops. And because a tracker whose shop has explicitly forbidden scraping is going to hit a wall sooner or later, legally or technically.

What we built

The pipeline is conceptually simple:

  1. Find the TOS link: we fetch the shop's homepage and scan it for footer links that plausibly lead to terms of service in one of six languages (German, English, French, Italian, Spanish, Dutch), matching terms like "AGB," "Terms of Service," "Conditions générales." Off-domain links are rejected (anyone hosting their terms at a law firm gets skipped; an external host is too easy to spoof).
  2. Fetch the TOS text: we load the page, strip out scripts, styles and navigation, and keep the actual contract text.
  3. LLM classification: we hand the text to a small language model with the question: "Do these terms allow a private user to retrieve prices automatically, for non-commercial purposes, at low frequency?" The answer comes back as ALLOWED, DISALLOWED, or UNCLEAR, plus a one-sentence reason in German.
  4. Persist: the result lands on the shop record in our database and shows up the next morning in our daily status report next to the existing robots.txt verdict.
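
Steps 1 and 2 can be sketched with the standard library alone. This is an illustration under assumptions, not our actual code: the keyword list is shortened to the three examples named above, the function names are hypothetical, and the host comparison is deliberately strict (a "www." prefix counts as a different host).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Shortened, illustrative keyword list (the real one covers six languages).
TOS_KEYWORDS = ("agb", "terms of service", "conditions générales")

class _LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

class _TextExtractor(HTMLParser):
    """Keeps visible text, skipping script, style and nav elements."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def find_tos_url(homepage_url, html):
    """Step 1: first same-host link whose text looks like a TOS link."""
    collector = _LinkCollector()
    collector.feed(html)
    home_host = urlparse(homepage_url).hostname
    for href, text in collector.links:
        if not any(k in text.lower() for k in TOS_KEYWORDS):
            continue
        absolute = urljoin(homepage_url, href)
        if urlparse(absolute).hostname == home_host:  # reject off-domain links
            return absolute
    return None

def extract_text(html):
    """Step 2: contract text without scripts, styles and navigation."""
    extractor = _TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)
```

The sketch scans every link on the page rather than only the footer, and a bare substring match on "agb" would need tightening before real use.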

What the daily status would look like when it all works:

🆕 5 new shops (24h):
  • example.de - ✓ robots ok - ✓ TOS ok - 3 trackers
  • shop.com - ✓ robots ok - 🚫 TOS forbids scraping - 1 tracker
  • mystery.io - ❓ robots not checked - ❓ TOS not found - 1 tracker
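
One way such a status line could be rendered from a shop record, as a rough sketch. The ShopStatus type, its field names, and the wording for the "robots disallow" and "TOS unclear" cases are assumptions; only the three lines above come from the planned report.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShopStatus:
    domain: str
    robots_ok: Optional[bool]    # None means "not checked"
    tos_verdict: Optional[str]   # ALLOWED, DISALLOWED, UNCLEAR, or None if no TOS was found
    trackers: int

# Wording for UNCLEAR is assumed; the other entries mirror the sample report.
TOS_LABELS = {
    "ALLOWED": "✓ TOS ok",
    "DISALLOWED": "🚫 TOS forbids scraping",
    "UNCLEAR": "❓ TOS unclear",
    None: "❓ TOS not found",
}

def format_status_line(s: ShopStatus) -> str:
    if s.robots_ok is None:
        robots = "❓ robots not checked"
    elif s.robots_ok:
        robots = "✓ robots ok"
    else:
        robots = "🚫 robots disallow"  # assumed wording
    noun = "tracker" if s.trackers == 1 else "trackers"
    return f"  • {s.domain} - {robots} - {TOS_LABELS[s.tos_verdict]} - {s.trackers} {noun}"
```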

Where it breaks: LLM inference

We wanted to do this locally, on purpose. A small server of our own, an open-source model (gemma4:e4b on Ollama), no data sent to OpenAI, Anthropic, or Google. That's a position we like: you shouldn't have to guess where your requests are processed. We don't want to guess either.
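
For readers who want to try something similar, here is a hedged sketch of a local classification call against Ollama's /api/generate endpoint on its default port. The question is the one quoted in this post; the rest of the prompt wording, the function names, and the fallback to UNCLEAR for unparseable replies are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "gemma4:e4b"  # model tag as named in this post
QUESTION = ("Do these terms allow a private user to retrieve prices automatically, "
            "for non-commercial purposes, at low frequency?")
VERDICTS = ("ALLOWED", "DISALLOWED", "UNCLEAR")

def build_prompt(tos_text: str) -> str:
    # Everything beyond QUESTION is assumed wording.
    return (f"{QUESTION}\nAnswer with ALLOWED, DISALLOWED, or UNCLEAR, "
            f"followed by a one-sentence reason in German.\n\n{tos_text}")

def parse_verdict(raw: str) -> tuple[str, str]:
    """Split a model reply into (verdict, reason); fall back to UNCLEAR."""
    stripped = raw.strip()
    upper = stripped.upper()
    for verdict in VERDICTS:
        if upper.startswith(verdict):
            reason = stripped[len(verdict):].lstrip(" :.-\n")
            return verdict, reason
    return "UNCLEAR", stripped

def classify(tos_text: str, timeout: float = 180.0) -> tuple[str, str]:
    """Send one non-streaming request; raises on timeout or connection errors."""
    payload = json.dumps(
        {"model": MODEL, "prompt": build_prompt(tos_text), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return parse_verdict(body["response"])
```

A caller would wrap classify() in a try/except and treat a timeout as "no verdict yet" rather than as ALLOWED.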

The box that runs it has 12 GB of RAM and no GPU, and it is deliberately small. On the first real test against a normal shop (a Shopify TOS with ~13 KB of plain text), here's what we got:

  • "Say hi in one word." → 9 seconds ✓
  • 4 KB of TOS text plus classification prompt → timeout after 180 seconds ✗
  • 40 KB of TOS text plus classification prompt → timeout after 180 seconds ✗

Pure CPU inference takes several minutes per shop for that kind of classification on the small box. With a cap of 20 shops per daily run, we'd be looking at almost an hour of inference time in the 4 a.m. slot, and that won't run reliably.
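
The back-of-the-envelope bound behind that estimate, assuming every call is cut off at the 180-second timeout from the test above (the function name is ours for illustration):

```python
SHOPS_PER_RUN = 20      # daily cap mentioned above
TIMEOUT_SECONDS = 180   # per-call timeout from the test

def max_run_minutes(shops: int = SHOPS_PER_RUN, per_shop_s: int = TIMEOUT_SECONDS) -> float:
    """Upper bound on inference time when each call may use the full timeout."""
    return shops * per_shop_s / 60

print(max_run_minutes())  # 60.0
```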

What works (and what we're building on)

The first two pipeline steps are solid. We tested against three random shops (cabletex.de, mediamarkt.de, thomann.de):

  • cabletex.de → /policies/terms-of-service ✓
  • mediamarkt.de → /de/legal/terms ✓
  • thomann.de → /compinfo_terms.html ✓

So the TOS URL is reliably discovered, and that is the step most likely to break in practice, since every shop arranges its footer differently. Persistence, the daily-status wiring, and the tracking_status field that will let us drop disallowed shops from the scraping pool automatically all work cleanly.
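
How a tracking_status field could gate the scraping pool, as a small sketch. Only the field name comes from our code; the Shop type, the status values "active" and "disallowed", and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Shop:
    domain: str
    tracking_status: str  # assumed values: "active" or "disallowed"

def next_tracking_status(tos_verdict: str) -> str:
    """Only an explicit DISALLOWED verdict removes a shop; UNCLEAR stays active."""
    return "disallowed" if tos_verdict == "DISALLOWED" else "active"

def scraping_pool(shops):
    """Shops that may still be scraped on the next run."""
    return [s for s in shops if s.tracking_status != "disallowed"]
```

Whether UNCLEAR should keep a shop active is a policy choice; the sketch picks the permissive reading only to stay short.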

What's missing is working inference. Three realistic options:

  1. More hardware: a GPU in the LXC container. That brings 120 seconds down to 2–5 seconds. It's the cleanest solution, but it costs a bit more power and one more round of tinkering at the host level.
  2. Hosted API with a clear contract: Claude Haiku or a similarly small hosted model. 200 milliseconds instead of 120 seconds. At our frequency (~10 calls per day) the cost is under ten cents per month. But it does mean the TOS text gets sent to an external provider once per shop. No customer data is involved, but it's not "fully on our side" anymore.
  3. Drop the LLM and use a keyword heuristic: if words like "scraping forbidden," "crawler," "automated retrieval" appear in the TOS text, mark DISALLOWED, otherwise ALLOWED. Available immediately but imprecise, with lots of false positives on shops that use those words in the context of "we protect ourselves from…"
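
To make the weakness of Option 3 concrete, here is a minimal sketch of such a keyword heuristic. The keyword tuple holds just the three examples from the list above, and the function name is hypothetical.

```python
# The three example phrases from Option 3; a real list would be longer
# and multilingual.
KEYWORDS = ("scraping forbidden", "crawler", "automated retrieval")

def keyword_verdict(tos_text: str) -> str:
    """DISALLOWED if any keyword appears anywhere in the text, else ALLOWED."""
    lowered = tos_text.lower()
    return "DISALLOWED" if any(k in lowered for k in KEYWORDS) else "ALLOWED"

if __name__ == "__main__":
    # A genuine prohibition is caught.
    print(keyword_verdict("Automated retrieval of content is prohibited."))
    # False positive: the shop only describes its own protection measures.
    print(keyword_verdict("We protect our site against malicious crawlers."))
```

Both calls report DISALLOWED, although only the first sentence actually forbids anything.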

We're leaning toward Option 2 as the interim solution, with an honest disclosure that this is how it runs, and Option 1 as the end state once a GPU is available.

What changes for you today (spoiler: little)

Concretely: nothing. Trackers run as before, the robots.txt check stays active, and shops that have manually opted out via the shop opt-out form stay opted out. The TOS check will go active once we've sorted the inference question, and as soon as it returns results, you'll see them in the daily status next to the robots.txt verdict.

We could have waited and written this once everything was fixed (clean story, all shiny). Instead we're writing it now, because two things about it matter: that we're doing this check at all, and that sometimes the honest answer is "doesn't work yet, here's the plan."

One request

If you're a shop owner reading this and you have clear rules in your TOS, drop us a quick mail at [email protected]. A clean manual flag is always better than what our model guesses. And if you're a user and you know a shop that explicitly forbids scraping, the same applies.
