We now read every shop's terms of service — and we're failing at exactly the right step

We built a small job that reads each shop's terms of service once and decides whether automated price retrieval for private users is permitted there. The feature went live this week. We wanted to write about it right away, but honestly, it doesn't fully work yet. Here's the story.

Why we're doing this at all

When you create a tracker for a shop on DealMonitor, a compliance check has long been running in the background: we fetch the domain's robots.txt and respect what it says. If a shop tells crawlers to stay out, we stay out. That's the machine-readable industry standard and, under German law, a fairly clear line.
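A minimal sketch of such a robots.txt check, using Python's standard urllib.robotparser. The user agent name DealMonitorBot and the sample rules are assumptions for illustration; a real check would first download the file from the shop's domain.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "DealMonitorBot") -> bool:
    """Return True if the given robots.txt content permits fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt: everything is open except the checkout area.
SAMPLE = """\
User-agent: *
Disallow: /checkout/
"""

print(is_allowed(SAMPLE, "https://example.de/product/1"))
print(is_allowed(SAMPLE, "https://example.de/checkout/cart"))
```

With these sample rules the product URL is permitted and the checkout URL is not.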

What robots.txt doesn't cover: the fine print in a shop's terms of service about "automated retrieval of content." Some shops explicitly forbid scraping, bots, crawlers, or even "automated duplication of contents." Others say nothing about it. Still others explicitly allow it for non-commercial use.

So far that has been a blind spot for us. We wanted to close it, not because a lawyer knocked, but because it's the more honest way to deal with shops. And because a tracker on a shop that has explicitly forbidden scraping is going to hit a wall sooner or later, legally or technically.

What we built

The pipeline is conceptually simple:

1. Find the TOS link: we fetch the shop's homepage and scan it for footer links that plausibly lead to terms of service in one of six languages (German, English, French, Italian, Spanish, Dutch), using terms like "AGB," "Terms of Service," "Conditions générales." Off-domain links are rejected (anyone hosting their terms at a law firm gets skipped, because that is too easy to be spoofed).
2. Fetch the TOS text: we load the page, strip out scripts, styles and navigation, and keep the actual contract text.
3. LLM classification: we hand the text to a small language model with the question: "Do these terms allow a private user to retrieve prices automatically, for non-commercial purposes, at low frequency?" The answer comes back as ALLOWED, DISALLOWED, or UNCLEAR plus a one-sentence reason in German.
4. Persist: the result lands on the shop record in our database and shows up the next morning in our daily status report next to the existing robots.txt verdict.
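The first two steps can be sketched with nothing but Python's standard library. This is an illustration under assumptions, not the production code: the hint list contains only the three terms quoted above, the class and function names are made up, and the host comparison is deliberately strict (www.shop.example and shop.example would count as different hosts).

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlparse

# Only the three example terms from the text; a real list would cover all six languages.
TOS_HINTS = ("agb", "terms of service", "conditions générales")

class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a href=...> element."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def find_tos_url(homepage_url: str, html: str) -> Optional[str]:
    """Step 1: return the first same-host link whose text looks like a TOS link."""
    collector = LinkCollector()
    collector.feed(html)
    home_host = urlparse(homepage_url).hostname
    for href, text in collector.links:
        if not any(hint in text.lower() for hint in TOS_HINTS):
            continue
        candidate = urljoin(homepage_url, href)
        if urlparse(candidate).hostname == home_host:  # reject off-domain links
            return candidate
    return None

class TextExtractor(HTMLParser):
    """Keeps visible text and skips anything inside script, style, or nav."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    """Step 2: reduce a fetched page to its text content."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.parts)
```

For a footer that lists an off-domain "AGB" link before a relative "Terms of Service" link, find_tos_url skips the first and resolves the second against the homepage URL.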

What the daily status would look like when it all works:

🆕 5 new shops (24h):
  • example.de — ✓ robots ok — ✓ TOS ok — 3 trackers
  • shop.com — ✓ robots ok — 🚫 TOS forbids scraping — 1 tracker
  • mystery.io — ❓ robots not checked — ❓ TOS not found — 1 tracker

Where it breaks: LLM inference

We wanted to do this locally, on purpose. A small server of our own, an open-source model (gemma4:e4b on Ollama), no data sent to OpenAI, Anthropic, or Google. That's a position we like — you shouldn't have to guess where your requests are processed. We don't want to guess either.

The box that runs it has 12 GB of RAM and no GPU — deliberately small. On the first real test against a normal shop (a Shopify TOS with ~13 KB of plain text), here's what we get:

"Say hi in one word." → 9 seconds ✓
4 KB of TOS text plus classification prompt → timeout after 180 seconds ✗
40 KB of TOS text plus classification prompt → timeout after 180 seconds ✗
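For orientation, a classification call of this kind can be sketched against Ollama's local HTTP endpoint (/api/generate on port 11434, with streaming turned off). The prompt wording beyond the quoted question, the "VERDICT: reason" answer format, and the function names are assumptions; the 180-second timeout mirrors the tests above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
VERDICTS = ("ALLOWED", "DISALLOWED", "UNCLEAR")

QUESTION = ("Do these terms allow a private user to retrieve prices automatically, "
            "for non-commercial purposes, at low frequency?")

def build_payload(tos_text: str, model: str = "gemma4:e4b") -> dict:
    """Build the JSON body for a single, non-streaming generate request."""
    prompt = (
        f"{QUESTION}\n"
        "Answer with ALLOWED, DISALLOWED, or UNCLEAR, then a colon "
        "and a one-sentence reason in German.\n\n"
        f"{tos_text}"
    )
    return {"model": model, "prompt": prompt, "stream": False}

def parse_verdict(answer: str) -> tuple:
    """Split 'VERDICT: reason'; anything unexpected is treated as UNCLEAR."""
    head, _, reason = answer.strip().partition(":")
    verdict = head.strip().upper()
    if verdict not in VERDICTS:
        return "UNCLEAR", answer.strip()
    return verdict, reason.strip()

def classify(tos_text: str, timeout: float = 180) -> tuple:
    """Send the TOS text to the local model; raises on timeout or connection error."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(tos_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return parse_verdict(body["response"])
```

Falling back to UNCLEAR on any malformed answer is a design choice in this sketch: a model that rambles should never be read as a permission.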

Pure CPU inference takes several minutes per shop for that kind of classification on the small box. With a cap of 20 shops per daily run and at least three minutes per shop, we'd be looking at an hour or more of inference time in the 4 a.m. slot. That won't run reliably.
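As a rough check on that estimate: both TOS requests in the test ran into the 180-second timeout, so 180 seconds is a lower bound per shop, not a measured duration. The helper name below is made up for illustration, and it assumes shops are classified one after another.

```python
def min_run_minutes(shops: int, seconds_per_shop: float) -> float:
    """Lower bound on the run time when shops are classified sequentially."""
    return shops * seconds_per_shop / 60

# 20 shops at the 180-second timeout each
print(min_run_minutes(20, 180))  # 60.0
```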

What works (and what we're building on)

The first two pipeline steps are solid. We tested against three random shops (cabletex.de, mediamarkt.de, thomann.de):

cabletex.de → /policies/terms-of-service ✓
mediamarkt.de → /de/legal/terms ✓
thomann.de → /compinfo_terms.html ✓

So on this small sample the TOS URL was found every time, and that is the step most likely to break in practice, since every shop arranges its footer differently. Persistence, the daily-status wiring, and the tracking_status field that will let us drop disallowed shops from the scraping pool automatically all work cleanly.
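How a tracking_status field could gate the scraping pool, sketched under loud assumptions: the Shop record, the status values "active" and "blocked", and the choice to keep UNCLEAR shops in the pool are all hypothetical. Only the field name tracking_status and the three verdicts come from the text above.

```python
from dataclasses import dataclass

@dataclass
class Shop:
    domain: str
    tos_verdict: str                 # "ALLOWED", "DISALLOWED", or "UNCLEAR"
    tracking_status: str = "active"  # hypothetical values: "active" / "blocked"

def apply_tos_verdict(shop: Shop) -> Shop:
    """Block a shop once its terms are classified as forbidding retrieval."""
    if shop.tos_verdict == "DISALLOWED":
        shop.tracking_status = "blocked"
    return shop

def scraping_pool(shops: list) -> list:
    """Return the domains that remain eligible for scraping."""
    return [s.domain for s in map(apply_tos_verdict, shops)
            if s.tracking_status == "active"]

shops = [
    Shop("example.de", "ALLOWED"),
    Shop("shop.com", "DISALLOWED"),
    Shop("mystery.io", "UNCLEAR"),
]
print(scraping_pool(shops))  # ['example.de', 'mystery.io']
```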

What's missing is working inference. Three realistic options:

1. More hardware: a GPU in the LXC container. That brings several minutes per shop down to 2–5 seconds. Cleanest solution, costs a bit more power and one more round of tinkering at the host level.
2. Hosted API with a clear contract: Claude Haiku or a similarly small hosted model. About 200 milliseconds instead of several minutes. At our frequency (~10 calls per day) the cost is under ten cents per month. But it does mean the TOS text gets sent to an external provider once per shop. No customer data involved, but it's not "fully on our side" anymore.
3. Drop the LLM and use a keyword heuristic: if words like "scraping forbidden," "crawler," "automated retrieval" appear in the TOS text, mark DISALLOWED, otherwise ALLOWED. Available immediately, imprecise, lots of false positives on shops that use those words in the context of "we protect ourselves from…"
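To make the false-positive problem of the keyword heuristic concrete, here is a minimal sketch. The keyword list is taken from the examples above; the function name and the sample sentences are invented for illustration.

```python
KEYWORDS = ("scraping forbidden", "crawler", "automated retrieval")

def keyword_verdict(tos_text: str) -> str:
    """Mark DISALLOWED if any keyword appears anywhere in the text."""
    text = tos_text.lower()
    return "DISALLOWED" if any(keyword in text for keyword in KEYWORDS) else "ALLOWED"

# A genuine prohibition is caught ...
print(keyword_verdict("Automated retrieval of content is prohibited."))
# ... but so is a sentence that merely mentions a crawler.
print(keyword_verdict("We use a crawler-detection service to protect our site."))
```

Both calls return DISALLOWED, although only the first sentence forbids anything. A substring match cannot tell a prohibition from a mention.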

We're leaning toward Option 2 as the interim solution — with an honest disclosure that this is how it runs — and Option 1 as the end state, once GPU is available.

What changes for you today (spoiler: little)

Concretely: nothing. Trackers run as before, the robots.txt check stays active, shops that have manually opted out via the shop opt-out form stay opted out. The TOS check will go active once we've sorted the inference question — and as soon as it returns results, you'll see them in the daily status next to the robots.txt verdict.

We could have waited and written this once everything was fixed — clean story, all shiny. Instead we're writing it now, because two things about it matter: that we're doing this check at all, and that sometimes the honest answer is "doesn't work yet, here's the plan."

One request

If you're a shop owner reading this and you have clear rules in your TOS — drop us a quick mail at [email protected]. A clean manual flag is always better than what our model guesses. And if you're a user and you know a shop that explicitly forbids scraping: same way.