Case Study

How a price-monitoring scraper cut pharmacy purchasing costs by 25%

Zenith Automate | February 10, 2026 · 10 min read

A daily scraper turned scattered, manual price checks across pharmaceutical wholesalers into a single source of truth, and cut purchasing costs by 25% on 100k+ products. The full story: the problem, the architecture, the hard parts (normalisation and validation), the results, and what it takes to keep a system like this alive.

A pharmacy buyer's job sounds simple: buy the right products, at the best price, from the right wholesaler. In practice it means checking thousands of prices that change every day, across suppliers that each have their own login, their own product codes, and their own idea of what "available" means. Done by hand, it is slow, error-prone, and impossible to do at scale.

This is the full story of replacing that manual grind with a scraper that runs every morning before the team logs in, and the 25% drop in purchasing costs that followed. It is also a concrete, real-world example of the principles in my field guide to reliable data pipelines, so I will go deeper than a typical case study into how it actually works.

100k+ products tracked daily

−25% purchasing cost

98% data accuracy

Key takeaways

The problem was never "scrape a website". It was "always know the cheapest reliable source, before the buying decision".
Reliability beat cleverness. The whole system was built to be trustworthy every single morning, not impressive in a demo.
Normalisation across suppliers is where the value hid: the same product, comparable across catalogues with different codes and names.
Validation made the data safe to act on. A 25% cost reduction came from acting on complete, validated, historical data instead of a stale sample.

The problem: prices move faster than people can check them

The client monitored more than 100,000 products. Prices and stock levels shifted daily across multiple wholesalers. The existing process was a person opening tabs, copying numbers into a spreadsheet, and eyeballing which supplier was cheaper that day.

Three things broke repeatedly:

Coverage. Nobody can check 100k products by hand, so most decisions were made on a small, stale sample, maybe a few hundred of the most obvious items. The long tail was simply ignored, and the long tail is where a lot of the savings hide.
Timing. By the time a manual comparison was finished, the prices had already changed. The work was out of date before it was done.
Trust. Manual copy-paste introduced errors, and errors in purchasing decisions cost real money. A single transposed digit could turn a good deal into a bad one.

The goal was not "scrape a website." It was "always know the cheapest reliable source for every product, before the buying decisions are made." That reframing changed everything about how the system was built.

The approach: a boring, reliable daily pipeline

The interesting engineering here is not cleverness but reliability. A scraper that works 95% of the time is worse than useless for purchasing: you cannot tell which 5% is wrong, so you cannot fully trust any of it. The whole system was built around being trustworthy and repeatable, in the same four stages every reliable scraper uses.

1. Fetch. Authenticate into each wholesaler, handle anti-bot protection, and walk the full catalogue with pagination and retries, without getting blocked.
2. Parse. Turn each supplier's pages into structured rows, resilient to small layout changes, using stable anchors rather than brittle selectors.
3. Validate. Reject prices outside a sane range, missing products, or pages that did not load, before anything is stored. No bad row reaches a buyer.
4. Deliver. Write a dated price history and a side-by-side comparison straight into the buyers' Excel workflow, with alerts on anything unusual.
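To make the four stages concrete, here is a minimal sketch of how they can be wired together. It is not the production code: the names PriceRow and run_pipeline are hypothetical, and each stage is passed in as a plain function so the skeleton stays independent of any particular wholesaler.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable


@dataclass(frozen=True)
class PriceRow:
    """One parsed price observation (hypothetical shape, for illustration)."""
    supplier: str
    product_code: str
    price: float
    seen_on: date


def run_pipeline(
    fetch: Callable[[], Iterable[str]],
    parse: Callable[[str], Iterable[PriceRow]],
    validate: Callable[[PriceRow], bool],
    deliver: Callable[[list[PriceRow]], None],
) -> tuple[int, int]:
    """Run fetch -> parse -> validate -> deliver and return (kept, rejected)."""
    kept: list[PriceRow] = []
    rejected = 0
    for page in fetch():
        for row in parse(page):
            # Validation happens before anything is stored or delivered.
            if validate(row):
                kept.append(row)
            else:
                rejected += 1
    deliver(kept)
    return len(kept), rejected
```

Returning the rejected count alongside the kept count is a deliberate choice in this sketch: a run that quietly drops half its rows should be visible in the run summary, not hidden inside the validator.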

The architecture, briefly

Under the hood it is deliberately simple: Python for the scraping and processing, a scheduled job that runs each morning, a lightweight database for the dated history, and an export step that produces the comparison the buyers actually use. No exotic infrastructure, because exotic infrastructure is one more thing to break at 9am. The cleverness went into the data, not the stack.
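As an illustration of the "lightweight database for the dated history", here is a sketch using SQLite from the Python standard library. The choice of SQLite, the table name and the column names are assumptions made for the example; the article only says the database is lightweight.

```python
import sqlite3
from datetime import date


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the price history store. Schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS price_history (
            seen_on    TEXT NOT NULL,
            supplier   TEXT NOT NULL,
            product_id TEXT NOT NULL,
            price      REAL NOT NULL,
            PRIMARY KEY (seen_on, supplier, product_id)
        )
        """
    )
    return conn


def record_price(conn: sqlite3.Connection, seen_on: date, supplier: str,
                 product_id: str, price: float) -> None:
    # INSERT OR REPLACE keeps one row per day, supplier and product,
    # so re-running the morning job does not duplicate history.
    conn.execute(
        "INSERT OR REPLACE INTO price_history VALUES (?, ?, ?, ?)",
        (seen_on.isoformat(), supplier, product_id, price),
    )
    conn.commit()


def price_series(conn: sqlite3.Connection, supplier: str,
                 product_id: str) -> list[tuple[str, float]]:
    """Return (date, price) pairs in date order for one supplier and product."""
    cur = conn.execute(
        "SELECT seen_on, price FROM price_history "
        "WHERE supplier = ? AND product_id = ? ORDER BY seen_on",
        (supplier, product_id),
    )
    return cur.fetchall()
```

Keying the table on date, supplier and product is what makes the history "dated": each morning adds a new row per product instead of overwriting yesterday's price.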

Logins and anti-bot handling

Each wholesaler sits behind a login, and some actively discourage automated access. The scraper authenticates like a real session, reuses its cookies, paces its requests with jitter, and fails loudly. If a login flow changes, the run stops and notifies rather than silently scraping a logged-out error page and writing garbage into the dataset. That single discipline, fail loud rather than fail quiet, prevents the most expensive class of error.
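Two of those habits, pacing with jitter and failing loud on a logged-out page, are small enough to sketch. The function names and the marker strings below are hypothetical; which markers identify a real session depends entirely on the wholesaler's pages.

```python
import random


class LoggedOutError(RuntimeError):
    """Raised when a fetched page does not look like an authenticated session."""


def jittered_delay(base_seconds: float, jitter_fraction: float = 0.3,
                   rng=random) -> float:
    """Return base_seconds plus or minus up to jitter_fraction of it."""
    spread = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-spread, spread)


def ensure_logged_in(html: str, logged_in_marker: str,
                     login_form_marker: str) -> str:
    """Return the page unchanged, or raise so the run stops and notifies.

    logged_in_marker: text only present when authenticated (assumed, e.g. a logout link).
    login_form_marker: text only present on the login page (assumed, e.g. a password field).
    """
    if login_form_marker in html or logged_in_marker not in html:
        raise LoggedOutError("page does not look like an authenticated session")
    return html
```

In use, the caller would sleep for jittered_delay(...) seconds between requests and pass every response through ensure_logged_in before parsing, so a changed login flow stops the run instead of producing rows.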

Normalisation is where the value hides

The hardest part was not fetching pages but matching the same product across suppliers that use different names, different codes, and different ways of describing the same item. "Cheapest source for product X" is only a question the data can answer if X means the same thing in every catalogue.

So a normalisation layer maps every supplier's catalogue onto one shared product schema, with a canonical identity per product. This is genuinely hard, full of edge cases and near-duplicates, and it is exactly the kind of work no off-the-shelf tool does for you. It is also where most of the value was created: once products are comparable, the comparison is trivial; until they are, no amount of scraping helps.
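A heavily simplified sketch of the idea, assuming a two-step lookup: first a known mapping from a supplier's own code to the canonical product, then a fallback on a cleaned-up product name. The ProductIndex class, the cleaning rules and the sample names are all illustrative assumptions; the real matching rules handle far more edge cases than this.

```python
import re
import unicodedata
from typing import Optional


def normalise_name(raw: str) -> str:
    """Lowercase, strip accents and punctuation, and join numbers to units."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    text = re.sub(r"(\d+)\s*(mg|ml|g)\b", r"\1\2", text)  # "500 mg" -> "500mg"
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())


class ProductIndex:
    """Maps supplier-specific codes and names onto one canonical product id."""

    def __init__(self) -> None:
        self._by_code: dict[tuple[str, str], str] = {}
        self._by_name: dict[str, str] = {}

    def register(self, canonical_id: str, supplier: str,
                 supplier_code: str, name: str) -> None:
        self._by_code[(supplier, supplier_code)] = canonical_id
        # First registration wins, so a later near-duplicate cannot steal a name.
        self._by_name.setdefault(normalise_name(name), canonical_id)

    def resolve(self, supplier: str, supplier_code: str,
                name: str) -> Optional[str]:
        """Return the canonical id, or None when the row needs human review."""
        hit = self._by_code.get((supplier, supplier_code))
        if hit is not None:
            return hit
        return self._by_name.get(normalise_name(name))
```

Returning None rather than guessing is the point of the sketch: an unmatched product is left out of the comparison until someone confirms what it is, which is safer than comparing prices for two different items.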

Validation before anyone sees a number

Before any row is stored, it is checked. Is the price within a sane range for that product? Is the product still recognised, or did the structure shift so we are reading the wrong field? Did the page actually load real content? Anything suspicious is flagged, not silently trusted.
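A per-row check along these lines might look like the sketch below. The PriceBounds type and the idea of keeping a precomputed sane range per product are assumptions for illustration; how the ranges are derived (for example from the dated history) is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriceBounds:
    """Sane price range for one product (hypothetical, e.g. derived from history)."""
    low: float
    high: float


def check_row(product_id: Optional[str], price: Optional[float],
              bounds: dict[str, PriceBounds]) -> list[str]:
    """Return the reasons a row is suspicious; an empty list means it passes."""
    if product_id is None or product_id not in bounds:
        # Often a sign the page structure shifted and the wrong field was read.
        return ["unrecognised product"]
    if price is None:
        return ["missing price"]
    b = bounds[product_id]
    # Written as a positive range test so NaN also fails the check.
    if not (b.low <= price <= b.high):
        return [f"price {price} outside [{b.low}, {b.high}]"]
    return []
```

Returning reasons instead of a bare True or False keeps flagged rows explainable: whoever reviews them can see why a number was held back.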

There is also a volume check: if a wholesaler that normally returns tens of thousands of products suddenly returns a few hundred, something broke upstream, and the run halts rather than overwriting good history with a broken snapshot. This is the difference between a demo and a tool people make purchasing decisions on.
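The volume check can be sketched in a few lines. The comparison against the median of recent runs and the 50% threshold are assumptions chosen for the example, not the values used in production.

```python
import statistics


class VolumeDropError(RuntimeError):
    """Raised when a supplier returns far fewer products than usual."""


def check_volume(supplier: str, todays_count: int,
                 recent_counts: list[int], min_ratio: float = 0.5) -> None:
    """Halt the run if today's row count collapses relative to recent runs."""
    if not recent_counts:
        return  # no baseline yet, nothing to compare against
    baseline = statistics.median(recent_counts)
    if todays_count < baseline * min_ratio:
        raise VolumeDropError(
            f"{supplier}: {todays_count} products today, "
            f"typical recent count is {baseline}"
        )
```

Raising an exception, rather than logging a warning, is what stops the broken snapshot from being written over good history.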

Data is the new oil. It's valuable, but if unrefined it cannot really be used.

Clive Humby · Data science pioneer

That refining, the normalisation and validation, is exactly what turned a pile of scraped numbers into decisions worth real money.

The result: −25% costs, every day, automatically

Once the system was running, the difference was not subtle.

Before: hand-checked sample, ~3% of catalogue

After: validated daily pipeline, 100% of 100k+

What the buying team could actually act on, before and after. Coverage you can trust changes the decision, not just the dashboard.

100k+ products were monitored daily instead of a hand-picked sample, so the long tail stopped being invisible.
Purchasing costs dropped 25%, driven by always buying from the cheapest reliable source and spotting trends in the dated history that a manual process could never see.
The morning comparison landed in Excel before the team started work, with alerts when prices moved unusually, turning a daily chore into a glance.

The buyers stopped doing data entry and went back to making decisions, now backed by complete, fresh, validated numbers instead of a stale spreadsheet.
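For a sense of what the side-by-side comparison looks like as data, here is a small sketch that pivots offers into one row per product and one column per supplier. It writes CSV, which Excel opens directly; that format, the function names and the sample suppliers are assumptions for the example, and the real export into the buyers' workbook may work differently.

```python
import csv
import io


def comparison_table(
    offers: list[tuple[str, str, float]],
) -> tuple[list[str], list[list[str]]]:
    """Pivot (product_id, supplier, price) offers into a side-by-side table."""
    suppliers = sorted({supplier for _, supplier, _ in offers})
    by_product: dict[str, dict[str, float]] = {}
    for product_id, supplier, price in offers:
        by_product.setdefault(product_id, {})[supplier] = price

    header = ["product_id", *suppliers, "cheapest"]
    rows = []
    for product_id in sorted(by_product):
        prices = by_product[product_id]
        cheapest = min(prices, key=prices.get)
        # Leave the cell empty when a supplier does not list the product.
        cells = [f"{prices[s]:.2f}" if s in prices else "" for s in suppliers]
        rows.append([product_id, *cells, cheapest])
    return header, rows


def to_csv(header: list[str], rows: list[list[str]]) -> str:
    """Serialise the table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()
```

The "cheapest" column is what turns the table into a glance: the buyer reads one cell per product instead of comparing a row of numbers.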

Where the 25% actually came from

A 25% reduction sounds dramatic, so it is worth being precise about how a price scraper produces it, because it is not magic and it is not just "find the cheapest price." It comes from several compounding effects that only a complete, historical dataset makes possible:

The long tail stopped being invisible. Manual checking covered the obvious, high-volume products. The scraper covered everything, including thousands of low-attention items where small overpayments quietly added up across the whole catalogue.
Timing improved. With fresh data every morning, buying decisions matched the current cheapest reliable source instead of last week's. Prices move; acting on stale numbers leaves money on the table constantly.
History exposed patterns. A dated record of every price made trends visible: which suppliers cycle their pricing, when discounts appear, which products are volatile. You cannot see a pattern in a snapshot, only in a history, and the history is what turned reactive buying into informed buying.
Errors disappeared. Manual copy-paste mistakes that previously turned good deals into bad ones simply stopped, because no human was transcribing numbers any more.

No single one of these is 25%. Together, compounding across 100,000 products every day, they are. That is the difference between "a scraper that finds prices" and "a system that changes how purchasing decisions get made."

The alerting that makes it actionable

A daily export is useful; an export plus alerts is actionable. The system does not just publish numbers, it flags the ones that matter: a product whose price dropped sharply at one supplier, a sudden stock-out, a price that moved outside its historical range. The buyers do not have to scan 100,000 rows looking for opportunities; the opportunities surface themselves.

This is a small thing technically and a large thing in practice. It is the difference between a tool people could use and a tool people do use, every morning, because it points them straight at the decisions worth making.
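The three alert types named above (a sharp drop, a stock-out, a move outside the historical range) can be sketched as one function over a product's price history. The Alert type, the function name and the 15% drop threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    product_id: str
    supplier: str
    kind: str    # "stock_out", "sharp_drop" or "out_of_range"
    detail: str


def detect_alerts(product_id: str, supplier: str, history: list[float],
                  today: Optional[float], in_stock: bool,
                  drop_threshold: float = 0.15) -> list[Alert]:
    """Compare today's observation with the dated history (oldest first)."""
    alerts: list[Alert] = []
    if not in_stock:
        alerts.append(Alert(product_id, supplier, "stock_out", "no longer available"))
    if today is None or not history:
        return alerts

    previous = history[-1]
    if previous > 0 and (previous - today) / previous >= drop_threshold:
        alerts.append(Alert(product_id, supplier, "sharp_drop",
                            f"{previous} -> {today}"))
    if today < min(history) or today > max(history):
        alerts.append(Alert(product_id, supplier, "out_of_range",
                            f"{today} outside [{min(history)}, {max(history)}]"))
    return alerts
```

Note that the range check only works because the history exists: with a single snapshot there is nothing to be "outside" of.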

What it takes to keep a system like this alive

A scraper like this is never "done." The sources change, and that is the job, not a bug:

Layout changes break crawlers. Monitoring catches them within a day, not a quarter, so a broken source never quietly poisons weeks of decisions.
New wholesalers can be added as another normalised source without touching the comparison logic, because the schema is shared.
Validation rules get tighter over time as new edge cases show up in real data. Every weird record that slips through becomes a new rule.

This is precisely why the Run phase exists, and why it is a retainer rather than a one-off. The value is recurring, so the relationship is too.
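The claim that a new wholesaler can be added without touching the comparison logic follows from the shared schema, and a short sketch shows why. The names Offer, SupplierSource and StaticSource are hypothetical, and "reliable" is reduced here to "in stock", which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Offer:
    """One supplier's price for a canonical product (the shared schema)."""
    product_id: str
    supplier: str
    price: float
    in_stock: bool


class SupplierSource(Protocol):
    """Anything that can yield normalised offers counts as a source."""
    name: str

    def offers(self) -> Iterable[Offer]: ...


@dataclass
class StaticSource:
    """Stand-in for a real scraper, used only to exercise the comparison."""
    name: str
    rows: list[Offer]

    def offers(self) -> Iterable[Offer]:
        return self.rows


def cheapest_available(sources: Iterable[SupplierSource]) -> dict[str, Offer]:
    """Pick the cheapest in-stock offer per product across all sources."""
    best: dict[str, Offer] = {}
    for source in sources:
        for offer in source.offers():
            if not offer.in_stock:
                continue
            current = best.get(offer.product_id)
            if current is None or offer.price < current.price:
                best[offer.product_id] = offer
    return best
```

Adding a wholesaler then means writing one more class that yields Offer rows and appending it to the list of sources; cheapest_available itself does not change.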

How I would approach the same problem for you

If you have a version of this (prices, stock, listings, or competitor data spread across sources you check by hand), the path is the same shape every time:

1. Map it. A short audit to find where the cost actually leaks and which sources are worth automating. See the full process.
2. Prove it. A pilot on your real data, fast, so you see the value before committing a real budget.
3. Build and run it. The production pipeline, validated and monitored, then kept healthy as the sources change.


Want the same for your data?

If you are checking prices, stock, or competitor data by hand today, there is almost always a version of this that runs while you sleep. That is the kind of project I take.

Read the deeper web scraping field guide, see the related Gemini.pl tracker covering 100k+ products, explore the web scraping and data pipelines service, understand the pricing model, or tell me what you need to collect.


