A client in the retail analytics space needed real-time price intelligence across 20 major e-commerce platforms. The challenge: dynamic JS-rendered pages, aggressive anti-bot systems, and a need to process over 500,000 product listings every day.
What I Built
I designed a distributed scraping pipeline using Scrapy for the core framework and Playwright for JavaScript-heavy pages. Celery workers manage job queuing, and a PostgreSQL database stores the full price history with indexed lookups for instant comparison queries.
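At this volume, listings have to be partitioned across many workers. One simple way to do that is hash-based sharding of listing URLs onto a fixed set of worker queues, which keeps the same product on the same worker across runs. This is an illustrative sketch, not the production routing logic, and the queue names and shard count are made up for the example:

```python
import hashlib

NUM_QUEUES = 8  # illustrative; the real deployment's worker count differs


def queue_for(url: str) -> str:
    """Deterministically map a listing URL to a worker queue.

    Hashing the URL (rather than round-robin) pins each product to one
    worker, which makes per-site rate limiting and caching easier.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % NUM_QUEUES
    return f"scrape-queue-{shard}"


# Same URL always lands on the same queue:
assert queue_for("https://example.com/p/123") == queue_for("https://example.com/p/123")
```

In a Celery setup, the returned queue name would be passed as the `queue` argument when dispatching the scrape task.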
Key Features
- Automatic change detection: records are stored only when price or availability changes, keeping the DB lean
- Proxy rotation with residential IPs to avoid rate limits and blocks
- Scheduled daily runs via Celery Beat with retry logic on failure
- REST API layer so the client dashboard can query price trends over any time range
Results
The system has been running in production for 8+ months with 99.7% uptime, processing roughly 520,000 listings per 24-hour cycle. The client now uses it as the backbone of their competitive pricing strategy.