Recently, I built an industrial-scale web scraper. Here's what I learned.
1. Why build a scalable scraper/crawler?
- Google's primary product (its search engine) is powered by web scrapers & crawlers extracting data from the internet at an unfathomable scale.
- OpenAI's capability (and willingness) to access data using scrapers & crawlers at internet-wide scale is what enabled them to build (and continually improve) ChatGPT.
- Unlike last decade, intelligence is now something you can build, use, and sell. The one catch: you need an immense amount of a single resource to do it, and that resource is a hell of a lot of data.
2. Using Chromium programmatically is helpful (I chose Puppeteer)
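To make this concrete, here's a minimal sketch of driving headless Chromium through Puppeteer. The URL is a placeholder, and the launch options are just one sensible configuration, not the only one:

```typescript
import puppeteer from "puppeteer";

async function scrapeTitle(url: string): Promise<string> {
  // Launch the headless Chromium build that Puppeteer bundles.
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // "networkidle2" waits until network activity settles, so
    // JavaScript-rendered content has a chance to load.
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    await browser.close();
  }
}

// Placeholder target; swap in whatever page you're scraping.
scrapeTitle("https://example.com").then((title) => console.log(title));
```

The win over plain HTTP requests is that you get the page as a real browser renders it, including content that only appears after client-side JavaScript runs.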
3. Industrial scale requires using proxies (I rotated between residential proxies)
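Here's a sketch of what rotation can look like with Puppeteer. The proxy URLs are placeholders for whatever pool your residential provider gives you, and a real setup would layer retry and health-check logic on top:

```typescript
import puppeteer from "puppeteer";

// Placeholder endpoints; a residential provider hands you a pool like this.
const PROXIES = [
  "http://proxy1.example.com:8000",
  "http://proxy2.example.com:8000",
  "http://proxy3.example.com:8000",
];

async function fetchThroughProxy(url: string, attempt: number): Promise<string> {
  // Rotate: using a different proxy per attempt spreads requests across IPs.
  const proxy = PROXIES[attempt % PROXIES.length];
  const browser = await puppeteer.launch({
    headless: true,
    // Standard Chromium flag that routes the browser's traffic through the proxy.
    args: [`--proxy-server=${proxy}`],
  });
  try {
    const page = await browser.newPage();
    // If your provider requires credentials, authenticate per page:
    // await page.authenticate({ username: "user", password: "pass" });
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

Launching a fresh browser per request is wasteful but keeps the example simple; in practice you'd pool browser instances and rotate proxies across them.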
4. Bots can find a site's rules in its robots.txt file (ask SEO experts about it)
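A rough sketch of honoring those rules before crawling is below. It assumes Node 18+ for the global fetch, the origin, path, and agent name are placeholders, and the parser is deliberately simplified (no wildcards, Allow precedence, or Crawl-delay) rather than a production-grade library:

```typescript
// Fetch a site's robots.txt and check whether a path is disallowed.
// Simplified: only User-agent and Disallow lines are honored.
async function isAllowed(origin: string, path: string, agent = "mybot"): Promise<boolean> {
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return true; // a missing robots.txt implies no restrictions

  let applies = false;
  for (const raw of (await res.text()).split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (key === "user-agent") {
      // Enter a rule group targeting everyone, or our bot specifically.
      applies = value === "*" || value.toLowerCase() === agent;
    } else if (applies && key === "disallow" && value && path.startsWith(value)) {
      return false;
    }
  }
  return true;
}

// Usage: only crawl the URL if the site's rules permit it.
isAllowed("https://example.com", "/private/").then((ok) => console.log(ok));
```

Robots.txt isn't legally binding, but respecting it keeps your crawler polite and off most blocklists.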
5. Bypassing CAPTCHAs, although ethically questionable, doesn't seem to be illegal to program your bot to do. (I explored GitHub Python programs capable of this to satisfy my own curiosity.)