Intro
When crawling the web nowadays, most pages are SPAs built with JS frameworks and libraries that render content dynamically. This means the easiest way to crawl them is with some kind of headless browser.
There are several options that I know of for doing this:
- Selenium
- Playwright
- Puppeteer
- ...more that I am probably unaware of
For the sake of simplicity I have chosen not to look at tools like Cypress, since the focus here is not testing but automation.
I will focus mostly on Puppeteer.
How it works
Puppeteer communicates with Chromium over the Chrome DevTools Protocol (CDP) via a WebSocket. In theory this is possible not only from Node.js but from any programming language; in practice, the most comprehensive implementation is the one the fine people working on Puppeteer have built. What that means is that you get access to most of the features Chrome exposes (cookies, storage, the DOM, screenshots and so on).
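As a minimal illustration of what that WebSocket access gives you, here is a sketch (assuming a recent Puppeteer version) that opens a raw CDP session for a page and calls a protocol method directly instead of going through the high-level API; the URL is just a placeholder.
import puppeteer from 'puppeteer'
// minimal sketch: talk to Chromium over a raw CDP session
puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage()
  await page.goto('https://example.com') // placeholder URL
  // open a CDP session for this page and call a protocol method directly
  const client = await page.createCDPSession()
  const { cookies } = await client.send('Network.getAllCookies')
  console.log(`Got ${cookies.length} cookies straight from the protocol`)
  await browser.close()
})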
The controversial case for web scraping
Web scraping is a bit of a controversial topic, and many websites tend to clamp down on automated browsing.
There is a wide range of methods to figure out whether a visitor is real or one of our machine overlords, from checking browser capabilities, cookies, and CAPTCHAs up to more advanced behavioral analysis.
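To make the capability-check idea concrete: one of the simplest signals a page can read is the navigator.webdriver flag, which automated browsers typically expose as true. A page-side check might look something like this sketch:
// runs inside the page: a trivial capability check a site might use
if (navigator.webdriver) {
  // headless automation usually leaves this flag set to true
  console.log('this visitor looks like a bot')
}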
Warning: past this point proceed at your own risk
A way to get an overview of the browser capabilities that some websites inspect (and may block you on if you don't play nice) is the check at bot.sannysoft.com, which the snippets below use.
The following snippet shows how Puppeteer's default profile scores on that check.
import puppeteer from 'puppeteer'
// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
console.log('Running tests..')
const page = await browser.newPage()
await page.goto('https://bot.sannysoft.com')
await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
await browser.close()
console.log(`All done, check the screenshot. ✨`)
})
Now, if you want the site to believe you are playing nice, you need to get this check passing, and for that you need a few more modules.
# with pnpm you can install the required packages as follows
pnpm i puppeteer-extra puppeteer-extra-plugin-stealth
And then do the same but using the stealth plugin.
import puppeteer from 'puppeteer-extra'
// add stealth plugin and use defaults (all evasion techniques)
import StealthPlugin from 'puppeteer-extra-plugin-stealth'
puppeteer.use(StealthPlugin())
// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
console.log('Running tests..')
const page = await browser.newPage()
await page.goto('https://bot.sannysoft.com')
await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
await browser.close()
console.log(`All done, check the screenshot. ✨`)
})
Conclusions
- When crawling, you should behave as a human would
- There is no way to fully pretend...but it is fun to try.
- Be polite and don't do this at a massive scale, so that you don't crash servers (a minimal sketch of this follows below)
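A minimal sketch of that last point (the URLs and the delay are placeholders): visit pages one at a time and pause between requests instead of hammering the server.
import puppeteer from 'puppeteer'
// placeholder list of pages to visit politely
const urls = ['https://example.com/a', 'https://example.com/b']
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))
puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage()
  for (const url of urls) {
    await page.goto(url)
    // ...scrape whatever you need here...
    await sleep(2000) // pause a couple of seconds before the next request
  }
  await browser.close()
})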