Hacking out an AI spider with Node

Artur Daschevici - May 11 - Dev Community

Prologue

Python is great for AI software, but I prefer Node.js for crawling with a headless browser. Python has some good libraries for parsing content, but in my opinion Puppeteer is slightly more powerful when it comes to headless browsers.
That said, I wanted to run an experiment and pull some data from a page without going through the tedious process of inspecting every element of interest and discarding the fluff.
Another nice thing about Puppeteer is that you don't need to reverse engineer the requests going into the page, figure out cookies, and so on.

Why?

Web/data scraping is pretty cool: it gives you the power to combine different data sources and devise interesting ways to draw conclusions from them. It is even more interesting these days with ChatGPT; you can simply dump some data on it and ask what it can do with it, or whether it can extract usable information from it. That is, in fact, quite nice.

What?

We're going to look at a listing site for rentals in Italy, picked more or less at random, mostly to see if the approach works.

Let's plan this out:

  • grab the content
  • identify the listings wrapper and grab the innerHTML of it
  • pass in the HTML to gpt-4-turbo via the API and construct a dialogue with it to extract the data we are looking for

How?

We're going to install a few dependencies:

# we like speed so while npm is nice, pnpm is faster
npm i -g pnpm
# puppeteer-extra wraps puppeteer, so we install both
pnpm i openai puppeteer puppeteer-extra dotenv
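Since dotenv is on the list, the OpenAI API key can live in a .env file next to the script instead of being hard-coded. A minimal setup, assuming the standard OPENAI_API_KEY variable that the openai client picks up (the key value below is a placeholder):

# .env
OPENAI_API_KEY=sk-...

Loading it is then a single import "dotenv/config"; at the top of the script, and the openai client will read the key from process.env.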

Then figure out the main listing element, which at the time of writing was something easy like .in-realEstateResults.
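A quick sanity check in the browser DevTools console confirms the selector before writing any Node code (the selector is specific to this site and will likely change over time):

// run in the DevTools console on the listing page
document.querySelector(".in-realEstateResults");
// should return the wrapper element rather than null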

A bit of fumbling through the docs and you get a working script for loading the page and grabbing the innerHTML of the listing element, like so:

  // at the top of the file: import puppeteer from "puppeteer-extra";
  // listingUrl is the URL of the listing page you want to scrape (not shown here)
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
  );
  await page.goto(listingUrl, { waitUntil: "networkidle2" });
  // grab the results wrapper and pull out its innerHTML
  const element = await page.$(".in-realEstateResults");
  const innerHtml = await page.evaluate(
    (element) => element.innerHTML,
    element,
  );
  await browser.close();

The openai library API is quite straightforward, and the docs offer quite a few examples too. Each response contains the model's output as well as some metrics about the token usage for your specific query.

  // at the top of the file:
  // import "dotenv/config";
  // import OpenAI from "openai";
  // const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

  const response = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: `Can you extract the property prices out of this
            html and send me the results? The text is in italian so
            you should translate that. Can you send me the output as JSON?`,
          },
          {
            // propertyInfoHtml is the innerHTML grabbed in the previous step
            type: "text",
            text: propertyInfoHtml,
          },
        ],
      },
    ],
  });
  console.log(response.choices[0]);
  console.log(response.usage);

Another extremely cool thing is that you can simply tell the model to extract the data as JSON, and what you have effectively done is turn HTML into a JSON API. You basically have a magic box that you can tell what you want extracted from some structure. The same flexible piece of code can pull data out of various HTML pages with different selectors, so there's no more hassle figuring that stuff out, and the output is quite well structured and needs minimal post-processing.

[
    {
        "price": "€ 115.000",
        "description": "Bilocale via Felice Casorati 33, Borgo Venezia, Verona",
        "rooms": "2 locali",
        "size": "60 m²",
        "bathroom": "1 bagno",
        "floor": "Piano 1",
        "elevator": "Ascensore",
        "balcony": "Balcone"
    },
    {
        "price": "€ 120.000",
        "description": "Trilocale via Arnaldo Da Brescia, 27, Porto San Pancrazio, Verona",
        "rooms": "3 locali",
        "size": "77 m²",
        "bathroom": "1 bagno",
        "floor": "Piano R",
        "elevator": "No Ascensore",
        "cellar": "Cantina"
    },
...
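If you want to skip the string wrangling entirely, gpt-4-turbo also supports the API's JSON mode, and the result can be parsed straight into an object. A minimal sketch, assuming the same openai client and propertyInfoHtml from above:

  const response = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    // forces the model to return valid JSON (the prompt must mention JSON)
    response_format: { type: "json_object" },
    messages: [
      {
        role: "user",
        content: `Extract the property listings from this HTML as JSON with a "listings" array: ${propertyInfoHtml}`,
      },
    ],
  });
  const listings = JSON.parse(response.choices[0].message.content);
  console.log(listings);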

How much though?

This processing can quickly get out of hand even at the current low pricing, so you want to cache results. You also need to pay attention to the context window of the model you are using. The response.usage logged on the last line tells you how many tokens the query used.

 { prompt_tokens: 30337, completion_tokens: 972, total_tokens: 31309 }

This works without a problem with gpt-4-turbo, but smaller models, for example ones you can host locally, might struggle with a prompt this size. And there are pages out there that are a lot larger than this one.
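As for caching, nothing fancy is needed: a file keyed on a hash of the HTML you send avoids paying for the same extraction twice. A minimal sketch, assuming the OpenAI call is wrapped in a function called extractListings (that name and the cache directory are made up for illustration):

import { createHash } from "node:crypto";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";

const CACHE_DIR = "./.cache";

async function cachedExtract(html) {
  mkdirSync(CACHE_DIR, { recursive: true });
  // key the cache on the exact HTML we would send to the model
  const key = createHash("sha256").update(html).digest("hex");
  const file = `${CACHE_DIR}/${key}.json`;
  if (existsSync(file)) {
    return JSON.parse(readFileSync(file, "utf8"));
  }
  const result = await extractListings(html); // the OpenAI call from above
  writeFileSync(file, JSON.stringify(result));
  return result;
}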

Trimming the fat

If you think about it, a big part of HTML is not structure-defining: it is either display information (classes, styles) or interaction hooks (event handlers and the like). The interesting part is that none of it has any real value for the data, which is what we are after.
What if we removed all of that? Crazy idea, right?

pnpm i jsdom

So let's add an intermediate step before sending the html out to OpenAI:

    // at the top of the file: import { JSDOM } from "jsdom";
    const dom = new JSDOM(innerHtml);
    spinner.text = "Cleaning up HTML"; // optional progress spinner (e.g. ora)
    const document = dom.window.document;
    // strip every attribute (classes, styles, data-*, event handlers)
    // from every element; only the bare tag structure and text remain
    const elements = document.querySelectorAll("*");
    elements.forEach((element) => {
      Array.from(element.attributes).forEach((attribute) => {
        element.removeAttribute(attribute.name);
      });
    });
    const cleanedHtml = dom.serialize();

And presto, the metrics of the trimmed down version:

{ prompt_tokens: 7790, completion_tokens: 586, total_tokens: 8376 }

Pretty handy... less than a third of the original token count.
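If you want to trim even further, the same jsdom document can also drop nodes that never carry listing data, such as script, style, and svg tags. A small sketch reusing the document from the cleanup step above (slimmerHtml is just an illustrative name):

    // remove elements that contain no listing data at all
    document.querySelectorAll("script, style, svg, noscript").forEach((node) => {
      node.remove();
    });
    const slimmerHtml = dom.serialize();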

Conclusions

  • the gpt-4-turbo APIs are definitely worth exploring; 15 minutes of poking around does not do them justice
  • you can tweak the content and let go of everything that is not important for your query
  • caching might work in some cases depending on what data you want to pull
  • slimming down the context actually seems to make quite a difference