How to Scrape Sitemaps to Discover Scraping Targets

Scrapfly - Apr 10 '23 - Dev Community


To start - what are sitemaps? Sitemaps are web files that list the locations of web objects like product pages, articles, etc. for web crawlers. They're mostly used to tell search engines what to index, though in web scraping we can use them to discover targets to scrape.

In this tutorial, we'll take a look at how to discover sitemap files and scrape them using Python and JavaScript. In this guide we'll cover:

  • Finding the sitemap location.
  • How to understand and navigate sitemaps.
  • How to parse sitemap files using XML parsing tools in Python and JavaScript.

For this, we'll be using Python or JavaScript with a few community packages and some real-life examples. Let's dive in!

Why Scrape Sitemaps?

Scraping sitemaps is an efficient way to discover targets listed on a website, be it product pages, blog posts or any other web object. Sitemaps are usually XML documents that list URLs by category in batches of up to 50,000. Sitemaps are also often gzip-compressed, making them a low-bandwidth way to discover page URLs.

So, instead of scraping a website's search or directory pages, which can span thousands of HTML documents, we can scrape sitemap files and get the same results in a fraction of the time and bandwidth. Additionally, since sitemaps are created specifically for web crawlers, the likelihood of getting blocked is much lower.
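One note on the compression point: if a site serves its sitemaps as standalone .xml.gz files (rather than relying on the Content-Encoding: gzip header, which HTTP clients decompress transparently), the payload must be decompressed manually. A minimal Python sketch - the URL here is a placeholder, not a real sitemap:

import gzip
import httpx

# placeholder URL - substitute a real .xml.gz sitemap location
response = httpx.get("https://example.com/sitemap.xml.gz")
xml_text = gzip.decompress(response.content).decode("utf-8")
print(xml_text[:200])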

Where to find Sitemaps?

Sitemaps are usually located at the root of the website as a /sitemap.xml file. For example, this blog's sitemap can be found at scrapfly.io/sitemap.xml.

However, there's no strictly standard location, so another way to find the sitemap is to check the standard /robots.txt file, which provides instructions for web crawlers.

For example, Scrapfly's /robots.txt file contains the following line:

Sitemap: https://scrapfly.io/sitemap.xml


Many programming languages have robots.txt parsers built in; otherwise, it's as easy as retrieving the file content and finding the lines with the Sitemap: prefix:

# Python:
# built-in robots.txt parser (site_maps() requires Python 3.8+):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://scrapfly.io/robots.txt')
rp.read()
print(rp.site_maps())
# ['https://scrapfly.io/blog/sitemap-posts.xml', 'https://scrapfly.io/sitemap.xml']

# or retrieve and scan the file manually with httpx:
import httpx

response = httpx.get('https://scrapfly.io/robots.txt')
for line in response.text.splitlines():
    if line.strip().startswith('Sitemap:'):
        print(line.split('Sitemap:')[1].strip())

// JavaScript:
import axios from 'axios';

const response = await axios.get('https://scrapfly.io/robots.txt');
// scan each line of robots.txt for the Sitemap: directive
const sitemaps = [];
const lines = response.data.split('\n');
for (const line of lines) {
    const trimmedLine = line.trim();
    if (trimmedLine.startsWith('Sitemap:')) {
        sitemaps.push(trimmedLine.split('Sitemap:')[1].trim());
    }
}
console.log(sitemaps);


Scraping Sitemaps

Sitemaps are XML documents with specific, simple structures. For example:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//scrapfly.io/blog/sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://scrapfly.io/blog/web-scraping-with-python/</loc>
    <lastmod>2023-04-04T06:20:35.000Z</lastmod>
    <image:image>
      <image:loc>https://scrapfly.io/blog/content/images/web-scraping-with-python_banner.svg</image:loc>
      <image:caption>web-scraping-with-python_banner.svg</image:caption>
    </image:image>
  </url>
</urlset>


When it comes to web scraping, we're mostly interested in the <urlset> values, where each <url> contains a location and some optional metadata like:

  • <loc> - URL location of the resource.
  • <lastmod> - Timestamp when the resource was modified last time.
  • <changefreq> - How often the resource is expected to change (always, hourly, daily, weekly, monthly, yearly or never).
  • <priority> - resource indexing importance (0.0 - 1.0). This is mostly used by search engines.
  • <image:image> - A list of images associated with the resource.
  • <image:loc> - The URL of the image (e.g. main product image or article feature image).
  • <image:caption> - A short description of the image.
  • <image:geo_location> - The geographic location of the image.
  • <image:title> - The title of the image.
  • <image:license> - The URL of the license for the image.
  • <video:video> - similar to <image:image> but for videos.

To parse sitemaps we need an XML parser which we can use to find all <url> elements and extract the values.


In Python, we can use the lxml or parsel packages to parse XML documents with XPath or CSS selectors.

from parsel import Selector
import httpx

response = httpx.get("https://scrapfly.io/blog/sitemap-posts.xml")
selector = Selector(response.text)
# find every <url> element and extract its location and last modification date
for url in selector.xpath('//url'):
    location = url.xpath('loc/text()').get()
    modified = url.xpath('lastmod/text()').get()
    print(location, modified)
    # https://scrapfly.io/blog/how-to-scrape-nordstrom/ 2023-04-06T13:35:47.000Z

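The image: fields listed earlier live in a separate XML namespace, which plain XPath won't match without registration. One way to extract them is parsel's XML mode with explicit namespace registration - a minimal sketch against the same Scrapfly sitemap (the sm prefix is an arbitrary name we register here; it's not part of the sitemap itself):

from parsel import Selector
import httpx

response = httpx.get("https://scrapfly.io/blog/sitemap-posts.xml")
selector = Selector(response.text, type="xml")
# in XML mode, namespaces must be registered before they can be queried
selector.register_namespace("sm", "http://www.sitemaps.org/schemas/sitemap/0.9")
selector.register_namespace("image", "http://www.google.com/schemas/sitemap-image/1.1")
for url in selector.xpath("//sm:url"):
    location = url.xpath("sm:loc/text()").get()
    image = url.xpath("image:image/image:loc/text()").get()  # None if the entry has no image
    print(location, image)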

In JavaScript (as well as Node.js and TypeScript), the most popular parsing library is cheerio, which can handle XML documents with CSS selectors:

// npm install axios cheerio
import axios from 'axios';
import * as cheerio from 'cheerio';

const response = await axios.get('https://scrapfly.io/blog/sitemap-posts.xml');
// xmlMode makes cheerio parse the document as XML rather than HTML
const $ = cheerio.load(response.data, { xmlMode: true });
$('url').each((_, urlElem) => {
    const location = $(urlElem).find('loc').text();
    const modified = $(urlElem).find('lastmod').text();
    console.log(location, modified);
});


For bigger websites, the sitemap limit of 50,000 URLs is often not enough, so the URLs are split across multiple sitemap files listed in a sitemap hub (an index file). To scrape these, we need a bit of recursion.

For example, let's take a look at the StockX sitemap hub. To start, we can see that its robots.txt file contains the following line:

Sitemap: https://stockx.com/sitemap/sitemap-index.xml


This index file contains multiple sitemaps:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
  <loc>https://stockx.com/sitemap/sitemap-0.xml</loc>
</sitemap>
<sitemap>
  <loc>https://stockx.com/sitemap/sitemap-1.xml</loc>
</sitemap>
<sitemap>
  <loc>https://stockx.com/sitemap/sitemap-2.xml</loc>
</sitemap>
<sitemap>
  <loc>https://stockx.com/sitemap/sitemap-3.xml</loc>
</sitemap>
<sitemap>
  <loc>https://stockx.com/sitemap/sitemap-4.xml</loc>
</sitemap>
<sitemap>
  <loc>https://stockx.com/sitemap/sitemap-5.xml</loc>
</sitemap>
</sitemapindex>


Sometimes, sitemap hubs split sitemaps by category or location, which can help us target specific types of pages. Though to scrape a hub, all we have to do is parse the index file and scrape each sitemap file it lists:

Python

from parsel import Selector
import httpx

index_response = httpx.get("https://stockx.com/sitemap/sitemap-index.xml")
index_selector = Selector(index_response.text)
# retrieve every sitemap listed in the index and scrape each one
for sitemap_url in index_selector.xpath("//sitemap/loc/text()").getall():
    response = httpx.get(sitemap_url)
    selector = Selector(response.text)
    for url in selector.xpath('//url'):
        location = url.xpath('loc/text()').get()
        # this sitemap provides <changefreq> instead of a modification date
        change_frequency = url.xpath('changefreq/text()').get()
        print(location, change_frequency)


JavaScript

import axios from 'axios';
import * as cheerio from 'cheerio';

const response = await axios.get('https://stockx.com/sitemap/sitemap-index.xml');
const sitemapIndex = cheerio.load(response.data, { xmlMode: true });
// collect the sitemap URLs first so each request can be awaited sequentially
const sitemapUrls = sitemapIndex('loc').map((_, el) => sitemapIndex(el).text()).get();
for (const sitemapUrl of sitemapUrls) {
    const sitemapResponse = await axios.get(sitemapUrl);
    const $ = cheerio.load(sitemapResponse.data, { xmlMode: true });
    $('url').each((_, urlElem) => {
        const location = $(urlElem).find('loc').text();
        const changefreq = $(urlElem).find('changefreq').text();
        console.log(location, changefreq);
    });
}

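When a hub does split its sitemaps by category, we can filter the index before fetching anything. A minimal sketch of this idea in Python (StockX's sitemaps are numbered rather than named, so the "product" keyword below is purely illustrative):

from parsel import Selector
import httpx

response = httpx.get("https://stockx.com/sitemap/sitemap-index.xml")
selector = Selector(response.text)
sitemap_urls = selector.xpath("//sitemap/loc/text()").getall()
# hypothetical keyword filter - only fetch sitemaps whose URL mentions a category
product_sitemaps = [url for url in sitemap_urls if "product" in url]
print(product_sitemaps)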

Summary

In this quick introduction, we've taken a look at the most popular way to discover scraping targets - the sitemap system. To scrape it, we used an HTTP client (httpx in Python or axios in JavaScript) and an XML parser (parsel or cheerio).

Sitemaps are a great way to discover new scraping targets and get a quick overview of a website's structure. However, not every website provides them, and it's worth noting that sitemap data can be more dated than the website itself.
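One common way to work around staleness is to use the <lastmod> values to scrape only recently updated pages. A minimal sketch, using sample (location, lastmod) pairs shaped like the scraper output above:

from datetime import datetime, timedelta, timezone

# sample data as collected by the sitemap scrapers shown earlier
urls = [
    ("https://scrapfly.io/blog/how-to-scrape-nordstrom/", "2023-04-06T13:35:47.000Z"),
    ("https://scrapfly.io/blog/web-scraping-with-python/", "2023-04-04T06:20:35.000Z"),
]
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
recent = [
    (loc, lastmod)
    for loc, lastmod in urls
    # sitemap timestamps are ISO 8601; replace the Z suffix for fromisoformat()
    if datetime.fromisoformat(lastmod.replace("Z", "+00:00")) >= cutoff
]
print(recent)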
