Scraping and Parsing Sitemaps in Bash

By Evan Sangaline | August 1, 2018

A wise man once said that sitemaps are the window into a website’s soul, and I’m not inclined to disagree. Without a sitemap, a website is just a labyrinthian web of links between pages. It’s certainly possible to scrape sites by crawling those links, but things become much easier with a sitemap that lays out a site’s content in clear and simple terms. Sites which provide sitemaps are quite literally asking to be scraped; it’s a direct indication that the site operators intend for bots to visit the pages listed in the sitemaps.

Most web scraping libraries provide built-in mechanisms for parsing sitemaps and processing the listed pages. For example, Scrapy includes a generic SitemapSpider for this purpose, and simplecrawler automatically discovers resources from sitemaps. These are extremely useful once you’re at the stage of actually scraping a website, but it can also sometimes be useful to quickly parse a site’s sitemap to get an idea of the size of the website and the scope of the scraping endeavor at hand.

Sitemaps are basically just XML files which enumerate the pages available to scrape on a website, and they’re generally quite simple. Things can get slightly more complicated when sitemaps index additional sitemaps, or when they’re explicitly compressed using gzip, but they’re overall fairly straightforward to deal with. In fact, you can generally parse and extract them using standard command-line utilities without any need for specialized tools.

Let’s take the sitemap for Google Play as an example. That site has a lot of pages, and it has a particularly complex sitemap structure that involves both nested and compressed sitemaps. Still, even the Google Play sitemap can be parsed with a simple bash “one-liner.”

curl -N https://play.google.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    sed 's/\r$//g' |
    xargs -n1 curl -N |
    grep -oP '<loc>\K[^<]*' |
    xargs -n1 curl -N |
    gunzip |
    grep -oP '<loc>\K[^<]*' |
    gzip > \
    play-store-urls.txt.gz

OK, maybe “simple” is pushing it a little bit, but it’s really not that bad when you break it down step by step. And that’s exactly what we’ll do here: break this chain of piped commands up into easily understandable steps.

The Breakdown

The first step towards extracting a list of URLs from a site is to parse its robots.txt file, where the sitemap(s) will be listed. We’ll use curl here to download the file and print its contents to the terminal. The -N option disables curl’s output buffering, so content is printed as soon as it arrives rather than in delayed chunks.

curl -N https://play.google.com/robots.txt

There’s a bunch of boring stuff that gets printed out when you run this command, but the interesting part is the list of sitemaps.

# Sitemap files
Sitemap: https://play.google.com/sitemaps/sitemaps-index-0.xml
Sitemap: https://play.google.com/sitemaps/sitemaps-index-1.xml
Sitemap: https://play.google.com/sitemaps/sitemaps-index-2.xml

We can pipe the full output of the robots.txt file through sed in order to extract these initial sitemap URLs. We’ll use sed’s -n argument to suppress printing out each line automatically, and then check for a pattern of ^Sitemap: (.*)$ to find the sitemap URLs. The .* inside of the parentheses grabs all of the text where we expect the sitemap URL to be, and the parentheses themselves specify that this is a matching group. We’ll then use \1 to replace the entire line with the contents of this first matched group, and use sed’s p flag to print out the replacement.

curl -N https://play.google.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p'

Running this command will print out the list of sitemaps that we’ll want to download and nothing else.

https://play.google.com/sitemaps/sitemaps-index-0.xml
https://play.google.com/sitemaps/sitemaps-index-1.xml
https://play.google.com/sitemaps/sitemaps-index-2.xml
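The substitution itself can be exercised offline on a fabricated robots.txt line (the example.com URL below is invented for illustration):

```shell
# Only the line matching ^Sitemap: survives: the \(...\) group captures
# the URL, \1 replaces the whole line with it, and p prints the result.
printf 'User-agent: *\nSitemap: https://example.com/sitemap.xml\n' |
    sed -n 's/^Sitemap: \(.*\)$/\1/p'
```

The User-agent line is silently dropped because -n suppresses automatic printing and no substitution happens on it.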

We can then pipe these URLs into xargs to run curl again with each URL passed as an argument. The -n1 argument to xargs here simply tells it to execute curl separately for each URL that’s piped in.
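The effect of -n1 is easy to see offline with echo standing in for curl:

```shell
# xargs -n1 runs the command once per input token, so each line of
# input produces its own separate "fetching <url>" invocation.
printf 'url-one\nurl-two\n' | xargs -n1 echo fetching
```

Without -n1, xargs would instead pass as many URLs as possible to a single invocation of the command.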

curl -N https://play.google.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    xargs -n1 curl -N

The only problem is that when we do this, we end up with a somewhat cryptic error from curl.

curl: (3) Illegal characters found in URL
curl: (3) Illegal characters found in URL
curl: (3) Illegal characters found in URL

The reason that we get this error is that Google isn’t just using \n to indicate a new line in their sitemap files; they’re using \n for a new line and \r as a carriage return. I can only assume that this is because Google is running MS DOS on their servers, but, in any case, the carriage returns don’t play nicely with Unix utilities. If we use sed again to eliminate the \r carriage returns before passing them as arguments to curl, then we’ll be able to actually download and print the sitemap files.
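The stripping step can be reproduced offline with a hand-made CRLF line. Note that the \r escape in the pattern is a GNU sed feature; on BSD sed you would insert a literal carriage return into the expression instead.

```shell
# printf fabricates a Windows-style \r\n line ending; sed deletes the
# trailing \r so that only the clean URL remains for curl to consume.
printf 'https://example.com/sitemap.xml\r\n' | sed 's/\r$//'
```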

curl -N https://play.google.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    sed 's/\r$//g' |
    xargs -n1 curl -N

The contents of these sitemap files don’t include any additional carriage returns; they don’t even include newlines. We’re instead left with a mess of XML that’s a bit dense to parse by eye.

Sitemaps for Days

Newlines with carriage returns, or nothing at all. I guess there’s no in between with Google. If we clean this up manually, it looks more like this.

<sitemapindex>
    <sitemap>
        <loc>https://play.google.com/sitemaps/play_sitemaps_2018-06-29_1530331247-00000-of-67773.xml.gz</loc>
    </sitemap>
    <sitemap>
        <loc>https://play.google.com/sitemaps/play_sitemaps_2018-06-29_1530331247-00000-of-67781.xml.gz</loc>
    </sitemap>
    <sitemap>
        <loc>https://play.google.com/sitemaps/play_sitemaps_2018-06-29_1530331247-00001-of-67773.xml.gz</loc>
    </sitemap>
    <!--- And 135,551 more that I've removed for brevity. -->
</sitemapindex>

We can see here that these initial sitemaps don’t actually list pages on the site. They just list more sitemaps, and ones with a .gz suffix to boot. This indicates that the files are gzip compressed, and that we’ll therefore need to decompress them before accessing their contents. Before we get to that, let’s first extract this second round of sitemap URLs.

To do so, we’ll use grep’s -P argument to enable Perl-Compatible Regular Expression (PCRE) matching, and the -o argument to only output the regular expression matches. The regular expression that we’ll use is <loc>\K[^<]* which will match the new sitemap URLs from the sitemap indexes. The <loc> part indicates that the match should begin with an opening <loc> tag, and the \K corresponds to PCRE Keep and means that we shouldn’t include the preceding <loc> in the final match’s text. We then use [^<]* to indicate that the rest of the match should include everything leading up to the < character which will be the start of the closing </loc> tag.
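You can try the pattern out on a hand-written scrap of sitemap XML (with an invented URL) before pointing it at real data. The -P flag requires a grep built with PCRE support, such as GNU grep.

```shell
# \K discards the <loc> prefix from the reported match, and [^<]*
# consumes everything up to the < of the closing </loc> tag.
printf '<sitemap><loc>https://example.com/a.xml.gz</loc></sitemap>\n' |
    grep -oP '<loc>\K[^<]*'
```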

curl -N https://play.google.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    sed 's/\r$//g' |
    xargs -n1 curl -N |
    grep -oP '<loc>\K[^<]*'

This will eliminate all of the XML wrapping the secondary sitemap locations, and print out a cool 135,554 sitemap URLs that we’ll need to follow up on. The first few look something like this but, again, there are 135,554 of these sitemaps that we’ll still need to deal with.

https://play.google.com/sitemaps/play_sitemaps_2018-06-30_1530374467-00000-of-67795.xml.gz
https://play.google.com/sitemaps/play_sitemaps_2018-06-30_1530374467-00000-of-67802.xml.gz
https://play.google.com/sitemaps/play_sitemaps_2018-06-30_1530374467-00001-of-67795.xml.gz
https://play.google.com/sitemaps/play_sitemaps_2018-06-30_1530374467-00001-of-67802.xml.gz
https://play.google.com/sitemaps/play_sitemaps_2018-06-30_1530374467-00002-of-67795.xml.gz

The next step is to actually download these compressed sitemaps. We can use the xargs/curl combination here again to pass each URL to curl and print out its contents. The contents are gzip compressed though, so we’ll also need to use gunzip in order to extract the actual text content of the files. The gzip RFC specifies that gzipped members are simply concatenated, so we can pipe the compressed content of each file into a single gunzip command without needing to use xargs or anything else to handle them individually.
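That multi-member property is easy to verify locally:

```shell
# Two independently gzipped members, concatenated into one stream,
# come back out of gunzip as their concatenated plaintext.
{ printf 'first\n' | gzip; printf 'second\n' | gzip; } | gunzip
```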

curl -N https://play.google.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    sed 's/\r$//g' |
    xargs -n1 curl -N |
    grep -oP '<loc>\K[^<]*' |
    xargs -n1 curl -N |
    gunzip

This command will output another mess of XML with neither carriage returns nor new lines, but the formatted version looks something like this.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://play.google.com/store/books/collection/books_clusters_mrl_n_8CA46B8C_382D58C5_CE7BC9ED</loc>
    <xhtml:link rel="alternate" hreflang="en-ca" href="https://play.google.com/store/books/collection/books_clusters_mrl_n_8CA46B8C_382D58C5_CE7BC9ED"/>
    <xhtml:link rel="alternate" hreflang="fr-ca" href="https://play.google.com/store/books/collection/books_clusters_mrl_n_8CA46B8C_382D58C5_CE7BC9ED"/>
  </url>
  <url>
    <loc>https://play.google.com/store/books/collection/books_clusters_mrl_rt_40B9E3CD_A946DBCB_585EC943</loc>
    <xhtml:link rel="alternate" hreflang="es-pe" href="https://play.google.com/store/books/collection/books_clusters_mrl_rt_40B9E3CD_A946DBCB_585EC943"/>
    <xhtml:link rel="alternate" hreflang="qu-pe" href="https://play.google.com/store/books/collection/books_clusters_mrl_rt_40B9E3CD_A946DBCB_585EC943"/>
  </url>
  <url>
    <loc>https://play.google.com/store/books/collection/books_clusters_mrl_rt_A5F0A777_397BC59E_7532300C</loc>
    <xhtml:link rel="alternate" hreflang="en-nz" href="https://play.google.com/store/books/collection/books_clusters_mrl_rt_A5F0A777_397BC59E_7532300C"/>
    <xhtml:link rel="alternate" hreflang="mi-nz" href="https://play.google.com/store/books/collection/books_clusters_mrl_rt_A5F0A777_397BC59E_7532300C"/>
  </url>
  <!--- And *a lot* more. -->
</urlset>

We can see here that we’ve graduated past the stage of sitemap indexing, and that we’re finally dealing with actual site URLs. There are some secondary internationalization URLs here, but the main URLs are again contained within <loc> tags. This means that we can use the same grep -oP '<loc>\K[^<]*' command that we used earlier to extract them.

curl -N https://play.google.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    sed 's/\r$//g' |
    xargs -n1 curl -N |
    grep -oP '<loc>\K[^<]*' |
    xargs -n1 curl -N |
    gunzip |
    grep -oP '<loc>\K[^<]*'

This is enough to print out all of the URLs in Google Play’s plethora of sitemaps, but it’s a good idea to compress and store them on disk because the list of URLs will be gigantic. We’ll use gzip to perform the compression, and then simply redirect the output into a file of our own choosing.

curl -N https://play.google.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    sed 's/\r$//g' |
    xargs -n1 curl -N |
    grep -oP '<loc>\K[^<]*' |
    xargs -n1 curl -N |
    gunzip |
    grep -oP '<loc>\K[^<]*' |
    gzip > \
    play-store-urls.txt.gz

This gives us the full command that we started with, and will produce a file with a complete listing of URLs on Google Play. You can inspect the compressed contents by using the zless pager. Running zless play-store-urls.txt.gz will show you the decompressed contents, which should consist of a list of page URLs. The first handful should look something like this.

https://play.google.com/store/books/collection/books_clusters_mrl_o_nav_4DDEA8BE_18CF3FC0_111EF06D
https://play.google.com/store/books/collection/books_clusters_mrl_x_6810D36C_090EBB71_5339030B
https://play.google.com/store/books/collection/books_clusters_mrl_rt_3F080C6A_B914F37D_FF021EC3
https://play.google.com/store/books/collection/books_clusters_mrl_rt_B0B21206_9DC77E4F_2BCB716C
https://play.google.com/store/books/collection/books_clusters_mrl_rt_CEB95900_CD7F92B8_F8375BBD
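Because the result is an ordinary gzipped text file, the usual z-prefixed tools work on it directly; for instance, a quick line count or peek (assuming the download above ran to completion):

```shell
# zcat streams the decompressed URL list without writing it to disk.
zcat play-store-urls.txt.gz | wc -l         # total number of URLs
zcat play-store-urls.txt.gz | head -n 5     # peek at the first few
```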

Conclusion

Well, that’s all there is to it! Web scraping libraries certainly have their place, but sometimes all you need are some basic command-line utilities to get a quick overview of the content available on a website. This approach even scales to some of the biggest websites on the internet, such as Google Play. If you enjoy chaining together command-line utilities like this to accomplish seemingly complex tasks, then I highly recommend checking out the Advanced Bash-Scripting Guide. It’s nominally focused on scripting, but it’s really an excellent introduction to the Unix command-line in general.
