Web Scraping

Analyzing One Million robots.txt Files

The idea for this article actually started as a joke. We do a lot of web scraping here at Intoli, and we deal with robots.txt files, overzealous IP bans, and all that jazz on a daily basis. A while back, I was running into some issues with a site whose robots.txt file was completely inconsistent with its banning policies, and I suggested that we should do an article on analyzing robots.

Scraping User-Submitted Reviews from the Steam Store

This article was originally published as a guest post on ScrapingHub’s blog. ScrapingHub is the company that wrote Scrapy, which this article is about, so read on to see why they liked it! The Steam game store is home to more than ten thousand games and just shy of four million user-submitted reviews. While all kinds of Steam data are available through official APIs or bulk-downloadable data dumps, I could not find a way to download the full review dataset.

Finding Pareto Optimal Blogs on Hacker News

I’ve been doing a lot of technical writing recently and, with that experience, I’ve grown to more deeply appreciate the writing of others. It’s easy to take the effort behind an article for granted when you’ve grown accustomed to there being new high-quality content posted every day on Hacker News and Twitter. The truth is that a really good article can take days or more to put together, and it isn’t easy to write even one article that really takes off, let alone a steady stream of them.