Intoli Blog

The Red Tide and the Blue Wave: Gerrymandering as a Risk vs. Reward Strategy

As election day has approached, I’ve increasingly heard the phrase “blue wave” thrown around in news articles, forums, and even everyday discussions. The term is commonly understood to mean that high Democratic turnout in the midterms could lead to significant Republican losses in the House. It’s clearly true that unusually high voter turnout within a single party will help that party’s chances on election day, but there’s a bit more to it than that.

Continue reading

Performing Efficient Broad Crawls with the AOPIC Algorithm

This article explains how the Adaptive On-Line Page Importance Computation (AOPIC) algorithm works. AOPIC is useful for performing efficient broad crawls of large slices of the internet. The key idea behind the algorithm is that pages are crawled based on a continuously improving estimate of page importance. This lets the user of the algorithm spend the bulk of their limited bandwidth on the most important pages that their crawler encounters.
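As a rough illustration of the "continuously improving estimate" idea, here is a minimal Python sketch of the cash-and-history bookkeeping usually associated with OPIC-style crawling. It is a simplification under stated assumptions, not the article's implementation: the function name `opic_crawl`, the toy link graph, and the rule that a page without outlinks spreads its cash over all pages are all hypothetical, and the adaptive time window of full AOPIC is left out.

```python
from typing import Dict, List


def opic_crawl(links: Dict[str, List[str]], steps: int) -> Dict[str, float]:
    """Estimate page importance with simplified OPIC-style bookkeeping.

    `links` maps each page to the pages it links to. Every link target is
    assumed to also appear as a key (a closed toy graph, not a real crawl).
    """
    pages = list(links)
    # Every page starts with an equal share of one unit of "cash".
    cash = {page: 1.0 / len(pages) for page in pages}
    # History accumulates the cash a page held each time it was visited.
    history = {page: 0.0 for page in pages}

    for _ in range(steps):
        # Greedy policy: visit the page currently holding the most cash.
        page = max(pages, key=lambda p: cash[p])
        amount = cash[page]
        history[page] += amount
        cash[page] = 0.0

        # Split the cash evenly among outlinks. Assumption: a page with no
        # outlinks spreads its cash over every page instead.
        targets = links[page] or pages
        share = amount / len(targets)
        for target in targets:
            cash[target] += share

    # Importance estimate: a page's share of all cash seen so far.
    total = sum(history.values()) + sum(cash.values())
    return {page: (history[page] + cash[page]) / total for page in pages}


# Hypothetical three-page site used only for illustration.
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}
scores = opic_crawl(toy_links, steps=1000)
```

In this sketch the page with the most cash is always visited next, so pages that keep receiving cash from many inlinks are revisited more often, which is how the crawl budget drifts toward the pages estimated to be important.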

Continue reading

Breaking Out of the Chrome/WebExtension Sandbox

WebExtensions are a frequently underappreciated tool for the purposes of web scraping and browser automation. They provide an easy way to access an extremely powerful API that’s cross-browser compatible out of the box, and that API provides functionality that extends far beyond that of more specialized automation APIs like the Chrome DevTools Protocol or Firefox’s Marionette. For example, the WebExtensions API provides a mechanism for containerizing individual tabs. Selenium and Puppeteer can’t do that!

Continue reading

User-Agents — Generating random user agents using Google Analytics and CircleCI

If you’re in a hurry, you can head straight to the user-agents repository for installation and usage instructions! While web scraping, it’s usually a good idea to create traffic patterns consistent with those that a human user would produce. This of course means being respectful and rate-limiting requests, but it often also means concealing the fact that the requests have been automated. Doing so helps avoid getting blocked by overzealous DDoS protection services, and allows you to successfully scrape the data that you’re interested in while keeping site operators happy.

Continue reading

How F5Bot Slurps All of Reddit

In this guest post, Lewis Van Winkle talks about F5Bot, a free service that emails you when selected keywords are mentioned on Reddit, Hacker News, or Lobsters. He explains in detail how F5Bot is able to process millions of comments and posts from Reddit every day on a single VPS. You can check out more of Lewis Van Winkle’s writing at codeplea.com, and his open source contributions at github.com/codeplea.

Continue reading

No API Is the Best API — The elegant power of Power Assert

One of the core ideas behind Facebook’s React library is that there should be no need to learn a new API for things that you already know how to do in vanilla JavaScript. Why bother memorizing Angular’s ng-repeat syntax when you can just use good old Array.map()? That’s a good idea, and it’s a big part of what made the project so appealing to developers in the first place. So then why should Jest, the same company’s JavaScript testing framework and a popular choice among React developers, encourage you to learn a new assertion API, and to write code like

Continue reading

Recreating Python's Slice Syntax in JavaScript Using ES6 Proxies

I’ve noticed that JavaScript proxies seem to have been getting an increasing amount of attention recently. They were introduced by ECMAScript 2015 (ES6) several years ago, but they remain one of the less well-known features of the language. That’s a real shame because proxies are pretty awesome. They give you a level of flexibility that simply didn’t exist previously in JavaScript, and have allowed for projects like Remote Browser to become possible.

Continue reading

A Slack Community for Developers to Discuss Web Scraping

Want to link up with other developers interested in web scraping? Join the Web Scrapers Slack Channel to chat about Selenium, Puppeteer, Scrapy, or anything else related to web scraping. The last few years have been a very exciting time for web scraping. In that period, both Chrome and Firefox have introduced memory-efficient headless modes which allow them to run on Linux servers without requiring X11 and a virtual framebuffer like xvfb.

Continue reading

Building a YouTube MP3 Downloader with Exodus, FFmpeg, and AWS Lambda

Let’s focus on the easy part first: what we’ll be building in this tutorial. The end result will be a browser bookmarklet which can be used to convert YouTube videos to MP3s and download them. The basic interaction flow is that you click on the bookmarklet while on the page for a specific video, a new tab opens and displays a progress bar for the conversion, and then the download starts automatically as soon as it’s ready.

Continue reading

Running FFmpeg on AWS Lambda for 1.9% the cost of AWS Elastic Transcoder

Building a Media Transcoder with Exodus, FFmpeg, and AWS Lambda

When delivering media content over the internet, it’s important to keep in mind that factors like network bandwidth, screen resolution, and codec support will vary drastically between different devices and connections. Certain media encodings will be better suited for certain viewers, and transcoding source media to multiple formats is a must in order to ensure that you’re delivering the best possible experience to your users.

Continue reading