Puppeteer

It is *not* possible to detect and block Chrome headless

A few months back, I wrote a popular article called Making Chrome Headless Undetectable in response to one called Detecting Chrome Headless by Antoine Vastel. The main point I was trying to get across in that article was that blocking site visitors based on browser fingerprinting is an extremely user-hostile practice. There are simply so many variations in browser configurations that you're inevitably going to end up blocking non-automated access to your website, and, on top of that, you're really not accomplishing anything in terms of blocking sophisticated web scrapers.

Continue reading

JavaScript Injection with Selenium, Puppeteer, and Marionette in Chrome and Firefox

Browser automation frameworks, like Puppeteer, Selenium, Marionette, and Nightmare.js, strive to provide rich APIs for configuring and interacting with web browsers. These generally work quite well, but you're inevitably going to end up running into API limitations if you do a lot of testing or web scraping. You might find yourself wanting to conceal the fact that you're using a headless browser, extract image resources from a web page, or set the seed for Math.random.
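As a concrete illustration of that last idea, injected JavaScript can swap Math.random for a seeded generator so page behavior becomes deterministic. A minimal sketch, using a mulberry32-style generator (the function name and seed value here are illustrative, not from the post):

```javascript
// A seeded PRNG (mulberry32) that can stand in for Math.random
// so that "random" page behavior becomes reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Map the 32-bit result into [0, 1), like Math.random.
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Install it over the built-in. With Puppeteer, a snippet like this
// would typically be passed to page.evaluateOnNewDocument() so that
// it runs before any of the page's own scripts.
Math.random = mulberry32(42);
```

The same injection pattern applies to the other examples mentioned above: anything evaluated before the page's scripts can redefine or wrap browser APIs.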

Continue reading

Saving Images from a Headless Browser

In this post, I will highlight a few ways to save images while scraping the web through a headless browser. The simplest solution would be to extract the image URLs from the headless browser and then download them separately, but what if that’s not possible? Perhaps the images you need are generated dynamically or you’re visiting a website which only serves images to logged-in users. Maybe you just don’t want to put unnecessary strain on their servers by requesting the image multiple times.

Continue reading

Using Puppeteer to Scrape Websites with Infinite Scrolling

Infinite scrolling has become a ubiquitous design pattern on the web. Social media sites like Facebook, Twitter, and Instagram all feature infinitely scrolling feeds to keep users engaged with an essentially unbounded amount of content. This mechanism is typically implemented by using JavaScript to detect when the user has scrolled far enough down the existing feed, and then querying an underlying API endpoint for the next batch of data that gets processed and dynamically injected into the page.
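Stripped of browser specifics, scraping such a feed boils down to a loop: scroll to the bottom, wait for new content to be injected, and stop once the item count stops growing. A sketch of that control flow with injected callbacks, so it can run anywhere; `scrapeInfiniteFeed` and its parameter names are illustrative, not the post's code:

```javascript
// Generic scroll-until-exhausted loop. In a real Puppeteer script,
// getItemCount and scrollToBottom would be page.evaluate() calls,
// with a short delay between iterations so injected content can load.
async function scrapeInfiniteFeed({ getItemCount, scrollToBottom, maxScrolls = 50 }) {
  let previousCount = await getItemCount();
  for (let i = 0; i < maxScrolls; i++) {
    await scrollToBottom();
    const count = await getItemCount();
    if (count === previousCount) {
      break; // no new items were injected, so the feed is exhausted
    }
    previousCount = count;
  }
  return previousCount;
}
```

With Puppeteer, `getItemCount` might be something like `() => page.evaluate(() => document.querySelectorAll('.item').length)`, with the selector depending on the site being scraped. The `maxScrolls` cap guards against feeds that genuinely never end.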

Continue reading