A few months back, I wrote a popular article called Making Chrome Headless Undetectable in response to one called Detecting Chrome Headless by Antione Vastel. The one thing that I was really trying to get across in writing that is that blocking site visitors based on browser fingerprinting is an extremely user-hostile practice. There are simply so many variations in browser configurations that you’re inevitably going to end up blocking non-automated access to your website, and–on top of that–you’re really not accomplishing anything in terms of blocking sophisticated web scrapers.
Have a great idea for an article?
We're always looking for guest contributors or article suggestions. Shoot us an email at email@example.com, we would love to hear yours!
Browser automation frameworks–like Puppeteer, Selenium, Marionette, and Nightmare.js–strive to provide rich APIs for configuring and interacting with web browsers. These generally work quite well, but you’re inevitably going to end up running into API limitations if you do a lot of testing or web scraping. You might find yourself wanting to conceal the fact that you’re using a headless browser, extract image resources from a web page, set the seed for Math.
In this post, I will highlight a few ways to save images while scraping the web through a headless browser. The simplest solution would be to extract the image URLs from the headless browser and then download them separately, but what if that’s not possible? Perhaps the images you need are generated dynamically or you’re visiting a website which only serves images to logged-in users. Maybe you just don’t want to put unnecessary strain on their servers by requesting the image multiple times.
The easiest way to install the latest Chrome version on RHEL, CentOS, and Amazon Linux versions 6.X and 7.X. # This installs Chrome on any RHEL/CentOS/Amazon Linux variant. curl https://intoli.com/install-google-chrome.sh | bash A Universal Installation Script for Google Chrome on Amazon Linux and CentOS 6 CentOS, Amazon Linux AMI, and Red Hat Enterprise Linux are three closely related GNU/Linux distributions which are all popular choices for server installations. They offer excellent performance and stability, but package availability can often be lacking.
Detecting Headles Chrome A short article titled Detecting Chrome Headless popped up on Hacker News over the weekend and it has since been making the rounds. Most of the discussion on Hacker News was focused around the author’s somewhat dubious assertion that web scraping is a “malicious task” that belongs in the same category as advertising fraud and hacking websites. That’s always a fun debate to get into, but the thing that I really took issue with about the article was that it implicitly promoted the idea of blocking users based on browser fingerprinting.
UPDATE: This article is updated regularly to reflect the latest information and versions. If you’re looking for instructions then skip ahead to see Setup Instructions. NOTE: Be sure to check out Running Selenium with Headless Chrome in Ruby if you’re interested in using Selenium in Ruby instead of Python. Background It has long been rumored that Google uses a headless variant of Chrome for their web crawls. Over the last two years or so it had started looking more and more like this functionality would eventually make it into the public releases and, as of this week, that has finally happened.
Running Google Chrome with an extension installed is quite simple because Chrome supports a --load-extension command-line argument for exactly this purpose. This can be specified before launching Chrome with Selenium by creating a ChromeOptions instance and calling add_argument(). from selenium import webdriver from selenium.common.exceptions import NoSuchElementException # Configure the necessary command-line option. options = webdriver.ChromeOptions() options.add_argument('--load-extension=path/to/the/extension') # Initalize the driver with the appropriate options. driver = webdriver.Chrome(chrome_options=options) The above code will setup a Selenium driver for Chrome with the extension located at path/to/extension preinstalled.
NOTE: Be sure to check out Running Selenium with Headless Chrome if you’re interested in using Selenium in Python instead of Ruby. Since Google added support to run Chrome and Chromium in headless mode as of version 59, it has become a popular choice for both testing and web scraping. There are a few Chrome-specific automation solutions out there, such as Puppeteer and Chrome Remote Interface, but Selenium remains a popular choice due to it’s uniform API across web browsers and it’s support for multiple programming languages.
Sometimes during the course of testing or web scraping with Google Chrome, you might desire to clear the browser cache and cookies with Selenium. You can of course call driver.close() on your current ChromeDriver instance and then provision a new one. The fresh instance of Chrome will start with a clean browser history, cookies, and cache. There are however times when this method loses other state that you may want to preserve.