Running Selenium with the new headless Chrome

By Evan Sangaline | April 14, 2017

Background

It has long been rumored that Google uses a headless variant of Chrome for their web crawls. Over the last two years or so, it had started looking more and more like this functionality would eventually make it into the public releases and, as of this week, that has finally happened. From version 59 onwards, it's possible to harness the power of V8, Blink, and the rest of Chrome in a non-graphical server environment.

This may not sound earth-shattering if you don’t deal with headless browsers very often but it’s actually a pretty big deal. To put this into context: PhantomJS, one of the current leaders in the space, has over 21k stars on GitHub and is used by companies such as Netflix and Twitter for both unit and performance testing. Vitaly Slobodin, the former maintainer of PhantomJS, had this to say after hearing the news.

I think people will switch to it, eventually. Chrome is faster and more stable than PhantomJS. And it doesn’t eat memory like crazy.

I don’t see any future in developing PhantomJS. Developing PhantomJS 2 and 2.5 as a single developer is a bloody hell. Even with recently released 2.5 Beta version with new and shiny QtWebKit, I can’t physically support all 3 platforms at once (I even bought the Mac for that!). We have no support. From now, I am stepping down as maintainer. If someone wants to continue - feel free to reach me.

Vitaly Slobodin

From this alone, it’s pretty clear that headless Chrome is going to play an increasingly important role in the headless browser space (and headless Firefox is just around the corner). This has huge repercussions for both automated testing and web scraping.

In the web scraping world, headless browsers are particularly useful when dealing with JavaScript-heavy sites. You can sometimes just fetch data from the same internal API endpoints that a JavaScript app is using, but it's becoming increasingly popular to put measures in place that prevent that. Using an actual browser lets you appear to the server in exactly the same way that a typical user would, which can be highly desirable in some circumstances. Headless browsers are also a great option if you're doing a broad crawl and want to parse DOM content without figuring out how the data is being sourced on each site.

Whether in the context of testing or web scraping, headless browsers are generally used in conjunction with software like Nightmare or Selenium to automate user interactions. Most of these are fairly comparable but Selenium is one of the most popular options and that’s what we’ll focus on in this article. Selenium uses the WebDriver API to interact with different backends (e.g. Chrome, Firefox, PhantomJS), has broad compatibility with a variety of testing frameworks, and is also widely used for web scraping.

By the end of this article, you should have Selenium up and running with Chrome in its new headless mode. We'll also take a brief look at how this setup can be used to automate interactions with Facebook. If you're new to Selenium, then this can serve as a light introduction.

Setup

The first thing we’ll need is a version of Chrome that includes the headless functionality (versions greater than or equal to 59). This requires a development build as of April 14th, 2017 but will soon also work in the beta channel and then eventually the stable channel. You can check the Chrome Release Blog to get up to date version information about the different channels.

The development version of Chrome can be downloaded by appending extra=devchannel to the normal download page URLs. The resulting page should autodetect your platform and offer the correct installer. If you have the option to use your system package manager to install Chrome, then I would recommend doing that instead (but the details will depend on the distribution). On Arch Linux, for example, it can be installed from the AUR via

yaourt -S google-chrome-dev
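If you're on a Debian-based distribution, then something along these lines should do the trick instead. Note that the .deb URL here just follows Google's usual naming convention for the unstable channel package, so double check it if the download fails.

# download and install the dev channel .deb package
wget https://dl.google.com/linux/direct/google-chrome-unstable_current_amd64.deb
sudo dpkg -i google-chrome-unstable_current_amd64.deb

# pull in any dependencies that dpkg couldn't resolve
sudo apt-get install -f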

Now that we have Chrome installed, we’ll need to set up a virtualenv and install our main dependencies.

# make the project directory
mkdir -p ~/scrapers/facebook
cd ~/scrapers/facebook

# create and activate a virtualenv
virtualenv env
. env/bin/activate

# install selenium locally
pip install selenium

# provides a nice repl, not needed for selenium
pip install ipython

We’ll also need to install a compatible version of ChromeDriver in order to connect Selenium to a headless Chrome instance. This code will automatically find the latest release and install the binary locally in env/bin/.

# platform options: linux32, linux64, mac64, win32
PLATFORM=linux64
VERSION=$(curl http://chromedriver.storage.googleapis.com/LATEST_RELEASE)
curl http://chromedriver.storage.googleapis.com/$VERSION/chromedriver_$PLATFORM.zip \
| bsdtar -xvf - -C env/bin/

If you’re on macOS then you’ll want to replace linux64 with mac64 or if you’re on Windows… well, if you’re on Windows then I hear that Windows Subsystem for Linux isn’t actually all that bad.
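If you'd rather not hardcode the platform at all, then a quick uname check can choose between the two 64-bit options automatically (32-bit users will still need to set PLATFORM by hand).

# autodetect mac64 vs linux64 based on the operating system
if [ "$(uname)" = "Darwin" ]; then
    PLATFORM=mac64
else
    PLATFORM=linux64
fi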

Once those commands complete, you can check that you selected the correct platform and that the download succeeded by running the chromedriver command and verifying that it produces output like the following.

(env) [sangaline@freon facebook]$ chromedriver
Starting ChromeDriver 2.29.461571 (8a88bbe0775e2a23afda0ceaf2ef7ee74e822cc5) on port 9515
Only local connections are allowed.

If you get an error instead then you should double check that you chose the correct platform (perhaps try linux32 instead of linux64). Once you can run chromedriver successfully, the basic setup process is complete!

You should be aware that we've noticed a little bit of bugginess with the latest 2.29 release, but most things work fine. We compiled our own slightly newer Linux 64 and macOS 64 binaries, but haven't noticed an appreciable difference. You're welcome to try them out instead of the official releases if you're feeling adventurous though!

Configuring Selenium

We’ll work within an interactive IPython shell here so that we can enter a few commands at a time. We already installed this in the virtualenv earlier so we simply have to run ipython to drop into a REPL.

Now let’s begin configuring Selenium to work with headless Chrome. We’ll do this by first creating a ChromeOptions object that we can use to configure the options that will be passed to the WebDriver initializer.

from selenium import webdriver

options = webdriver.ChromeOptions()

Next we'll need to configure the WebDriver to use the development channel Chrome executable. This is likely /usr/bin/google-chrome-unstable on Linux and /Applications/Google Chrome.app/Contents/MacOS/Google Chrome on the operating system formerly known as OS X (note that it needs to be the full path to the binary and not just the application directory). Once you've figured this out, and verified that you can launch the unstable executable from the command line, specify it as the binary_location on our ChromeOptions object.

# tell selenium to use the dev channel version of chrome
options.binary_location = '/usr/bin/google-chrome-unstable'

We will also need to specify that Chrome should be started in headless mode. This can be done with the add_argument method

options.add_argument('headless')

which is equivalent to adding --headless as a command-line argument.
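Incidentally, if you'd like to sanity check the headless functionality outside of Selenium, you can pass the same flag to Chrome directly on the command line (at the time of writing, the --disable-gpu flag was also required on some platforms).

# print the rendered DOM of a page without ever opening a window
google-chrome-unstable --headless --disable-gpu --dump-dom 'https://intoli.com/'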

You can now specify any additional options and then finally initialize the driver.

# set the window size
options.add_argument('window-size=1200x600')

# initialize the driver
driver = webdriver.Chrome(chrome_options=options)

If nothing happens then everything worked! Normally, a new browser window would pop open at this point with a warning about being controlled by automated test software. The window not appearing is exactly what we want in headless mode, and it means that we could be running our code on a server that doesn't even have a graphical environment. Everything from here on out is just standard Selenium, so if you were only trying to figure out how to get it working with Chrome in headless mode then that's it!
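As a quick sanity check, you can ask the invisible browser to fetch a page and report its title.

# fetch a page and print its title to confirm that the driver is responsive
driver.get('https://intoli.com/')
print(driver.title)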

Interacting With Facebook

Now that we have a WebDriver hooked up to a headless Chrome instance, we can use the standard Selenium API to run tests, scrape websites, or do whatever else we might be interested in. Let’s poke around on Facebook to see a little bit of what we can do. To call Facebook a “JavaScript heavy” site is a bit of an understatement so it serves as a good example of where headless browsers can be really useful.

We’ll start by navigating to the Facebook main page and grabbing the login form elements.

driver.get('https://facebook.com')

# wait up to 10 seconds for the elements to become available
driver.implicitly_wait(10)

# use css selectors to grab the login inputs
email = driver.find_element_by_css_selector('input[type=email]')
password = driver.find_element_by_css_selector('input[type=password]')
login = driver.find_element_by_css_selector('input[value="Log In"]')

The find_element_by_css_selector calls here will block until the elements are ready (thanks to implicitly_wait), so we’re immediately ready to enter our login credentials.
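As an aside, implicitly_wait() applies driver-wide; if you'd rather wait on a specific condition for a specific element, Selenium's explicit wait API can express that instead. Here's a minimal sketch of the email lookup rewritten that way.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds for this particular element to be present
email = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'input[type=email]'))
)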

email.send_keys('evan@intoli.com')
password.send_keys('hunter2')

Let’s take a quick screenshot to make sure that everything looks good so far before we actually submit the login form.

driver.get_screenshot_as_file('main-page.png')

Main Facebook login page

That seems about right, so now let’s proceed with actually submitting the form.

# login
login.click()

# navigate to my profile
driver.get('https://www.facebook.com/profile.php?id=100009447446864')

# take another screenshot
driver.get_screenshot_as_file('evan-profile.png')

Evan's Facebook profile

You can see that we’ve logged in successfully and can browse around as we please. At this point, we could easily write code to automate adding new posts, scraping content, or basically anything else that we could do by hand. Let’s do one last scraping related activity before we wrap it up.

posts = driver.find_elements_by_css_selector('#stream_pagelet .fbUserContent')
for post in posts:
    try:
        author = post.find_elements_by_css_selector('a[data-hovercard*=user]')[-1].get_attribute('innerHTML')
        content = post.find_elements_by_css_selector('div.userContent')[-1].get_attribute('innerHTML')
    except IndexError:
        # skip advertisements, which lack these elements
        continue
    print(f'{author}: "{content}"')

This will print out the name and text content of each post on your Facebook feed. It’s a very simplified example but it’s easy to imagine how you could build a much more powerful scraper in a similar way.
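As a small taste of that, here's a sketch of how the same loop could accumulate the posts into a list of dictionaries and dump them to a JSON file for later analysis. It's also good practice to call driver.quit() when you're done so that the headless Chrome process doesn't linger.

import json

# collect the post data instead of just printing it
scraped_posts = []
for post in posts:
    try:
        author = post.find_elements_by_css_selector('a[data-hovercard*=user]')[-1].get_attribute('innerHTML')
        content = post.find_elements_by_css_selector('div.userContent')[-1].get_attribute('innerHTML')
    except IndexError:
        # skip advertisements, which lack these elements
        continue
    scraped_posts.append({'author': author, 'content': content})

# save the results for later analysis
with open('posts.json', 'w') as f:
    json.dump(scraped_posts, f, indent=2)

# shut down the browser when we're done with it
driver.quit()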

Wrap Up

We’ve covered the process of running Selenium with the new headless functionality of Google Chrome. It’s probably a bit early to start using this in production but it’s fun to try things like this out as soon as you can. We suspect that a lot of exciting new projects are going to spring up around headless Chrome and we’re really looking forward to seeing what the future holds.

If you have any feedback or are looking for help putting together your own data solutions then please don’t hesitate to get in touch!