User-Agents — Generating random user agents using Google Analytics and CircleCI

By Evan Sangaline | August 30, 2018

If you’re in a hurry, you can head straight to the user-agents repository for installation and usage instructions!

While web scraping, it’s usually a good idea to create traffic patterns consistent with those that a human user would produce. This of course means being respectful and rate-limiting requests, but it often also means concealing the fact that the requests have been automated. Doing so helps avoid getting blocked by overzealous DDoS protection services, and allows you to successfully scrape the data that you’re interested in while keeping site operators happy.

Making realistic browsing patterns can get pretty complicated–we’ve previously explained some sophisticated techniques in articles like Making Chrome Headless Undetectable and It Is Not Possible to Detect and Block Chrome Headless. The Intoli Smart Proxy Service even goes far beyond the methods that those articles describe in order to create browsing patterns that are completely indistinguishable from human users. These advanced concealment strategies are necessary when scraping data from sites protected by bot-mitigation services like Distil and Incapsula, but they’re not always required when scraping websites that are less aggressive about blocking. For these more basic websites, using realistic User-Agent headers is sometimes all you need.

The only problem is that there aren’t a lot of great resources out there for generating realistic user agents. You need current analytics data from a high traffic website in order to generate random user agents, and this sort of data is seldom made public. There are a handful of paid solutions out there, but the free lists only offer a limited slice of data and usually become outdated very quickly (you can check out Wikipedia’s Usage share of web browsers article to see this for yourself). The situation with open source libraries for random user agent generation is even worse; they’re typically published once or twice and then never updated.

User agent statistics are only really useful for web scraping when they’re up to date, and the few truncated lists that you find when you Google things like “most common user agents” are generally too limited to apply at scale. The Intoli website gets a pretty healthy amount of traffic–and we’re big fans of open information–so this seemed like a natural opportunity for us to step in and provide the community with a useful resource for web scraping. Long story short: we used our site analytics data to generate realistic usage statistics, and built a new open-source JavaScript library called user-agents for generating random user agents. This is far from the first open-source library to tackle this problem, but we strongly believe that it fills a void in the available tooling.

A few of the key User-Agents features that set the library apart from existing solutions:

  • The user agent data is always up to date. We update the package daily with new data from the Intoli website.

  • The data not only includes user agent strings, but also the corresponding window.navigator properties like navigator.vendor and navigator.platform. This data is very difficult to come by, and is used extensively by bot-mitigation services to block browsers.

  • The data also includes viewport and screen sizes so that they can also be accurately emulated.

  • The random user agent generation can easily be filtered to restrict the user agents by device type, screen size, browser version, or any other available data.

  • The data is available as a raw compressed JSON file, so that it can be easily used in other programming languages.

The package itself is available on npm, and it can be installed by running the following.

# Or with yarn: `yarn add user-agents`
npm install user-agents

Then basic usage is as simple as

import UserAgent from 'user-agents';

const userAgent = new UserAgent();
// Example output:
//   User Agent: "Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
console.log(`User Agent: "${userAgent}"`);

which will output a random user agent based on how commonly they’re observed in real user data. More advanced usage can include filtering the user agents based on properties like device category

// Restricts the generated user agents to correspond to mobile devices.
const userAgent = new UserAgent({ deviceCategory: 'mobile' });

or accessing detailed properties on the generated user agent objects. For example, you could access the generated viewport size like this:

const userAgent = new UserAgent();
// Example output: 800x600
console.log(`${userAgent.viewportWidth}x${userAgent.viewportHeight}`);
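To make the idea of picking user agents "based on how commonly they’re observed" more concrete, here is a minimal sketch of frequency-weighted selection with an optional filter. It is not the library's actual implementation: the `entries` array, its `weight` field, and the `pickWeighted` helper are hypothetical names used only for illustration, and the user agent strings are placeholders.

```javascript
// Hypothetical sample data: each entry carries a relative weight (e.g. a share of sessions).
const entries = [
  { userAgent: 'Desktop Chrome (placeholder)', deviceCategory: 'desktop', weight: 0.6 },
  { userAgent: 'Mobile Safari (placeholder)', deviceCategory: 'mobile', weight: 0.3 },
  { userAgent: 'Tablet Chrome (placeholder)', deviceCategory: 'tablet', weight: 0.1 },
];

// Picks one entry with probability proportional to its weight.
// `filter` narrows the candidates; `random` is injectable so results can be reproduced.
function pickWeighted(items, filter = () => true, random = Math.random) {
  const candidates = items.filter(filter);
  if (candidates.length === 0) {
    throw new Error('No entries matched the filter.');
  }
  const total = candidates.reduce((sum, item) => sum + item.weight, 0);
  let threshold = random() * total;
  for (const item of candidates) {
    threshold -= item.weight;
    if (threshold < 0) {
      return item;
    }
  }
  // Guard against floating-point rounding at the upper edge.
  return candidates[candidates.length - 1];
}

const anyDevice = pickWeighted(entries);
const mobileOnly = pickWeighted(entries, (entry) => entry.deviceCategory === 'mobile');
console.log(anyDevice.userAgent, '|', mobileOnly.userAgent);
```

Filtering before computing the total means the remaining candidates keep their relative proportions, so a mobile-only selection still favors the more common mobile entries.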

More details on usage and the API are available in the User-Agents repository on GitHub. That’s a good place to start if you’re interested in using User-Agents in your web scraping project. Also be sure to star the repository while you’re over there; it lets us know that people use the project and encourages us to devote more developer resources towards it!

In the rest of this article, we’ll explain how we keep the User-Agents data up to date. The basic idea is that we run a scheduled build on CircleCI every night that fetches the data from Google Analytics and digests it into an anonymized form that’s then committed to the repository and published on npm. You don’t need to understand these details in order to use the User-Agents package, but we thought it was interesting enough to share in case people were curious about how it works. We also wanted to be completely transparent about how this data is collected and exactly what it’s used for.

How It Works

Collecting the data

Like many websites, we use Google Analytics to monitor traffic on our site. Google Analytics tracks a variety of dimensions by default, but the browser user agent isn’t one of them. It breaks this information up into separate quantities like the name of the browser, the version of the browser, etc. In order to track the user agent directly, we needed to add a custom dimension to our analytics.

Add Custom Dimension

You can see here that we set the dimension to be session-scoped so that the user agent data would be weighted by visitors to the site rather than pageviews. To start actually collecting this data, we simply needed to use the Google Analytics set command to specify that the user agent is equal to the value of the navigator.userAgent property.

ga('set', 'dimension1', navigator.userAgent);

We also added additional custom dimensions for related quantities like navigator.appName, but we’ll skip over these in the code examples for brevity. Note that we only use these quantities to generate anonymized data for inclusion in the User-Agents package. If you would like to prevent analytics services from collecting this sort of information, then we highly recommend installing uBlock Origin in your browser to block tracking. It also blocks ads and malware, so it’s a useful extension all around.

Configuring API Access to the Raw Data

After we started tracking the data, we needed to be able to access it via an API so that we could automate the process of updating the User-Agents package. Google has a concept of service accounts which can be used to allow exactly this sort of access. To configure this, we first created a new project called “User-Agents NPM Package” on Google’s service accounts page

Create Google Project

and then enabled the Google Analytics Reporting API for the project.

Enable Report API v3

Note that we enabled version 3 of the API rather than the newer version 4, so that we could use the handy ga-api package for authenticating and accessing the API. We’ll come back to that later after we finish setting up the service account and access credentials.

Next, we added a set of credentials to the project, and created a service account in the process.

Add Credentials to the Project

This generated a JSON credential file for the project that we downloaded and saved as google-analytics-credentials.json.

Service Account Created

The JSON credentials are what we use to authenticate with the analytics API, and their contents look something like this.

{
  "type": "service_account",
  "project_id": "user-agents-npm-package",
  "private_key_id": "99f45b8c31520345ab960f17add21da91fc7d2b5",
  "private_key": "-----BEGIN PRIVATE KEY-----[REDACTED]-----END PRIVATE KEY-----",
  "client_email": "user-agents-npm-package-update@user-agents-npm-package.iam.gserviceaccount.com",
  "client_id": "118408973529835432350",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/user-agents-npm-package-update%40user-agents-npm-package.iam.gserviceaccount.com"
}

The client_email field is of particular interest here; it’s a unique email address that has been assigned to the service account. As a final step, we needed to create a Google Analytics user for this email address that had read-access to our analytics property.

Add Google Analytics Permissions for the Email Address

With this last piece in place, we were ready to start accessing the API and looking at real data.

Updating the Project Data

The ga-api JavaScript package is a light wrapper around the Google Analytics API that simplifies the process of authenticating and making requests. It can be installed using npm or yarn by running the following.

# Or with yarn: `yarn add ga-api@0.0.4`
npm install ga-api@0.0.4

Note that I’ve included an exact version here because I’m a bit wary of passing credentials to a third-party library, but I’ve personally reviewed the code for the v0.0.4 release and know that it’s safe.

Using the library is fairly straightforward and well-documented in the project’s README. As a quick example, the following code queries the analytics property and outputs rows of data showing the count of sessions for each observed user agent.

const gaApi = require('ga-api');

const accountOptions = {
  clientId: 'user-agents-npm-package-update.apps.googleusercontent.com',
  email: 'user-agents-npm-package-update@user-agents-npm-package.iam.gserviceaccount.com',
  key: 'google-analytics-credentials.json',
  ids: 'ga:115995502',
};

const queryOptions = {
  startDate: '2018-08-26',
  endDate: '2018-08-26',
  dimensions: 'ga:dimension1',
  metrics: 'ga:sessions',
};

gaApi({ ...accountOptions, ...queryOptions }, (error, data) => {
  if (error) {
    console.error(error);
  } else {
    console.log(data.rows);
  }
});

It outputs something similar to the following when run.

[
  [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "8356"
  ],
  [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "7311"
  ],
  [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
    "6908"
  ],
  [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15",
    "5921"
  ],
  ["// And so on..."]
]

Given this raw data, we simply needed to save it into a JSON file that can be used by the user-agents package to generate random user agents. The exact details of this are a bit more complicated than the example given here, but you can check out the real code in update-data.js and the real data in user-agents.json.gz if you would like to see the full implementation.
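As a rough illustration of that digestion step, the sketch below turns rows shaped like the output above into entries with a normalized weight, so that the raw session counts are not kept. This is an assumption about what such a step could look like, not the contents of update-data.js; `rowsToWeightedEntries` and the `weight` field are hypothetical names, and the user agent strings are shortened placeholders.

```javascript
// Rows in the shape shown above: [userAgent, sessionCountAsString].
// The strings are shortened placeholders; the counts match the example output.
const rows = [
  ['Chrome 68 on Windows (placeholder)', '8356'],
  ['Chrome 67 on Windows (placeholder)', '7311'],
  ['Firefox 61 on Windows (placeholder)', '6908'],
  ['Safari 11 on macOS (placeholder)', '5921'],
];

// Converts raw rows into entries with a normalized weight, dropping the raw counts.
function rowsToWeightedEntries(rawRows) {
  const parsed = rawRows.map(([userAgent, sessions]) => ({
    userAgent,
    sessions: Number.parseInt(sessions, 10),
  }));
  const total = parsed.reduce((sum, row) => sum + row.sessions, 0);
  return parsed.map(({ userAgent, sessions }) => ({
    userAgent,
    weight: sessions / total,
  }));
}

const weighted = rowsToWeightedEntries(rows);
console.log(JSON.stringify(weighted, null, 2));
```

Note that the API returns the session counts as strings, so they need to be parsed before any arithmetic is done on them.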

Releasing New Versions Automatically

The major issue with existing user agent generation libraries is that they’re all woefully out of date. The random-user-agent and random-useragent JavaScript packages were both last updated two years ago. Even the purportedly always-up-to-date fake-useragent Python package was last updated six months ago. There are literally dozens of other similar projects out there that have all apparently been abandoned. It’s understandable that a project’s author would lose interest in manually updating a package like this regularly, and that’s exactly why we thought that an automated release process was so important for the User-Agents package.

We use CircleCI as our continuous integration provider, and they conveniently provide a built-in mechanism to perform scheduled builds. You can check out the user-agents project’s full config.yml configuration file for all of the details, but we’ll also walk through the general idea here. To start with, there’s a CircleCI job that handles updating the data.

  update:
    <<: *defaults
    steps:
      - attach_workspace:
          at: ~/user-agents
      - restore_cache:
          key: dependency-cache-{{ checksum "yarn.lock" }}
      - run:
          name: Update the user agents data
          command: |
            echo "$GOOGLE_ANALYTICS_CREDENTIALS" | base64 --decode > ./google-analytics-credentials.json
            yarn update-data
      - persist_to_workspace:
          root: ~/user-agents
          paths:
          <<: *whitelist

The update-data script that is run here basically corresponds to a more complicated version of the script that we developed in the last section.

Before we actually run the update-data script, we first populate the google-analytics-credentials.json file that is required to access the raw data from the Google Analytics API.

echo "$GOOGLE_ANALYTICS_CREDENTIALS" | base64 --decode > ./google-analytics-credentials.json

In order to make this work, we first base64 encoded the credentials by running

base64 --wrap 0 google-analytics-credentials.json

and then added the contents to the CircleCI build environment as an environment variable called GOOGLE_ANALYTICS_CREDENTIALS.
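The encode and decode steps can also be sketched in JavaScript, which may help if you want to check a value before pasting it into the CircleCI settings. This is only an illustration of the round trip: the `credentials` object is a placeholder, and `encodeCredentials` and `decodeCredentials` are hypothetical helper names rather than anything used in the project.

```javascript
// Placeholder credentials object; real values would come from the downloaded JSON file.
const credentials = {
  type: 'service_account',
  client_email: 'example@example-project.iam.gserviceaccount.com',
};

// Mirrors `base64 --wrap 0`: produces one unbroken line of base64 text.
function encodeCredentials(value) {
  return Buffer.from(JSON.stringify(value), 'utf8').toString('base64');
}

// Mirrors `base64 --decode`, followed by parsing the decoded JSON.
function decodeCredentials(encoded) {
  return JSON.parse(Buffer.from(encoded, 'base64').toString('utf8'));
}

const encoded = encodeCredentials(credentials);
const decoded = decodeCredentials(encoded);
console.log(encoded.includes('\n'), decoded.client_email);
```

Encoding the file as a single line of base64 sidesteps any quoting or newline issues that a multi-line JSON document could otherwise cause when stored in an environment variable.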

Add an Environment Variable to CircleCI

The contents of this environment variable are obviously sensitive, so we also disable providing secrets to forked builds in the CircleCI project settings. Not doing so is a common configuration error that can lead to credentials getting stolen.

Disable passing secrets to forked builds

Next, there’s a publish-new-version job that commits the updated data to the repository and pushes up the changes to GitHub.

  publish-new-version:
    <<: *defaults
    steps:
      - attach_workspace:
          at: ~/user-agents
      - run:
          name: Commit the newly downloaded data
          command: |
            git add src/*
            # Configure some identity details for the machine deployment account.
            git config --global user.email "user-agents@intoli.com"
            git config --global user.name "User Agents"
            git config --global push.default "simple"
            # Disable strict host checking.
            mkdir -p ~/.ssh/
            echo -e "Host github.com\n\tStrictHostKeyChecking no\n" >> ~/.ssh/config
            # The status code will be 1 if there are no changes,
            # but we want to publish anyway to stay on a regular schedule.
            git commit -m 'Regularly scheduled user agent data update.' || true
      - run:
          name: Bump the patch version and trigger a new release
          command: npm version patch && git push && git push --tags

In order to provide write access to the repository, we created a GitHub machine user called user-agents with the appropriate collaborator access.

GitHub Collaborators

We generated an SSH key for this machine user’s account, and specified in the CircleCI UI that the build should use this write-access key instead of the default checkout key.

CircleCI Checkout Keys

With this in place, the publish-new-version job was able to commit the updated data, create a new patch version for the project, and push up the changes to GitHub. Pushing these changes, however, doesn’t automatically publish a new version on NPM.

To allow automatically publishing new package versions, we added an additional CircleCI job called deploy.

  deploy:
    <<: *defaults
    steps:
      - attach_workspace:
          at: ~/user-agents
      - run:
          name: Write NPM Token to ~/.npmrc
          command: echo "//registry.npmjs.org/:_authToken=$NPM_TOKEN" >> ~/.npmrc
      - run:
          name: Install dot-json package
          command: npm install dot-json
      - run:
          name: Write version to package.json
          command: $(yarn bin)/dot-json package.json version ${CIRCLE_TAG:1}
      - run:
          name: Publish to NPM
          command: npm publish --access=public

The logic here is fairly simple:

  1. We populate the ~/.npmrc file with an NPM authorization token that’s stored in an NPM_TOKEN environment variable in a similar way to how we stored the Google Analytics API credentials in GOOGLE_ANALYTICS_CREDENTIALS.

  2. We use the dot-json package to update the package version in package.json to match the version tag. The npm version patch command from the publish-new-version job will create tags like v0.0.2, and this code anticipates those being present in the CIRCLE_TAG environment variable (which is added automatically to the environment by CircleCI).

  3. We publish the new package to NPM using npm publish.
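The relationship between the tag and the version written to package.json can be sketched as follows. The `${CIRCLE_TAG:1}` expression in the job is a bash substring that drops the first character of the tag; the JavaScript below imitates that, along with a simplified patch bump. Both `tagToVersion` and `bumpPatch` are hypothetical helpers for illustration, and `bumpPatch` only handles plain major.minor.patch versions.

```javascript
// Strips the leading "v" from a tag such as "v0.0.2", like ${CIRCLE_TAG:1} in bash.
function tagToVersion(tag) {
  return tag.slice(1);
}

// Increments the last component of a plain "major.minor.patch" version string.
function bumpPatch(version) {
  const [major, minor, patch] = version.split('.').map(Number);
  return `${major}.${minor}.${patch + 1}`;
}

const nextTag = `v${bumpPatch('0.0.1')}`;
// Prints: v0.0.2 0.0.2
console.log(nextTag, tagToVersion(nextTag));
```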

The final piece of the puzzle is to set up the conditions under which these various jobs will run. We did this by defining two CircleCI workflows in the config.yml file.

workflows:
  version: 2

  scheduled-release:
    triggers:
      - schedule:
          cron: "00 06 * * *"
          filters:
            branches:
              only:
                - master

    jobs:
      - checkout
      - update:
          requires:
            - checkout
      - build:
          requires:
            - update
      - test:
          requires:
            - build
      - publish-new-version:
          requires:
            - test

  release:
    jobs:
      - checkout:
          filters:
            tags:
              only: /v[0-9]+(\.[0-9]+)*/
            branches:
              ignore: /.*/
      - build:
          filters:
            tags:
              only: /v[0-9]+(\.[0-9]+)*/
            branches:
              ignore: /.*/
          requires:
            - checkout
      - test:
          filters:
            tags:
              only: /v[0-9]+(\.[0-9]+)*/
            branches:
              ignore: /.*/
          requires:
            - build
      - deploy:
          filters:
            tags:
              only: /v[0-9]+(\.[0-9]+)*/
            branches:
              ignore: /.*/
          requires:
            - test

The scheduled-release workflow is configured to run on a cron schedule at 6 AM UTC every day. It runs a sequence of jobs: checkout, update, build, test, and then publish-new-version. We covered the update and publish-new-version jobs above, and the others are fairly self-explanatory: checkout checks out the code from the repository, build builds the project, and test runs the project tests.
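CircleCI interprets the cron expression in UTC, which is easy to overlook when reasoning about when a release will land. As a small sketch, the hypothetical `nextDailyRunUtc` helper below computes the next firing time for a daily schedule like "00 06 * * *". It only covers this simple once-a-day case and is not a general cron parser.

```javascript
// Returns the next time a daily "MM HH * * *" schedule fires after `now`, in UTC.
function nextDailyRunUtc(now, hour, minute) {
  const next = new Date(Date.UTC(
    now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hour, minute, 0, 0,
  ));
  // If today's slot has already passed (or is exactly now), move to tomorrow.
  if (next <= now) {
    next.setUTCDate(next.getUTCDate() + 1);
  }
  return next;
}

const now = new Date('2018-08-30T07:15:00Z');
// Prints: 2018-08-31T06:00:00.000Z
console.log(nextDailyRunUtc(now, 6, 0).toISOString());
```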

Finally, we define a release workflow that runs the checkout, build, test, and deploy jobs. The key here is that the jobs in this workflow only run when there’s a new git tag matching the /v[0-9]+(\.[0-9]+)*/ regular expression. This condition will be met both when we run npm version locally, and when it’s run in the scheduled-release workflow’s publish-new-version job. This allows us to publish things manually while allowing the automated releases to run on a daily basis.
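To get a feel for which tags would trigger the release workflow, the tag filter can be tried out in JavaScript. One caveat is that CircleCI requires filter expressions to match the entire string, whereas a JavaScript regular expression matches substrings by default, so the sketch adds explicit `^` and `$` anchors. The `isReleaseTag` helper and the sample tags are made up for illustration.

```javascript
// The same pattern as in config.yml, anchored because CircleCI requires the
// expression to match the entire tag while JavaScript matches substrings.
const releaseTagPattern = /^v[0-9]+(\.[0-9]+)*$/;

function isReleaseTag(tag) {
  return releaseTagPattern.test(tag);
}

// Sample tags chosen for illustration only.
for (const tag of ['v0.0.2', 'v1', 'v1.2.3.4', '0.0.2', 'v1.0.0-beta', 'release-v1.0.0']) {
  console.log(tag, isReleaseTag(tag));
}
```

The pattern accepts any number of dot-separated numeric components after the `v`, so tags like `v1` and `v1.2.3.4` would trigger a release, while tags with prerelease suffixes would not.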

Conclusion

Well, we hope that you’ve enjoyed learning about how we keep the User Agents package consistently up to date. Automation is the key, and both the user-agents GitHub repository and the user-agents NPM package are updated every morning like clockwork. This means that when we say “always up to date,” we mean it!

If your web scraping tasks are still getting blocked when using random user agents, then be sure to check out the Intoli Smart Proxy Service. It integrates all of the web-scraping best-practices that we’ve learned over the years, and it works with pretty much any web scraping software that you might be using.
