By Andre Perunicic | January 11, 2018
Infinite scrolling has become a ubiquitous design pattern on the web. Social media sites like Facebook, Twitter, and Instagram all feature infinitely scrolling feeds to keep users engaged with an essentially unbounded amount of content. Here’s what that looks like on Instagram, for example.
This mechanism is typically implemented by using JavaScript to detect when the user has scrolled far enough down the existing feed, and then querying an underlying API endpoint for the next batch of data that gets processed and dynamically injected into the page. If you’re interested in scraping data from an infinitely scrolling page, your first approach might be to query these API endpoints directly. In fact, we wrote a guide about scraping steampowered.com which features a section describing how to do this in one specific case.
As you can imagine, reverse-engineering a site’s data delivery mechanism can both be time-consuming and require some knowledge of the underlying technologies. Instagram’s tag feeds, like #javascript which is pictured above, get new post data delivered through GraphQL API endpoints that use query IDs, cursors, and pagination to incrementally deliver more data. Other sites might deliver data in real time through WebSockets, or process data from multiple endpoints before injecting something into the page. In these cases, it often makes sense to use a headless browser to emulate scrolling on the page and simply get the data you need from the rendered elements.
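To make the comparison concrete, directly consuming a cursor-paginated endpoint usually amounts to a loop like the sketch below. Everything here is hypothetical: the endpoint URL, the cursor parameter, and the response fields are made up, and the snippet assumes the node-fetch package is installed. Real sites typically layer query IDs, signatures, or authentication on top of this.

const fetch = require('node-fetch');

// Hypothetical cursor-paginated feed endpoint; the URL, parameter name, and
// response fields are illustrative rather than taken from any real site.
const endpoint = 'https://example.com/api/feed';

async function fetchAllItems(maxItems) {
  const items = [];
  let cursor = null;
  while (items.length < maxItems) {
    const url = cursor ? `${endpoint}?cursor=${encodeURIComponent(cursor)}` : endpoint;
    const response = await fetch(url);
    const data = await response.json();
    items.push(...data.items);  // assumed response field
    cursor = data.nextCursor;   // assumed response field
    if (!cursor) break;         // no further pages are available
  }
  return items;
}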
In this article, I will demonstrate how to use Puppeteer to scrape data from a page that uses infinite scrolling. Puppeteer is a relatively new contender in the browser automation space that uses a user-friendly interface built on top of the DevTools Protocol API to drive a bundled version of Chromium. This enables short scripts that, with a bit of patience, allow you to easily get as much infinite scroll data as the web page will show you!
A Simple Infinite Scrolling Demo
I put together a basic infinite scrolling demo page just for this article. The page doesn’t actually make any API round trips to get new data for the infinite scroll. Instead, it emulates realistic behavior by injecting new HTML elements at the bottom of the page half a second after the user has scrolled far enough. After 110 items have been loaded, the delay between scrolling and loading new items increases to 31 seconds in order to emulate request throttling. Since our script will be grabbing data directly from rendered elements, this won’t make any practical difference to how it’s written! After all, one of the benefits of using a headless browser to scrape the web is that you don’t really need to understand how the site fetches and processes the underlying data.
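If you are curious how a page like this might be wired up, the general pattern is a scroll listener that appends new elements after a delay. The sketch below is not the demo page’s actual source; the batch size and scroll threshold are made up, while the #boxes container, the box class, and the delays match the description above.

// Runs in the browser: append a batch of boxes shortly after the user
// scrolls near the bottom of the page.
let loadedCount = 0;
let loading = false;
window.addEventListener('scroll', () => {
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.scrollHeight - 100;
  if (!nearBottom || loading) return;
  loading = true;
  const delay = loadedCount >= 110 ? 31000 : 500;  // emulate request throttling
  setTimeout(() => {
    for (let i = 0; i < 10; i++) {  // the batch size here is arbitrary
      loadedCount += 1;
      const box = document.createElement('div');
      box.className = 'box';
      box.innerText = `Infinite Scroll Box ${loadedCount}`;
      document.getElementById('boxes').appendChild(box);
    }
    loading = false;
  }, delay);
});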
Scraping Data from an Infinite Scroll
Assuming you have npm installed, getting a Puppeteer project started is as simple as executing the following:
mkdir infinite-scroll
cd infinite-scroll
npm install --save puppeteer
This will also install a bundled version of the Chromium browser for use by Puppeteer, so we can focus on writing the scraping script right away.
If you’d like to explore the finished code yourself, you can check it out from our article materials GitHub repository.
Otherwise, create a file named scrape-infinite-scroll.js in your favorite text editor and add the following to it.
const fs = require('fs');
const puppeteer = require('puppeteer');
// Evaluated in the browser's context by Puppeteer; returns the text of
// every item that has been rendered on the page so far.
function extractItems() {
  const extractedElements = document.querySelectorAll('#boxes > div.box');
  const items = [];
  for (let element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}
// Scrolls the page and re-extracts items until at least itemTargetCount
// items have been collected, or until scrolling stops producing new content.
async function scrapeInfiniteScrollItems(
  page,
  extractItems,
  itemTargetCount,
  scrollDelay = 1000,
) {
  let items = [];
  try {
    let previousHeight;
    while (items.length < itemTargetCount) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await page.waitFor(scrollDelay);
    }
  } catch (e) { }
  return items;
}
(async () => {
  // Set up browser and page.
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 926 });

  // Navigate to the demo page.
  await page.goto('https://intoli.com/blog/scrape-infinite-scroll/demo.html');

  // Scroll and extract items from the page.
  const items = await scrapeInfiniteScrollItems(page, extractItems, 100);

  // Save extracted items to a file.
  fs.writeFileSync('./items.txt', items.join('\n') + '\n');

  // Close the browser.
  await browser.close();
})();
Running the script with node scrape-infinite-scroll.js causes it to visit the demo page, scroll until 100 #boxes > div.box “items” have been loaded, and save the text from the extracted items in ./items.txt.
Running tail ./items.txt shows us that the last 10 lines of the file are indeed:
Infinite Scroll Box 91
Infinite Scroll Box 92
Infinite Scroll Box 93
Infinite Scroll Box 94
Infinite Scroll Box 95
Infinite Scroll Box 96
Infinite Scroll Box 97
Infinite Scroll Box 98
Infinite Scroll Box 99
Infinite Scroll Box 100
Let’s briefly discuss how the script works.
Because Puppeteer’s methods are Promise-based, placing everything in an async wrapper lets us await key steps in the script and write the code as if it executes synchronously.
The initial few lines are just boilerplate that handles configuring and starting the browser, and directing the headless browser to the page we wish to scrape.
The actual scrolling and extraction are done with a call to scrapeInfiniteScrollItems. This function uses page.evaluate to repeatedly scroll to the bottom of the page and extract any available items via the injected extractItems function, until at least itemTargetCount items have been extracted, as you can see from this block:
let previousHeight;
while (items.length < itemTargetCount) {
  items = await page.evaluate(extractItems);
  previousHeight = await page.evaluate('document.body.scrollHeight');
  await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
  await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
  await page.waitFor(scrollDelay);
}
The wait for scrollDelay milliseconds between scrolls is there to help avoid request throttling in production.
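Since scrollDelay is simply the last argument of scrapeInfiniteScrollItems, slowing things down is a one-line change. For example, waiting two seconds between scrolls would look like this:

const items = await scrapeInfiniteScrollItems(page, extractItems, 100, 2000);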
It’s worth noting that items = await page.evaluate(extractItems); will serialize the extractItems function before evaluating it in the browser’s context, making the lexical environment in which the function was defined unavailable during execution. Make sure to include everything you need for item extraction in the function’s definition.
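To illustrate the pitfall with a hypothetical variation (neither of these functions appears in the script above): referencing a variable defined outside of the extraction function works fine in Node, but fails once Puppeteer serializes the function and runs it in the page.

const selector = '#boxes > div.box';  // defined in Node, not in the page

function brokenExtractItems() {
  // Throws a ReferenceError in the browser context: `selector` is not
  // serialized along with the function, so it does not exist in the page.
  return Array.from(document.querySelectorAll(selector), element => element.innerText);
}

function workingExtractItems() {
  // Works: everything the function needs is defined inside its own body.
  const selector = '#boxes > div.box';
  return Array.from(document.querySelectorAll(selector), element => element.innerText);
}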
You might be wondering what happens if the script is never able to extract 100 items from the page. In Puppeteer, functions that evaluate JavaScript on the page, like page.waitForFunction, generally have a 30 second timeout (which can be customized). The call

await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);

will wait for the height of the page to increase after each scroll, presumably when the page loads more items, and break the while loop by throwing an error if the height doesn’t change for 30 seconds. The error is handled thanks to the try-catch block surrounding the loop, which simply does nothing when an error is caught.
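If you expect new items to take longer than that to load, the timeout can be adjusted by passing an options object to page.waitForFunction. For example, the following call waits up to 60 seconds for the page height to grow:

await page.waitForFunction(
  `document.body.scrollHeight > ${previousHeight}`,
  { timeout: 60000 },
);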
I previously mentioned that the demo page starts injecting items with a 31 second delay after 110 items have been loaded. You can play with this by increasing the itemTargetCount to 120, say, with a call like

const items = await scrapeInfiniteScrollItems(page, extractItems, 120);

and possibly logging or re-throwing the error in the catch block, as sketched below. Feel free to modify the complete script to your liking, and let us know your thoughts in the comments below!
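For instance, replacing the empty catch block in scrapeInfiniteScrollItems with something like the following will surface the timeout instead of silently swallowing it:

} catch (e) {
  // With the demo's 31 second injection delay and Puppeteer's default 30 second
  // timeout, the wait gives up once roughly 110 items have been loaded.
  console.log('Stopped scrolling early:', e.message);
}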
Conclusion
We saw that we can use Puppeteer to scrape infinite scrolls without having to dig into the underlying data-delivery mechanism. There are of course times when this strategy is undesirable, such as when you want to resume scraping from the middle of a feed at a later time, but the script developed in this article should be easy to customize and serve as a starting point for emulating human-like scrolling on a web page.
If you enjoyed this article, consider subscribing to our mailing list or browsing the rest of our blog. We offer a broad range of scraping, development, and automation services here at Intoli, so please don’t hesitate to contact us if you need help on a project!