JavaScript


Saving Images from a Headless Browser

In this post, I will highlight a few ways to save images while scraping the web through a headless browser. The simplest solution would be to extract the image URLs from the headless browser and then download them separately, but what if that’s not possible? Perhaps the images you need are generated dynamically or you’re visiting a website which only serves images to logged-in users. Maybe you just don’t want to put unnecessary strain on their servers by requesting the image multiple times.
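
To make that motivation concrete, here is a minimal sketch of one possible approach, assuming Puppeteer (the post itself covers more than one technique, and the URL and filenames below are placeholders rather than code from the post): intercept image responses and save the bytes the browser already downloaded, so the server is never asked for the same image twice.

```javascript
// A sketch, assuming Puppeteer: save each image from its intercepted
// response body instead of re-requesting it (error handling omitted).
const fs = require('fs');
const path = require('path');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  page.on('response', async (response) => {
    if (response.request().resourceType() === 'image') {
      // Derive a filename from the URL path; purely illustrative.
      const filename = path.basename(new URL(response.url()).pathname) || 'image';
      fs.writeFileSync(filename, await response.buffer());
    }
  });

  await page.goto('https://example.com', { waitUntil: 'networkidle0' });
  await browser.close();
})();
```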

Continue reading

Designing The Wayback Machine Loading Animation

SVGs and Wayback Machine Logos I’ve gotta say, I’m a big fan of Scalable Vector Graphics (SVG). They’re an open standard, they’re supported by all major browsers, they often take up less space than rasterized images, and they look clean and crisp at any size or resolution. XML, which the SVG standard is based upon, might not be the cool kid on the block these days, but it does make SVG really easy to parse and modify from basically any language.
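
As a quick illustration of that last point (not code from the post itself), here is how little it takes to parse and tweak an SVG as XML in the browser; the markup and attribute are made up purely for the example.

```javascript
// Parse an SVG string as XML, modify an attribute, and serialize it back.
const svgMarkup = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="10"/></svg>';
const doc = new DOMParser().parseFromString(svgMarkup, 'image/svg+xml');
doc.querySelector('circle').setAttribute('fill', 'steelblue');
console.log(new XMLSerializer().serializeToString(doc));
```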

Continue reading

Check If A Website or URL Has Been Submitted to StumbleUpon

It can sometimes be a bit difficult to figure out whether a specific URL has been submitted to StumbleUpon yet because they don’t provide an easy way to search through their indexed sites. If you want to check whether your website, or a specific web page, has been submitted to StumbleUpon, simply enter the URL into the widget below to fetch the latest information from StumbleUpon’s index.
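
For the curious, the lookup itself amounts to a single request. The sketch below is a guess at what the widget does; the endpoint and the shape of the JSON response are assumptions, not something this excerpt actually shows.

```javascript
// An assumed lookup against StumbleUpon's public badge endpoint; the
// endpoint and response shape are not confirmed by the post excerpt.
const url = 'https://example.com';
fetch(`https://www.stumbleupon.com/services/1.01/badge.getinfo?url=${encodeURIComponent(url)}`)
  .then((response) => response.json())
  .then((data) => console.log(data.result))
  .catch((error) => console.error('Lookup failed:', error));
```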

Continue reading

Making Chrome Headless Undetectable

Detecting Headless Chrome A short article titled Detecting Chrome Headless popped up on Hacker News over the weekend and it has since been making the rounds. Most of the discussion on Hacker News was focused on the author’s somewhat dubious assertion that web scraping is a “malicious task” that belongs in the same category as advertising fraud and hacking websites. That’s always a fun debate to get into, but what I really took issue with in the article was that it implicitly promoted the idea of blocking users based on browser fingerprinting.
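
For context, the detection article’s tests boil down to checks like the following, reproduced here in spirit rather than verbatim; this is exactly the sort of fingerprinting the excerpt objects to.

```javascript
// A few of the signals the detection article relies on; browsers are a
// moving target, so treat these as illustrative rather than reliable.
const looksHeadless =
  /HeadlessChrome/.test(navigator.userAgent) ||
  navigator.plugins.length === 0 ||
  !navigator.languages || navigator.languages.length === 0;
console.log(looksHeadless ? 'probably headless' : 'probably a normal browser');
```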

Continue reading

Why I still don't use Yarn

But Isn’t Yarn the Best Node Package Manager? If you’re only comparing it to npm, then the answer is unequivocally yes. Yarn is generally much faster than npm and gives you deterministic builds by default, built-in integrity checking, license management tools, and a host of other goodies. Despite all of that, I still usually don’t use Yarn. I avoid it for one simple reason: disk space usage. I feel like a bit of a curmudgeon here, but I find it a little absurd that it can easily take 100 MB, or more, to store a project consisting of a couple hundred lines of JavaScript if you want to use modern tooling.

Continue reading

How to Create a Public Slack Community with Open Invites

We recently created a public Slack community dedicated to web scraping in order to provide a general forum for people to discuss topics related to browser automation, headless browsers, scraping frameworks, data pipelining, or anything else along those lines. We wanted it to be open to anyone who wanted to join, but Slack unfortunately doesn’t really provide any sort of open-access Slack communities or channels. If you want to make your Slack community open to anybody, then your options are to either send invitations to anyone who expresses interest, or to generate shared invite URLs which expire after four weeks.

Continue reading

Using Puppeteer to Scrape Websites with Infinite Scrolling

Infinite scrolling has become a ubiquitous design pattern on the web. Social media sites like Facebook, Twitter, and Instagram all feature infinitely scrolling feeds to keep users engaged with an essentially unbounded amount of content. Here’s what that looks like on Instagram, for example. This mechanism is typically implemented by using JavaScript to detect when the user has scrolled far enough down the existing feed, and then querying an underlying API endpoint for the next batch of data that gets processed and dynamically injected into the page.
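
A minimal sketch of the overall pattern with Puppeteer looks like this; the URL, the item selector, and the fixed scroll count are placeholders, not the full article’s code, which handles loading more carefully.

```javascript
// Sketch: repeatedly scroll to the bottom so the page's own JavaScript
// fetches and injects the next batch, then scrape whatever has loaded.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/feed');

  for (let i = 0; i < 5; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // Crude wait for the next batch; the article does this more robustly.
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }

  const items = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.feed-item')).map((el) => el.textContent)
  );
  console.log(items);

  await browser.close();
})();
```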

Continue reading

Using Webpack to Render Markdown in React Apps

This article is a tutorial explaining how to set up your Webpack configuration for rendering and displaying Markdown documents in React components. Something like this could come in handy if you’re building a home-made static blog engine, or if you’re hoping to easily include some good-looking documentation in a frontend application. Since I tend to write a lot of code blocks in my Markdown documents, a good chunk of the tutorial will be focused on making them look good.
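
The core of such a setup is a single loader rule. Here is a minimal sketch assuming the commonly paired html-loader and markdown-loader packages; the article’s actual loader chain and styling setup may differ.

```javascript
// webpack.config.js (sketch): loaders run right to left, so
// markdown-loader converts .md files to HTML and html-loader then
// exports that HTML as a string importable from a React component.
module.exports = {
  module: {
    rules: [
      {
        test: /\.md$/,
        use: ['html-loader', 'markdown-loader'],
      },
    ],
  },
};
```

With a rule like that in place, a component can import a document directly (for example, import post from './post.md') and render the resulting HTML string with dangerouslySetInnerHTML.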

Continue reading

How to Run a Keras Model in the Browser with Keras.js

This article explains how to export a pre-trained Keras model written in Python and use it in the browser with Keras.js. The main difficulty lies in choosing compatible versions of the packages involved and preparing the data, so I’ve prepared a fully worked out example that goes from training the model to performing a prediction in the browser. You can find the working end-result in Intoli’s article materials repository, but do read on if you’d like just the highlights.
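
On the browser side, the flow looks roughly like the following, based on the Keras.js 1.x README’s API and assuming the KerasJS bundle has already been loaded via a script tag; the filename, input key, and input size are placeholders for whatever the exported model actually expects.

```javascript
// Sketch: load an exported model and run a prediction with Keras.js.
// 'model.bin', the 'input' key, and the 784-element size are placeholders.
const model = new KerasJS.Model({ filepath: 'model.bin', gpu: true });
const input = new Float32Array(784); // fill with preprocessed input data

model
  .ready()
  .then(() => model.predict({ input }))
  .then((outputData) => console.log(outputData))
  .catch((error) => console.error(error));
```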

Continue reading