Browser automation frameworks–like Puppeteer, Selenium, Marionette, and Nightmare.js–strive to provide rich APIs for configuring and interacting with web browsers. These generally work quite well, but you’re inevitably going to end up running into API limitations if you do a lot of testing or web scraping. You might find yourself wanting to conceal the fact that you’re using a headless browser, extract image resources from a web page, set the seed for Math.
Have a great idea for an article?
We're always looking for guest contributors or article suggestions. Shoot us an email at firstname.lastname@example.org, we would love to hear yours!
In this guest post, Mattia Ciollaro writes about how to get started with the Luigi task runner, and highlights his own contribution to Luigi through a special use case geared towards running Jupyter notebooks in your workflows. Mattia holds a PhD in statistics from Carnegie Mellon University and is working on improving American healthcare at Spreemo Health. You can get in touch with him via LinkedIn. Data Science meets plumbing In many data science projects, we often start by developing code to solve specific small tasks.
What’s so dangerous about pickles? Those pickles are very dangerous pickles. I literally can’t begin to tell you how really dangerous they are. You have to trust me on that. It’s important, Ok? – “Explosive Disorder” by Pan Telare Before we get elbow deep in opcodes here, let’s cover a little background. The Python standard library has a module called pickle that is used for serializing and deserializing objects.
If you work with data in Python, chances are that you’ve heard of the pandas data manipulation library. You can think of pandas as a way to programmatically interact with spreadsheets. It works well with huge datasets, unlike its desktop counterparts like Google Sheets and Microsoft Excel, and implements a number of common database operations like merging, pivoting, and grouping. Moreover, being backed by numpy and efficient algorithm implementations makes it fast and easily integrated with other tools in the vast Python data science landscape.
One Million robots.txt Files The idea for this article actually started as a joke. We do a lot of web scraping here at Intoli and we deal with robots.txt files, overzealous ip bans, and all that jazz on a daily basis. A while back, I was running into some issues with a site that had a robots.txt file which was completely inconsistent with their banning policies, and I suggested that we should do an article on analyzing robots.
There’s a First Time for Everything Like some 75 million other Americans, I am playing fantasy football this year. Unlike most of them, I know virtually nothing about football. I would estimate that I’ve watched somewhere around five games total in my life, most of them Super Bowls. I don’t know the rules beyond the very basics and I can’t name a single NFL player off the top of my head.
This article was originally published as a guest post on ScrapingHub’s blog. ScrapingHub is the company that wrote Scrapy, which this article is about, so read on to see why they liked it! Introduction The Steam game store is home to more than ten thousand games and just shy of four million user-submitted reviews. While all kinds of Steam data are available either through official APIs or other bulk-downloadable data dumps, I could not find a way to download the full review dataset.
Detecting Headles Chrome A short article titled Detecting Chrome Headless popped up on Hacker News over the weekend and it has since been making the rounds. Most of the discussion on Hacker News was focused around the author’s somewhat dubious assertion that web scraping is a “malicious task” that belongs in the same category as advertising fraud and hacking websites. That’s always a fun debate to get into, but the thing that I really took issue with about the article was that it implicitly promoted the idea of blocking users based on browser fingerprinting.
Update: This article is regularly updated in order to accurately reflect improvements in Firefox’s headless browsing capabilities. Note: Check out Running Selenium with Healdess Chrome if you’d rather use Google’s browser. Using Selenium with Headless Firefox (on Windows) Ever since Chrome implemented headless browsing support back in April, the other major browsers started following suit. In particular, Mozilla has since then expanded support for Firefox’s headless mode from Linux to its Windows and macOS builds, and fixed a number of bugs that might have been in the way of early adopters.
UPDATE: This article is updated regularly to reflect the latest information and versions. If you’re looking for instructions then skip ahead to see Setup Instructions. NOTE: Be sure to check out Running Selenium with Headless Chrome in Ruby if you’re interested in using Selenium in Ruby instead of Python. Background It has long been rumored that Google uses a headless variant of Chrome for their web crawls. Over the last two years or so it had started looking more and more like this functionality would eventually make it into the public releases and, as of this week, that has finally happened.