Intoli Blog

Have a great idea for an article?

We're always looking for guest contributors or article suggestions. Shoot us an email at blog@intoli.com, we would love to hear yours!

What's New in Exodus 2.0

Exodus started with one simple goal in mind: to make it as easy as possible for a user to relocate working binaries from one Linux machine to another. For example, say that your laptop has a more recent version of gzip than what’s available though your server’s package manager, but that you really want to use a command-line flag that the older version doesn’t support. exodus gzip | ssh intoli.

Continue reading

Extending CircleCI's API with a Custom Microservice on AWS Lambda

There’s a lot to love about CircleCI. First of all, continuous integration is just awesome in general. You can certainly develop fine software without it, but a good CI configuration can really make your life easier. Beyond that, CircleCI has a generous free tier, provides four free containers per open source project, allows the use of custom Docker images, and is reasonably easy to configure. There’s unfortunately also some stuff not to love about CircleCI.

Continue reading

It is *not* possible to detect and block Chrome headless

A few months back, I wrote a popular article called Making Chrome Headless Undetectable in response to one called Detecting Chrome Headless by Antione Vastel. The one thing that I was really trying to get across in writing that is that blocking site visitors based on browser fingerprinting is an extremely user-hostile practice. There are simply so many variations in browser configurations that you’re inevitably going to end up blocking non-automated access to your website, and–on top of that–you’re really not accomplishing anything in terms of blocking sophisticated web scrapers.

Continue reading

JavaScript Injection with Selenium, Puppeteer, and Marionette in Chrome and Firefox

Browser automation frameworks–like Puppeteer, Selenium, Marionette, and Nightmare.js–strive to provide rich APIs for configuring and interacting with web browsers. These generally work quite well, but you’re inevitably going to end up running into API limitations if you do a lot of testing or web scraping. You might find yourself wanting to conceal the fact that you’re using a headless browser, extract image resources from a web page, set the seed for Math.

Continue reading

Saving Images from a Headless Browser

In this post, I will highlight a few ways to save images while scraping the web through a headless browser. The simplest solution would be to extract the image URLs from the headless browser and then download them separately, but what if that’s not possible? Perhaps the images you need are generated dynamically or you’re visiting a website which only serves images to logged-in users. Maybe you just don’t want to put unnecessary strain on their servers by requesting the image multiple times.

Continue reading

Building Data Science Pipelines with Luigi and Jupyter Notebooks

In this guest post, Mattia Ciollaro writes about how to get started with the Luigi task runner, and highlights his own contribution to Luigi through a special use case geared towards running Jupyter notebooks in your workflows. Mattia holds a PhD in statistics from Carnegie Mellon University and is working on improving American healthcare at Spreemo Health. You can get in touch with him via LinkedIn. Data Science meets plumbing In many data science projects, we often start by developing code to solve specific small tasks.

Continue reading

Dangerous Pickles — Malicious Python Serialization

What’s so dangerous about pickles? Those pickles are very dangerous pickles. I literally can’t begin to tell you how really dangerous they are. You have to trust me on that. It’s important, Ok? – “Explosive Disorder” by Pan Telare Before we get elbow deep in opcodes here, let’s cover a little background. The Python standard library has a module called pickle that is used for serializing and deserializing objects.

Continue reading

A Brief Tour of Grouping and Aggregating in Pandas

If you work with data in Python, chances are that you’ve heard of the pandas data manipulation library. You can think of pandas as a way to programmatically interact with spreadsheets. It works well with huge datasets, unlike its desktop counterparts like Google Sheets and Microsoft Excel, and implements a number of common database operations like merging, pivoting, and grouping. Moreover, being backed by numpy and efficient algorithm implementations makes it fast and easily integrated with other tools in the vast Python data science landscape.

Continue reading

Designing The Wayback Machine Loading Animation

SVGs and Wayback Machine Logos I’ve gotta say, I’m a big fan of Scalable Vector Graphics (SVG). They’re an open standard, they’re supported by all major browsers, they often take up less space than rasterized images, and they look clean and crisp at any size or resolution. XML, which the SVG standard is based upon, might not be the cool kid on the block these days, but it does make SVG really easy to parse and modify from basically any language.

Continue reading

Analyzing One Million robots.txt Files

One Million robots.txt Files The idea for this article actually started as a joke. We do a lot of web scraping here at Intoli and we deal with robots.txt files, overzealous ip bans, and all that jazz on a daily basis. A while back, I was running into some issues with a site that had a robots.txt file which was completely inconsistent with their banning policies, and I suggested that we should do an article on analyzing robots.

Continue reading