Intoli Blog

Have a great idea for an article?

We're always looking for guest contributors or article suggestions. Shoot us an email at blog@intoli.com; we would love to hear yours!

Building Data Science Pipelines with Luigi and Jupyter Notebooks

In this guest post, Mattia Ciollaro writes about how to get started with the Luigi task runner and highlights his own contribution to Luigi through a special use case geared towards running Jupyter notebooks in your workflows. Mattia holds a PhD in statistics from Carnegie Mellon University and is working on improving American healthcare at Spreemo Health. You can get in touch with him via LinkedIn. Data Science meets plumbing: In many data science projects, we start by developing code to solve small, specific tasks.

Continue reading

Dangerous Pickles — Malicious Python Serialization

What’s so dangerous about pickles? “Those pickles are very dangerous pickles. I literally can’t begin to tell you how really dangerous they are. You have to trust me on that. It’s important, Ok?” – “Explosive Disorder” by Pan Telare. Before we get elbow-deep in opcodes here, let’s cover a little background. The Python standard library has a module called pickle that is used for serializing and deserializing objects.
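
As a minimal sketch of what serializing and deserializing with pickle looks like, the round trip below may help. The `record` dictionary is a hypothetical example object and is not taken from the article.

```python
import pickle

# Hypothetical example object; any picklable Python object would do.
record = {"name": "example", "values": [1, 2, 3]}

# Serialize the object to a bytes payload.
payload = pickle.dumps(record)

# Deserialize the payload back into a new, equal object.
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored == record)
```

Note that `pickle.loads` should only be called on data from a trusted source, since unpickling can execute arbitrary code.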

Continue reading

A Brief Tour of Grouping and Aggregating in Pandas

If you work with data in Python, chances are that you’ve heard of the pandas data manipulation library. You can think of pandas as a way to programmatically interact with spreadsheets. Unlike spreadsheet applications such as Google Sheets and Microsoft Excel, it works well with huge datasets, and it implements a number of common database operations like merging, pivoting, and grouping. Moreover, being backed by NumPy and efficient algorithm implementations makes it fast and easy to integrate with other tools in the vast Python data science landscape.
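
To give a rough idea of the grouping operation mentioned above, here is a small sketch. It assumes pandas is installed, and the `team` and `points` columns are made-up illustration data, not from the article.

```python
import pandas as pd

# Made-up illustration data: one row per observation.
df = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b", "b"],
        "points": [1, 2, 3, 4, 5],
    }
)

# Group rows by team, then aggregate the points column two ways.
totals = df.groupby("team")["points"].agg(["sum", "mean"])

print(totals)
```

The result is a DataFrame indexed by `team` with one column per aggregation.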

Continue reading

Designing The Wayback Machine Loading Animation

SVGs and Wayback Machine Logos: I’ve gotta say, I’m a big fan of Scalable Vector Graphics (SVG). They’re an open standard, they’re supported by all major browsers, they often take up less space than rasterized images, and they look clean and crisp at any size or resolution. XML, which the SVG standard is based upon, might not be the cool kid on the block these days, but it does make SVG really easy to parse and modify from basically any language.

Continue reading

Analyzing One Million robots.txt Files

One Million robots.txt Files: The idea for this article actually started as a joke. We do a lot of web scraping here at Intoli and we deal with robots.txt files, overzealous IP bans, and all that jazz on a daily basis. A while back, I was running into some issues with a site that had a robots.txt file which was completely inconsistent with their banning policies, and I suggested that we should do an article on analyzing robots.

Continue reading

Check If A Website or URL Has Been Submitted to StumbleUpon

It can sometimes be a bit difficult to figure out whether a specific URL has been submitted to StumbleUpon yet because they don’t provide an easy way to search through their indexed sites. If you’re trying to figure out if your website, or a specific web page, has been submitted to StumbleUpon, then simply enter the URL into the widget below to fetch the latest information from StumbleUpon’s index.

Continue reading

Fantasy Football for Hackers

There’s a First Time for Everything: Like some 75 million other Americans, I am playing fantasy football this year. Unlike most of them, I know virtually nothing about football. I would estimate that I’ve watched somewhere around five games total in my life, most of them Super Bowls. I don’t know the rules beyond the very basics and I can’t name a single NFL player off the top of my head.

Continue reading

Installing Google Chrome On CentOS, Amazon Linux, or RHEL

Modifying the Official Google Chrome RPM to Run on Amazon Linux and CentOS 6: CentOS, Amazon Linux AMI, and Red Hat Enterprise Linux are three closely related GNU/Linux distributions which are all popular choices for server installations. They offer excellent performance and stability, but package availability can often be lacking. The Extra Packages for Enterprise Linux (EPEL), a community-maintained repository of additional packages, significantly improves the situation, but doesn’t include Google Chrome/Chromium or a lot of other software that you would expect on more desktop-oriented distributions.

Continue reading

How Are Principal Component Analysis and Singular Value Decomposition Related?

Introduction: Principal Component Analysis, or PCA, is a well-known and widely used technique with applications such as dimensionality reduction, data compression, feature extraction, and visualization. The basic idea is to project a dataset from many correlated coordinates onto fewer uncorrelated coordinates called principal components while still retaining most of the variability present in the data. Singular Value Decomposition, or SVD, is a computational method often employed to calculate principal components for a dataset.
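
The relationship described above can be sketched numerically as follows. This is only an illustration under stated assumptions: it uses NumPy, the data matrix `X` is randomly generated, and the variable names are hypothetical rather than taken from the article.

```python
import numpy as np

# Randomly generated, correlated example data (an assumption for illustration):
# 200 samples, 3 features.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# PCA works on mean-centered data.
Xc = X - X.mean(axis=0)
n_samples = Xc.shape[0]

# SVD of the centered data: Xc = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; the squared singular values,
# scaled by (n - 1), are the variances along those directions.
components = Vt
explained_variance = s**2 / (n_samples - 1)

# Cross-check against the eigenvalues of the sample covariance matrix
# (eigvalsh returns them in ascending order, so reverse them).
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Projecting the data onto the principal directions gives the scores.
scores = Xc @ Vt.T

print(np.allclose(explained_variance, eigvals))
print(np.allclose(scores, U * s))
```

Both checks compare the SVD route with the covariance route, which is the sense in which SVD "calculates" the principal components.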

Continue reading

Scraping User-Submitted Reviews from the Steam Store

This article was originally published as a guest post on ScrapingHub’s blog. ScrapingHub is the company that wrote Scrapy, which this article is about, so read on to see why they liked it! Introduction: The Steam game store is home to more than ten thousand games and just shy of four million user-submitted reviews. While all kinds of Steam data are available either through official APIs or other bulk-downloadable data dumps, I could not find a way to download the full review dataset.

Continue reading