Exodus started with one simple goal in mind: to make it as easy as possible for a user to relocate working binaries from one Linux machine to another. For example, say that your laptop has a more recent version of gzip than what’s available through your server’s package manager, but that you really want to use a command-line flag that the older version doesn’t support. exodus gzip | ssh intoli.
There’s a lot to love about CircleCI. First of all, continuous integration is just awesome in general. You can certainly develop fine software without it, but a good CI configuration can really make your life easier. Beyond that, CircleCI has a generous free tier, provides four free containers per open source project, allows the use of custom Docker images, and is reasonably easy to configure. There’s unfortunately also some stuff not to love about CircleCI.
A few months back, I wrote a popular article called Making Chrome Headless Undetectable in response to one called Detecting Chrome Headless by Antoine Vastel. The one thing that I was really trying to get across in that article is that blocking site visitors based on browser fingerprinting is an extremely user-hostile practice. There are simply so many variations in browser configurations that you’re inevitably going to end up blocking non-automated access to your website and, on top of that, you’re really not accomplishing anything in terms of blocking sophisticated web scrapers.
Browser automation frameworks (like Puppeteer, Selenium, Marionette, and Nightmare.js) strive to provide rich APIs for configuring and interacting with web browsers. These generally work quite well, but you’re inevitably going to end up running into API limitations if you do a lot of testing or web scraping. You might find yourself wanting to conceal the fact that you’re using a headless browser, extract image resources from a web page, set the seed for Math.
In this post, I will highlight a few ways to save images while scraping the web through a headless browser. The simplest solution would be to extract the image URLs from the headless browser and then download them separately, but what if that’s not possible? Perhaps the images you need are generated dynamically or you’re visiting a website which only serves images to logged-in users. Maybe you just don’t want to put unnecessary strain on their servers by requesting the image multiple times.
In this guest post, Mattia Ciollaro writes about how to get started with the Luigi task runner, and highlights his own contribution to Luigi through a special use case geared towards running Jupyter notebooks in your workflows. Mattia holds a PhD in statistics from Carnegie Mellon University and is working on improving American healthcare at Spreemo Health. You can get in touch with him via LinkedIn.
Data Science meets plumbing
In many data science projects, we often start by developing code to solve specific small tasks.
What’s so dangerous about pickles?
Those pickles are very dangerous pickles. I literally can’t begin to tell you how really dangerous they are. You have to trust me on that. It’s important, Ok?
– “Explosive Disorder” by Pan Telare
Before we get elbow deep in opcodes here, let’s cover a little background. The Python standard library has a module called pickle that is used for serializing and deserializing objects.
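To make the background concrete, here is a minimal sketch of a pickle round trip using only the standard library. The `record` dictionary is a made-up example value, not something from the article itself.

```python
import pickle

# Hypothetical example object; any picklable Python value would do.
record = {"name": "gzip", "flags": ["-k", "-9"], "installed": True}

# Serialize the object to a bytes payload...
payload = pickle.dumps(record)

# ...and deserialize it back into an equal (but distinct) object.
restored = pickle.loads(payload)

print(type(payload).__name__)  # the payload is a bytes object
print(restored == record)      # the round trip preserves the value
```

Note that `pickle.loads` should only ever be called on data you trust, since unpickling can execute arbitrary code. That is where the danger comes from.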
If you work with data in Python, chances are that you’ve heard of the pandas data manipulation library. You can think of pandas as a way to programmatically interact with spreadsheets. Unlike desktop counterparts such as Google Sheets and Microsoft Excel, it works well with huge datasets, and it implements a number of common database operations like merging, pivoting, and grouping. Moreover, being backed by numpy and efficient algorithm implementations makes it fast and easy to integrate with other tools in the vast Python data science landscape.
SVGs and Wayback Machine Logos
I’ve gotta say, I’m a big fan of Scalable Vector Graphics (SVG). They’re an open standard, they’re supported by all major browsers, they often take up less space than rasterized images, and they look clean and crisp at any size or resolution. XML, which the SVG standard is based upon, might not be the cool kid on the block these days, but it does make SVG really easy to parse and modify from basically any language.
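As a rough illustration of how approachable SVG is from a general-purpose language, the sketch below parses and edits a tiny SVG with Python’s standard `xml.etree.ElementTree` module. The `svg_source` string is an invented example, not one of the logos discussed in the article.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the default namespace unprefixed when serializing back to text.
ET.register_namespace("", SVG_NS)

# Hypothetical input: a single red circle on a 100x100 canvas.
svg_source = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="50" cy="50" r="40" fill="red"/>'
    "</svg>"
)

root = ET.fromstring(svg_source)

# Elements live in the SVG namespace, so the lookup must be qualified.
circle = root.find(f"{{{SVG_NS}}}circle")
circle.set("fill", "blue")

modified = ET.tostring(root, encoding="unicode")
print(modified)
```

The same pattern of parsing, finding an element, changing an attribute, and serializing carries over to most XML libraries in other languages.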
One Million robots.txt Files
The idea for this article actually started as a joke. We do a lot of web scraping here at Intoli and we deal with robots.txt files, overzealous IP bans, and all that jazz on a daily basis. A while back, I was running into some issues with a site that had a robots.txt file which was completely inconsistent with their banning policies, and I suggested that we should do an article on analyzing robots.