By Evan Sangaline | June 13, 2017
I’ve been doing a lot of technical writing recently and, with that experience, I’ve grown to more deeply appreciate the writing of others. It’s easy to take the effort behind an article for granted when you’ve grown accustomed to there being new high-quality content posted every day on Hacker News and Twitter. The truth is that a really good article can take days or more to put together and it isn’t easy to write even one article that really takes off, let alone a steady stream of them.
Armed with my newfound admiration for people who create consistently high-quality content, I’ve been making an effort to keep track of any particularly good blogs that I come across and to read their new articles as they come out. Heck, I might even start using an RSS reader again. Is Google’s still the best one out there?
My blog collection had been growing slowly and steadily, but, as a participation-trophy-carrying millennial, that simply wasn’t good enough for me. I wanted to find the real crème de la crème and I needed to go bigger in order to do that. So I turned to one of my favorite datasets, the Hacker News submission history, and set out to compile a list of the best blogs out there by the most objective metrics that I could come up with. Neither my personal blog nor the Intoli blog made the cut, so there are obviously still some bugs to work out, but overall I’ve been really pleased with the results. Just kidding… mine totally made it. I’ve discovered a bunch of new (to me) writers and have had a lot of fun with the analysis behind the project.
Limiting the Data to Blogs
The raw dataset is a collection of stories submitted to Hacker News as obtained from the official Hacker News API. There were a total of 2.3 million stories linking to external URLs and, in addition to the URL, each story’s score, submission time, submitter, title, and number of comments are also known. Most of these stories, however, are not actually from blogs. In fact, only about 12.6% of them are.
To proceed with the analysis, I had to somehow make a determination of which stories were or weren’t blog submissions.
I did this by filtering on URL paths beginning with a few common prefixes:
I also included domains that started with blog. and I added special handling of articles on Medium (medium.com/@username/) because they were so common.
This approach turned up a total of 56,197 distinct blogs and 291,801 stories, which were used as the basis for the analysis. I know that there are some excellent blogs out there with more exotic URL schemes that unfortunately slipped through the cracks. I apologize if your excellent blog was one of those, but manual inspection showed that this filter was enough to identify the vast majority of blog submissions.
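A filter along these lines could be sketched as follows. This is only an illustration: the `/blog/` path prefix is a hypothetical stand-in rather than the actual prefix list used for the analysis, and `blog_key` is a name made up for this example.

```python
from urllib.parse import urlparse

# Hypothetical prefix for illustration only; not the actual list used in the analysis.
BLOG_PATH_PREFIXES = ("/blog/",)


def blog_key(url, path_prefixes=BLOG_PATH_PREFIXES):
    """Return an identifier for the blog a URL belongs to, or None if it doesn't look like a blog."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path

    # Medium articles: treat each medium.com/@username/ as its own blog.
    if host in ("medium.com", "www.medium.com"):
        parts = path.split("/")
        if len(parts) >= 3 and parts[1].startswith("@") and len(parts[1]) > 1:
            return f"medium.com/{parts[1]}"
        return None

    # Domains that start with "blog."
    if host.startswith("blog."):
        return host

    # Paths that begin with one of the configured prefixes.
    for prefix in path_prefixes:
        if path.startswith(prefix):
            return host + prefix.rstrip("/")

    return None
```

Grouping stories by the returned key, and dropping the ones that come back as `None`, would then give one bucket of submissions per blog.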
The “Best” Blogs
So the concept of “best” is obviously a bit subjective. There are some metrics that we can gather from the Hacker News data that seem definitively good though. I would say that blogs with more total articles, larger fractions of articles making the front page, and higher mean/median/maximum scores are generally going to be better. This is of course somewhat equating success on Hacker News with quality, but, given the dataset, this is all we really have to go on.
Even after deciding on these generally positive metrics, the question of how to combine them still remains. It turns out that there’s a fairly well-defined way to determine the “best” configurations of these distinct metrics in some sense: selecting the Pareto optimal blogs. Pareto optimality is a concept that I think is most easily demonstrated graphically. Let’s take a look at the different configurations of the number of articles a blog has produced and the fraction of them that make the front page (using scores greater than or equal to ten as an approximation for “making the front page”).
Each point on this plot represents a single blog and the larger green points are the Pareto optimal blogs which make up the Pareto frontier. The Pareto optimal blogs are the ones for which there exist no other blogs that are strictly better, where strictly better means better in at least one metric while being worse in none of them. You can see that for every non-Pareto optimal blog you can find a strictly better second blog that does better in one of the metrics and is either equivalent or better in the second metric as well. For example, a blog with 100 articles that make the front page 50% of the time would be strictly better than one that had 99 articles and made the front page 50% of the time.
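The definition translates almost directly into code. Here is a minimal sketch, assuming every metric is "larger is better" and that each blog is represented by a tuple of metric values; the function names and the sample data are made up for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b in every metric and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(blogs):
    """Return the blogs that no other blog dominates.

    blogs maps a blog name to a tuple of metrics where larger is better.
    """
    return {
        name: metrics
        for name, metrics in blogs.items()
        if not any(dominates(other, metrics) for other in blogs.values())
    }


# Made-up data: (total articles, front page fraction)
example_blogs = {
    "a": (100, 0.5),
    "b": (99, 0.5),   # dominated by "a": fewer articles, same fraction
    "c": (10, 0.9),   # fewer articles than "a" but a higher fraction
    "d": (5, 0.2),    # dominated by both "a" and "c"
}
frontier = pareto_frontier(example_blogs)
```

With this data the frontier contains only "a" and "c". The pairwise comparison is quadratic in the number of blogs, which is fine for a sketch but would be slow on tens of thousands of blogs without some pruning.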
We can also look at other pairs of metrics and see similar trade-offs in their relative importance along the Pareto frontier. For example, here we can see the blogs that are Pareto optimal with respect to the average and maximum scores of their submissions.
The linear patterns radiating out from the center here correspond to small integer numbers of total articles. The maximum score will exactly equal the average score for blogs with only one submission, will approximately equal twice the average score for blogs with two submissions where only one did well, etc. This particular Pareto frontier is going to be dominated by blogs with a small number of submissions, but this won’t be an issue when we find the frontier in a higher dimensional metric space including the number of articles.
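The linear patterns can be made explicit with a little algebra. As a sketch, assume a blog has n submissions with scores s_1 through s_n, and that every submission other than the top one scored close to zero (the symbols are notation introduced here for illustration):

```latex
\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s_i \approx \frac{s_{\max}}{n}
\quad\Longrightarrow\quad
s_{\max} \approx n\,\bar{s}
```

For n = 1 the relation is exact, and each larger n traces out its own line through the origin.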
The Pareto frontier will contain the best possibilities for any weighting of the relative importance of each metric. This means that it will also include some metric weightings that are probably quite far from what anybody would consider a reasonable balance. The blog with the most submissions is Pareto optimal but nearly all of the entries get no upvotes and many are flagged as spam. The poor guy had one submission get 80 votes eight years ago and he’s been chasing the dream ever since (2104 times to be precise). Similarly, the blog that has the highest scoring submission on Hacker News is certainly notable for breaking an extremely important story that garnered 4107 votes but is otherwise fairly unremarkable.
To get around this issue, I added an additional restriction that all blogs must be at least average in each metric: total number of articles, fraction with score greater than or equal to ten, average score, median score, and maximum article score. This eliminates some of the more extreme configurations while otherwise leaving the Pareto frontier unchanged. The means and standard deviations for each metric are shown in the table that follows.
|Metric||Mean||Standard Deviation|
|Front Page Fraction||0.13||0.28|
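The five metrics and the at-least-average restriction could be computed roughly like this. It is a sketch under assumptions: the input is taken to be a list of story scores per blog, the function names are invented for this example, and the sample scores are made up.

```python
from statistics import mean, median

FRONT_PAGE_SCORE = 10  # proxy for "making the front page"


def blog_metrics(scores):
    """Summarize one blog's list of story scores as the five metrics."""
    return {
        "total_articles": len(scores),
        "front_page_fraction": sum(s >= FRONT_PAGE_SCORE for s in scores) / len(scores),
        "average_score": mean(scores),
        "median_score": median(scores),
        "maximum_score": max(scores),
    }


def at_least_average(metrics_by_blog):
    """Keep only blogs that meet or exceed the across-blog mean in every metric."""
    names = list(next(iter(metrics_by_blog.values())))
    means = {m: mean(b[m] for b in metrics_by_blog.values()) for m in names}
    return {
        blog: metrics
        for blog, metrics in metrics_by_blog.items()
        if all(metrics[m] >= means[m] for m in names)
    }


# Made-up scores for four hypothetical blogs.
scores_by_blog = {
    "steady": [12, 30, 8, 45, 20, 15, 60, 25],
    "one-hit": [40],
    "spammy": [1] * 10,
    "quiet": [2, 3],
}
metrics_by_blog = {blog: blog_metrics(s) for blog, s in scores_by_blog.items()}
survivors = at_least_average(metrics_by_blog)
```

Only "steady" survives here: "one-hit" falls short on total articles, while "spammy" and "quiet" never reach the front page.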
After applying this minimum, the Pareto frontier consisted of 17 blogs which are shown here in order of those with the most to those with the fewest articles.
|Blog||Total Articles||Front Page Fraction||Average Score||Median Score||Maximum Score|
Unsurprisingly, the official Y Combinator blog and Sam Altman’s blog both make the cut. Several of the company blogs also belong to YC companies (e.g. Stripe, RethinkDB, GiftRocket, DataNitro, and GazeHawk). Don’t get me wrong, these blogs all contain a lot of excellent material, but it’s interesting to note that at least 41% of the Pareto optimal blogs have some YC affiliation. In the words of @dang, “Meta is basically crack,” and it seems likely that some of these blogs have received a little extra love due to their affiliation.
The ones that are most interesting to me are the personal blogs and, in particular, the ones that are new to me. A few standouts that I’m sure I’ll be coming back to are josephg.com/blog, www.catonmat.net/blog, and http://www.daemonology.net/blog/. Most of the others are either no longer updated or currently point to broken links.
I was honestly really hoping for some more personal blogs and ones that I hadn’t heard of. In order to eliminate some of the more well-known blogs, I decided to run the same optimization procedure on only the blogs which have been submitted by three or fewer distinct users on Hacker News. This resulted in 25 Pareto optimal blogs that do indeed seem to tend more towards niche and personal blogs.
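The distinct-submitter restriction is simple to express in code. A minimal sketch, assuming the stories are available as (blog, submitter) pairs; `niche_blogs` is a made-up name for this example.

```python
from collections import defaultdict


def niche_blogs(stories, max_submitters=3):
    """Return the blogs submitted by at most max_submitters distinct users.

    stories is an iterable of (blog, submitter) pairs.
    """
    submitters = defaultdict(set)
    for blog, submitter in stories:
        submitters[blog].add(submitter)
    return {blog for blog, users in submitters.items() if len(users) <= max_submitters}
```

Repeat submissions by the same user only count once, so a prolific self-submitter still qualifies as long as few other people share their posts.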
|Blog||Total Articles||Front Page Fraction||Average Score||Median Score||Maximum Score|
I don’t know who that sangaline.com fellow is, but he sure sounds handsome. In all seriousness, I do most of my writing on the Intoli blog these days, and if you want to read more of it then feel free to come by any time. We also have an RSS feed and a monthly digest newsletter of new articles if those are more your style.
As for the other blogs on this second list, there are a few flops and dead links in there but also some real finds. Aaron Randall’s and Ben Cox’s blogs are simply awesome. ML@B, Adrian Sampson, Matt Greer, Mina Naguib, and Jimmy Breck-McKye are also very good. Some of the smaller company blogs also seem to have a lot of great archived content: goodfil.ms/blog, cam.ly/blog/, blog.directededge.com/.
I’m pretty happy with the results overall. I was after a few new high-quality blogs to follow and I certainly found some that I really, really like. It’s a little tricky to balance looking for blogs that you haven’t heard of with wanting to find big and popular blogs, but there were easily at least five new blogs that I’ll definitely be following now. I hope that some other people out there found one or two that are new to them as well!
Oh… and, as always, feel free to get in touch with us if you’re looking to get some help with your own data sourcing, aggregation, or processing. We love working on unique problems and would be happy to chat about whatever it is that you’re working on!