How F5Bot Slurps All of Reddit

By Lewis Van Winkle | July 30, 2018

In this guest post, Lewis Van Winkle talks about F5Bot, a free service that emails you when selected keywords are mentioned on Reddit, Hacker News, or Lobsters. He explains in detail how F5Bot is able to process millions of comments and posts from Reddit every day on a single VPS. You can check out more of Lewis Van Winkle’s writing at codeplea.com, and his open source contributions at github.com/codeplea.

I run a free service, F5Bot. The basic premise is that you enter a few keywords to monitor, and it’ll email you whenever those keywords show up on Reddit or Hacker News. I’m hoping to add a few more sites soon.

This post is going to be about how I built the service. Specifically, we’re going to focus on scraping every post and comment from Reddit in real time.

F5Bot is written in PHP. It’s programmed in a very boring, straightforward manner. I’m not doing anything special, really, but it does manage to scan every Reddit post and comment in real time. And it does it all on a tiny low-end VPS. Sometimes people are surprised when I tell them this. “Reddit has huge volume,” they say, “Reddit runs a room full of supercomputers, and they still have downtime like every day! How do you do it?” Well, I’m going to tell you.

Reddit gets about 500,000 new posts and 4,000,000 new comments every day. Combined, that’s about 50 per second (4,500,000 / 86,400 ≈ 52). The posts don’t come in at a constant rate, though, so we’d better be able to handle a few hundred a second. You may think PHP is slow, but it can definitely handle that. I mean, how long does it take you to search a few hundred files on your computer? So really, the key thing here is that we need to write the program in a careful way. We can’t waste resources, but if we’re careful, we’ll be able to handle the traffic.

So let’s get started!

Reddit’s Listing API

Reddit offers a nice JSON API. There’s some good documentation here. I’m pretty sure their site uses it internally to render the HTML pages for visitors. So we’ll just use that to grab the most recent posts. It’s at https://www.reddit.com/r/all/new/.json, and I recommend you open that up in your browser to follow along.

In PHP we’ll just grab it with a simple curl request and then decode it with json_decode(). (As a side note, I’m going to completely skip over error handling because it’s so boring. I mean, it’s the most important part of running an actual, live server, but it’s boring, so I won’t talk about it anymore here.)

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'testbot2000'); /* Don't forget to name yourself! */
$posts = curl_exec($ch);
curl_close($ch);
$posts = json_decode($posts, true);
print_r($posts);

Here’s an example showing the basic structure of the JSON you’ll get back. I’ve trimmed it down by leaving only 2 of the 25 posts and removing most of the key/value pairs (there are a lot of them).

{
  "kind": "Listing",
  "data": {
    "modhash": "5b1dwrfbn4a9f8c200824114ecc7110ee2baa4d95dc2c106b9",
    "dist": 25,
    "children": [
      {
        "kind": "t3",
        "data": {
          "approved_at_utc": null,
          "subreddit": "Guitar",
          "selftext": "I practice my guitar every day.   Most every day, I ...",
          "user_reports": [],
          "saved": false,
          "mod_reason_title": null,
          "gilded": 0,
          "clicked": false,
          "title": "[DISCUSSION] Thanks to everyone here",
          "subreddit_name_prefixed": "r/Guitar",
          "hidden": false,
          "id": "91sed5",
          /* Many more key/value pairs. */
        }
      },
      {
        "kind": "t3",
        "data": {
          "approved_at_utc": null,
          "subreddit": "gpdwin",
          "selftext": "What would be the cheapest way to purchase a unit in ...",
          "user_reports": [],
          "saved": false,
          "mod_reason_title": null,
          "gilded": 0,
          "clicked": false,
          "title": "Looking for advice on buying a WIN 2 in the UK",
          "subreddit_name_prefixed": "r/gpdwin",
          "hidden": false,
          "id": "91secm",
          /* Many more key/value pairs. */
        }
      },
      /* Many more post objects. */
    ],
    "after": "t3_91secm",
    "before": null
  }
}

Depending on how you have PHP set up, curl may have trouble with the secure HTTPS connection. If that’s the case, you can bypass it with curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false).

Now we have all the recent posts in $posts as a nice, clean PHP array. This is going to be a short article; we’re pretty much done.

But wait, how many posts did we get? Since I only want my scraper to run every few minutes, it had better be a lot.

print("Downloaded " . count($posts['data']['children']) . " posts.\n");
Downloaded 25 posts.

Well, that’s not going to work. I only want to run my scraper every few minutes, not constantly, so we’d better get at least a minute’s worth of posts. That’s about 300 posts; let’s grab 1,000 just to be safe. We can just add ?limit=1000 to the end of the URL.

...
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json?limit=1000");
...
print("Downloaded " . count($posts['data']['children']) . " posts.\n");
Downloaded 100 posts.

Turns out that Reddit has a limit. It’ll only show you 100 posts at a time.

There’s an easy solution though. If you look closely at the JSON you’ll see that it has an after field with an ID. That’s the fullname of the last (oldest) post in this listing. So we’re saved, really. We can just call the API again and pass that ID as the after parameter to get the next 100 older posts!

$after = $posts['data']['after'];
...
curl_setopt($ch, CURLOPT_URL, "https://api.reddit.com/api/info.json?limit=100&id=$after");
...

This works well up to a point, and that point is about 1,000 posts. After that it will either loop back to the beginning and start showing you posts you’ve already seen, or it will just stop returning anything. I’m not a big fan of that, because if my scraper has a little downtime I’d like it to be able to go back and grab the old posts too, but maybe you’re not worried about that.

Now everything I’ve said here about posts is also true of comments. The only difference is that the new ones are at https://www.reddit.com/r/all/comments/.json instead of https://www.reddit.com/r/all/new/.json, and they come in much quicker.

But we’ve got an even bigger problem. Each request takes a few seconds (or much longer if Reddit servers are loaded), and that means we can’t pull comments as quickly as they’re coming in. Remember, comments get posted at a rate of 50 per second, and it can be much more during peak traffic.

Making Multiple Simultaneous Connections

So we’ll have to make a bunch of connections all at once. PHP makes this easy enough with its curl_multi functions, but we’re getting ahead of ourselves. If we’re just pulling the listings, how do we know the id to start the next request on if the current request hasn’t finished?

Each Reddit thing, like a post or comment, has a unique ID. Post IDs start with t3_ and comment IDs start with t1_. The API is very inconsistent about whether it uses the prefix or not: some parts of the API require the prefix, while others use the bare ID with no prefix.

The IDs themselves are encoded in base 36, so they use the ten digits 0-9 and the twenty-six letters a-z. In PHP we can use base_convert() to convert them to decimal and do math with them.

So here we get the most recent post ID, and convert it to decimal.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.reddit.com/r/all/new/.json?limit=1");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$posts = curl_exec($ch);
curl_close($ch);

$posts = json_decode($posts, true);
$id = $posts['data']['children'][0]['data']['id'];
$id = base_convert($id,36,10);

From that you can easily do math on $id and then convert it back to Reddit’s format with: base_convert($id,10,36). It’s pretty convenient.
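
For example, here’s a quick illustration (using the post ID from the JSON sample above) of stepping back 100 IDs and converting the result back to Reddit’s format:

/* Quick illustration: step back 100 IDs from a known post. */
$latest  = base_convert('91sed5', 36, 10); /* Post ID from the JSON sample above. */
$earlier = $latest - 100;
echo base_convert($earlier, 10, 36), "\n"; /* Prints "91sead". */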

The best part is all the bandwidth you’re going to save by using base 36 instead of 10! You’re going to need it, too, since almost the entire Reddit API consists of crap you’ll almost certainly never need. I mean, do you really want both subreddit and subreddit_name_prefixed? They’re the same; one just has an “r/” in front of it. In fact, take a look at that JSON again: https://www.reddit.com/r/all/new/.json. Basically I use selftext, subreddit, permalink, url, and title. The other 95% of it is just wasted bandwidth. Actually, over half of the reply is JSON key text, so even if you were going to use every value, you’d be spending more bandwidth on JSON key names than on the actual data. Oh well, I guess it’s still better than XML.
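
If you do want to shed most of that dead weight once the JSON is decoded, something like this works. It’s just an illustration of the idea, not F5Bot’s actual code, and it assumes $posts holds a decoded listing like the ones above:

/* Keep only the handful of fields we care about and drop the rest. */
$wanted  = array('selftext', 'subreddit', 'permalink', 'url', 'title');
$trimmed = array();
foreach ($posts['data']['children'] as $child) {
  $trimmed[] = array_intersect_key($child['data'], array_flip($wanted));
}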

Anyway, now we can manipulate Reddit IDs, and it turns out that both comments and posts are assigned IDs more or less serially. They’re not strictly monotonic in the short term, but they are in the long term (i.e., a post from an hour ago will have a lower ID than a post made right now, but a post from two seconds ago might not).

So we can find where to start each API call by simply subtracting 100 from a starting ID. Using this method we can download a bunch of posts all at once. Here’s the latest 1,000 posts, downloaded simultaneously. It assumes you’ve already loaded the ID of the latest post into $id (in decimal).

$mh = curl_multi_init();
$curls = array();

for ($i = 0; $i < 10; ++$i) {
  $ch = curl_init();
  $url = "https://www.reddit.com/r/all/new/.json?limit=100&after=t3_" . base_convert($id,10,36);
  print("$url\n");
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

  curl_multi_add_handle($mh, $ch);
  $curls[] = $ch;
  $id -= 100;
}

$running = null;
do {
  curl_multi_exec($mh, $running);
} while ($running);

echo "Finished downloading\n";

foreach ($curls as $ch) {
  $response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  $response_time = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
  $posts = curl_multi_getcontent($ch);
  $posts = json_decode($posts, true);

  print("Response: $response_code, Time: $response_time, Post count: " .
    count($posts['data']['children']) . ", Starting id: " .
    $posts['data']['children'][0]['data']['id'] . "\n");

  curl_multi_remove_handle($mh, $ch);
}

curl_multi_close($mh);
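
One side note on that download loop: as written, it spins the CPU while it waits on the network. That’s fine for a demo, but on a CPU-starved VPS you may prefer a variation that blocks in curl_multi_select() between checks, something like the sketch below. (The same goes for the identical loop in the next section.)

$running = null;
do {
  curl_multi_exec($mh, $running);
  if ($running) {
    curl_multi_select($mh, 1.0); /* Wait up to one second for network activity. */
  }
} while ($running);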

If you try to push that back past about 1,000 posts you’ll just get empty replies. Assuming that you’re ok with the 1,000 post limit, we’re done. We’ve got the API totally figured out, scraping is a solved problem, and you’re ready to finish building your bot.

Except… if you look at the post IDs you’re getting, you’ll find some funny business. Some posts are repeated. That’s not a big deal, we can ignore them, but also some posts are missing. That’s a big deal! We aren’t slurping all of Reddit if we’re missing posts.

The missing posts are caused by the IDs not being totally in order. Or maybe it’s just some other bug. I don’t know. In any case, there’s nothing we can do about it here that I know of. We’ll need another solution. The 1,000 post limit was really annoying anyway. And the 1,000 comment limit meant we were going to need to scrape every 10 seconds just to avoid missing anything. That wasn’t really viable long-term.

A Completely New Approach

Reddit’s listing APIs are terrible for trying to see every post and comment. To be fair, they probably weren’t designed for that. I guess they’re just designed to work with their website to render pages. No real human would want to keep clicking past 1,000 posts.

So here’s the approach I ended up using, which worked much better: request each post by its ID. That’s right, instead of asking for posts in batches of 100, we’re going to need to ask for each post individually by its post ID. We’ll do the same for comments.

This has a few huge advantages. First, and most important, we won’t miss any posts. Second, we won’t get any post twice. Third, we can go back in time as far as we want. This means we can run our scraper in huge batches every few minutes. If there is downtime, we can go back and get the old posts that we missed. It’s perfect.

Here’s how we do it. We find the starting post ID, and then we get posts individually from https://api.reddit.com/api/info.json?id=. We’ll need to add a big list of post IDs to the end of that API URL.

So here are 2,000 posts, spread out over 20 batches of 100 that we download simultaneously. It assumes you’ve already loaded the latest post ID into $id, in base 36 (i.e., in Reddit’s format, not converted to decimal).

print("Starting id: $id\n");

$urls = array();
for ($i = 0; $i < 20; ++$i) {
  $ids = array();
  for ($j = 0; $j < 100; ++$j) {
    $ids[] = "t3_" . $id;
    $id = base_convert((base_convert($id,36,10) - 1), 10, 36);
  }
  $urls[] = "https://api.reddit.com/api/info.json?id=" . implode(',', $ids);
}



$mh = curl_multi_init();
$curls = array();

foreach ($urls as $url) {
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_USERAGENT, "This is actually needed!");
  print("$url\n");
  curl_multi_add_handle($mh, $ch);
  $curls[] = $ch;
}

$running = null;
do {
  curl_multi_exec($mh, $running);
} while ($running);

print("Finished downloading\n");

foreach ($curls as $ch) {
  $response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  $response_time = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
  $posts = json_decode(curl_multi_getcontent($ch), true);

  print("Response: $response_code, Time: $response_time, Post count: " .
    count($posts['data']['children']) . ", Starting id: " .
    $posts['data']['children'][0]['data']['id'] . "\n");

  curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);

We’ll end up with some pretty long URLs, like the ones printed by the script above, but Reddit doesn’t seem to mind.

Also, please note that we actually need to set the user-agent for the info.json API endpoint. Apparently Reddit has blocked default PHP/curl requests from it. It’s probably a good idea to set the user-agent for everything; I’ve only omitted it above to shorten the code samples. See Reddit’s API access rules for more info.

From the above script, you’ll get an output like this:

Finished downloading
Response: 200, Time: 5.735, Post count: 96, Starting id: 90yv0w
Response: 200, Time: 7.36, Post count: 99, Starting id: 90yuy4
Response: 200, Time: 6.532, Post count: 100, Starting id: 90yuvc
Response: 200, Time: 6.547, Post count: 99, Starting id: 90yusk
Response: 200, Time: 7.344, Post count: 99, Starting id: 90yups
Response: 200, Time: 7.25, Post count: 94, Starting id: 90yun0
Response: 200, Time: 6.672, Post count: 83, Starting id: 90yuk8
Response: 200, Time: 7.469, Post count: 97, Starting id: 90yuhg
Response: 200, Time: 6.344, Post count: 91, Starting id: 90yueo
Response: 200, Time: 7.187, Post count: 97, Starting id: 90yubw
Response: 200, Time: 22.734, Post count: 96, Starting id: 90yu94
Response: 200, Time: 7.453, Post count: 93, Starting id: 90yu6c
Response: 200, Time: 7.359, Post count: 99, Starting id: 90yu3k
Response: 200, Time: 7.812, Post count: 100, Starting id: 90yu0s
Response: 200, Time: 7.703, Post count: 100, Starting id: 90yty0
Response: 200, Time: 6.375, Post count: 99, Starting id: 90ytv8
Response: 200, Time: 6.734, Post count: 97, Starting id: 90ytsg
Response: 200, Time: 7.328, Post count: 97, Starting id: 90ytpo
Response: 200, Time: 7.812, Post count: 98, Starting id: 90ytmw
Response: 200, Time: 7.812, Post count: 98, Starting id: 90ytk4

So even though we asked for posts in batches of 100, many batches come up short. There are two reasons for this. First, some posts are going to be in private communities you don’t have access to; you won’t be able to see them, and there’s nothing you can do about that. Second, because post IDs aren’t assigned in perfect order, it could be that some of the missing posts simply haven’t been written yet. In that case, we just wait a few seconds and do another request with the missing IDs. Not a big deal.
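
Here’s a rough sketch of that retry step, assuming $ids still holds the t3_ fullnames we requested for a given batch and $posts is the decoded response for that same batch (hypothetical variable names, not F5Bot’s exact code):

/* Figure out which requested IDs didn't come back, so we can retry them. */
$received = array();
foreach ($posts['data']['children'] as $child) {
  $received[] = "t3_" . $child['data']['id'];
}
$missing = array_diff($ids, $received);
if ($missing) {
  $retry_url = "https://api.reddit.com/api/info.json?id=" . implode(',', $missing);
  /* Sleep a few seconds, fetch $retry_url, and merge in whatever shows up. */
}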

By the way, all of this works perfectly for comments too. The only difference is that comment IDs start with t1_ instead of t3_.

The other issue you may run into when loading comments is that the API doesn’t return the title of the post that the comment is attached to. If you need it (F5Bot does), you’ll have to use the parent_id field to get the parent post’s ID and then make a separate call to load that post just so you can grab its title. Luckily, you can batch a bunch of these lookups together into one request.
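
A sketch of that extra lookup might look like this. It assumes $comments holds a decoded comment listing; also note that for a reply to another comment, parent_id is a t1_ comment fullname rather than the post, in which case the comment’s link_id field points straight at the post.

/* Collect the parent post fullnames for a batch of comments. */
$post_ids = array();
foreach ($comments['data']['children'] as $child) {
  $parent = $child['data']['parent_id'];
  if (strpos($parent, 't3_') !== 0) {
    $parent = $child['data']['link_id']; /* Nested reply: link_id is the post. */
  }
  $post_ids[$parent] = $parent; /* Array key de-duplicates. */
}
$url = "https://api.reddit.com/api/info.json?id=" . implode(',', $post_ids);
/* Fetch $url as before; each returned post's title field is what we want. */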

Working the Data

After I got it figured out, scraping all of Reddit in real time wasn’t that tough. I launched F5Bot and it worked fine for a long time. Eventually, however, I ran into a second problem: processing the data became the bottleneck. Remember, I’m doing this on a tiny VPS. It has a fast connection, but an anemic CPU.

F5Bot has to search every post and comment for all of the keywords that all of my users have. So it started out as something like this:

foreach ($new_posts as $post) {
  foreach ($all_keywords as $keyword) {
    if (strpos($post, $keyword) !== FALSE) {
      /* Found a post that matches a keyword. */
    }
  }
}

As you can imagine, as I got more users and more keywords, I eventually ended up searching each post for thousands of keywords. It got a bit slow.

Eventually I converted over to using the Aho-Corasick string searching algorithm. It’s really slick. You put your keywords into a tree structure as a pre-processing step. Then you only need to look at each post one time to see which keywords it contains.

I couldn’t find an Aho-Corasick implementation in PHP, so I wrote my own. It’s up on Github here.
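To give a flavor of how it works, here’s a compact sketch of the idea (just an illustration, not the library linked above): every keyword goes into a trie, a breadth-first pass adds “failure” links that say where to fall back when a character doesn’t match, and then a single pass over the text reports every keyword it contains.

class TinyAhoCorasick {
  private $next = array(array()); /* $next[$state][$char] => next state  */
  private $fail = array(0);       /* Failure link for each state.        */
  private $out  = array(array()); /* Keywords that end at each state.    */

  public function add($keyword) {
    $state = 0;
    foreach (str_split($keyword) as $ch) {
      if (!isset($this->next[$state][$ch])) {
        $this->next[] = array();
        $this->fail[] = 0;
        $this->out[]  = array();
        $this->next[$state][$ch] = count($this->next) - 1;
      }
      $state = $this->next[$state][$ch];
    }
    $this->out[$state][] = $keyword;
  }

  public function build() {
    /* Breadth-first pass to compute failure links. */
    $queue = array();
    foreach ($this->next[0] as $s) {
      $queue[] = $s; /* Depth-1 states fail back to the root. */
    }
    while ($queue) {
      $r = array_shift($queue);
      foreach ($this->next[$r] as $ch => $s) {
        $queue[] = $s;
        $f = $this->fail[$r];
        while ($f !== 0 && !isset($this->next[$f][$ch])) {
          $f = $this->fail[$f];
        }
        $this->fail[$s] = isset($this->next[$f][$ch]) ? $this->next[$f][$ch] : 0;
        $this->out[$s] = array_merge($this->out[$s], $this->out[$this->fail[$s]]);
      }
    }
  }

  public function search($text) {
    $found = array();
    $state = 0;
    foreach (str_split($text) as $ch) {
      while ($state !== 0 && !isset($this->next[$state][$ch])) {
        $state = $this->fail[$state];
      }
      if (isset($this->next[$state][$ch])) {
        $state = $this->next[$state][$ch];
      }
      foreach ($this->out[$state] as $keyword) {
        $found[$keyword] = true;
      }
    }
    return array_keys($found);
  }
}

$ac = new TinyAhoCorasick();
foreach (array('guitar', 'f5bot', 'vps') as $keyword) {
  $ac->add($keyword);
}
$ac->build();
print_r($ac->search(strtolower("I practice my guitar every day."))); /* Finds "guitar". */

The preprocessing cost is paid once per keyword list, and after that each post costs a single scan no matter how many thousands of keywords are loaded, which is exactly what an anemic CPU calls for.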

Conclusion

Scraping all of Reddit in real time doesn’t require a lot of processing power. You just have to be careful and know how to work around their APIs. If you do implement any of these ideas, please be sure to follow Reddit’s API access rules. It’s cool of them to offer a public API.

I hope you enjoyed the write-up! Special thanks to Intoli and Evan Sangaline for the idea to write this article and for hosting it here.

If you’d like to know when people on Reddit or Hacker News are talking about you, your company, or your product, please give F5Bot a try. It’s free.

If you want to read more of my posts you can check out my blog at https://codeplea.com or follow me on Github.
