A Slack Community for Developers to Discuss Web Scraping

By Evan Sangaline | June 21, 2018

The Web Scrapers Slack Community

Want to link up with other developers interested in web scraping? Join the Web Scrapers Slack Channel to chat about Selenium, Puppeteer, Scrapy, or anything else related to web scraping.

The last few years have been a very exciting time for web scraping. In that period, both Chrome and Firefox have introduced memory-efficient headless modes, which allow them to run on Linux servers without requiring X11 and a virtual framebuffer like xvfb. These developments effectively sunset headless WebKit renderers like PhantomJS overnight, and served as a catalyst for the web scraping community to shift its attention towards browser automation frameworks that can control Chrome and Firefox. Selenium is of course the Old Faithful of the browser automation space, but we’ve additionally seen the rise of many exciting new frameworks with more elegant and user-friendly APIs. Puppeteer has become a favorite for Chromium automation using the DevTools Protocol, and Remote Browser has made it possible to target cross-browser automation without the overhead of WebDriver by leveraging the power of the recently de facto standardized Web Extensions API.
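To make the headless mode mentioned above a little more concrete, here is a minimal sketch of invoking Chrome headlessly from Python and printing the rendered DOM. The `--headless` and `--dump-dom` flags are real Chrome command-line switches; the candidate binary names and the helper function names are assumptions made for illustration, and they will vary by platform and install.

```python
import shutil
import subprocess

# Assumed binary names; the real name depends on the platform and package.
CANDIDATE_BINARIES = ["google-chrome", "chromium", "chromium-browser"]


def headless_dump_dom_command(binary, url):
    """Build the argv list that asks Chrome to render `url` without a display
    and write the resulting DOM to stdout."""
    return [binary, "--headless", "--dump-dom", url]


def dump_dom(url, timeout=30):
    """Return the rendered DOM of `url` as a string, or None if none of the
    assumed Chrome binaries can be found on PATH."""
    for name in CANDIDATE_BINARIES:
        path = shutil.which(name)
        if path:
            result = subprocess.run(
                headless_dump_dom_command(path, url),
                capture_output=True,
                text=True,
                timeout=timeout,
                check=True,
            )
            return result.stdout
    return None
```

Calling `dump_dom("https://example.com")` on a server with Chrome installed should return the page's HTML after JavaScript has run, with no X11 or xvfb involved. For anything beyond a one-off fetch, an automation framework like the ones discussed here is likely a better fit.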

As developers have been drawn towards using full browsers for web scraping and automation by the increasing quality of the associated tooling, they’ve also been pushed there from the other direction. The massive popularity of frontend JavaScript frameworks like React, Vue, and Angular has made JavaScript a necessity for scraping many websites. At the same time, the proportion of sites using anti-scraping bot-mitigation services from Distil, Incapsula, and Akamai has grown tremendously. This has provoked a titillating arms race of opposing detection and concealment strategies for headless browsers.

The combination of bot-mitigation services with the rise of companies like LinkedIn using abusive litigious tactics has also pushed web scrapers towards more sophisticated tactics. Mobile SSL unpinning tools like Inspeckage, and proxy tools such as mitmproxy and Charles, are now widely used to reverse engineer internal APIs from mobile apps. This approach can be very difficult to detect, especially when used in conjunction with residential proxies which have been spurred on by the meteoric rise of Hola VPN (now Luminati).

Despite so many exciting developments in the web scraping space, we’ve found that there hasn’t really been a general forum for developers to talk about them. A lot of the discussion is spread across IRC channels devoted to single tools, blog posts and their comment sections, Hacker News threads, and GitHub issues. That inspired us to put together a new public Slack community for people involved with web scraping to ask questions, share stories and news, or discuss anything else related to web scraping.

I’ve mostly mentioned some of the more recent developments in web scraping here, but the Web Scrapers Slack Community is open for discussion about anything related to web scraping. More traditional scraping tools like Scrapy, Wombat, and Colly certainly still have their place in modern web scraping, and you should feel free to discuss them in the Slack channel as well. It would also be great for developers to share their experiences with different commercial services like Import.io, DiffBot, and Connotate.

After entering your email address below, you’ll immediately be sent an invitation to the Slack community. We look forward to chatting with you!
