Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
How to Scrape Reddit Using Python, Requests, and Beautifulsoup
Reddit provides an API that can be used for extracting data from Reddit. Before you even think of scraping publicly available data from Reddit's web pages, confirm that the API does not already give you what you need.
The aim of this article is to get you started on solving a real-world problem while keeping things simple, so you get familiar with the tools and get practical results as fast as possible.
So the first thing we need is to make sure we have Python 3 installed. If not, you can just get Python 3 and get it installed before you proceed.
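Then you can install Beautiful Soup with pip install beautifulsoup4.
We will also need the libraries requests, lxml, and soupsieve to fetch data, parse it into a tree, and use CSS selectors. Install them using pip install requests lxml soupsieve.
Once installed, open an editor and start a new Python script that imports BeautifulSoup and requests.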
Now let's go to the programming subreddit and inspect the data we can get.
Back to our code now. Let's try to get this data by pretending we are a browser, which means sending a browser-like User-Agent header with the request.
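A minimal sketch of that request, assuming the requests library installed earlier. The User-Agent string is only an example value (any current desktop browser string should do), and fetch_html is a helper name introduced here for illustration, not something from the original script.

```python
import requests

# Assumed example value: a typical desktop Chrome User-Agent string.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}
URL = "https://www.reddit.com/r/programming/"


def fetch_html(url, headers=HEADERS, timeout=10):
    """Download a page while sending browser-like headers and return its HTML."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Fail loudly on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling print(fetch_html(URL)) at the bottom of the script prints the raw page, which is a quick way to check that the request is not being rejected.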
Save this as reddit_bs.py.
If you run it with python3 reddit_bs.py, you will see the whole HTML page.
Now, let's use CSS selectors to get to the data we want. To do that let's go back to Chrome and open the inspect tool. You can see that all the post title elements have a class called review-title in them.
Let's use CSS selectors to get this data like so.
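As a rough sketch, the selector step could look like the following. It runs against a small hand-written HTML snippet (a hypothetical stand-in for the downloaded page) so it works offline, and it uses the review-title class named above, which may not match Reddit's current markup. The built-in html.parser is used so the sketch needs no extra dependency; "lxml" can be passed instead if it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; real markup will differ.
SAMPLE_HTML = """
<div class="Post"><h3 class="review-title">First post</h3></div>
<div class="Post"><h3 class="review-title">Second post</h3></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one returns the first element matching the CSS selector, or None.
first_title = soup.select_one(".review-title")
if first_title is not None:
    print(first_title.get_text(strip=True))
```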
This will print the title of the first post. We now need to get to all the posts. We notice that the class 'Post' (amongst others) holds all the individual data together.
To get to them individually, we loop through them like this and try to get the post title from 'inside' each 'Post'.
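The loop described above can be sketched as follows, again on a hand-written snippet rather than the live page. The Post and review-title class names come from the text; the sample markup and the post_titles helper are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two normal posts and one container without a title.
SAMPLE_HTML = """
<div class="Post"><h3 class="review-title">First post</h3></div>
<div class="Post"><span>promoted block without a title</span></div>
<div class="Post"><h3 class="review-title">Second post</h3></div>
"""


def post_titles(html):
    """Return the title text found inside each 'Post' container."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for post in soup.select(".Post"):
        # Search only within this post, not the whole document.
        title = post.select_one(".review-title")
        if title is not None:  # skip containers that have no title element
            titles.append(title.get_text(strip=True))
    return titles


for title in post_titles(SAMPLE_HTML):
    print(title)
```

Searching from each container rather than from the whole page keeps the fields of one post together, which matters once more than one field is extracted.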
And when you run it, bingo! We get the post titles.
Now, with the same process, we get the class names of all the other data, like the post votes, the number of comments, the link to the post, etc.
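A sketch of how the extra fields might be collected per post. The class names post-votes, post-comments, and post-link are placeholders invented for this example, since the real names have to be read from the inspector and change over time; only Post and review-title come from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; every class except Post and review-title is a placeholder.
SAMPLE_HTML = """
<div class="Post">
  <a class="post-link" href="/r/programming/comments/abc/first_post/">
    <h3 class="review-title">First post</h3>
  </a>
  <span class="post-votes">120</span>
  <span class="post-comments">14 comments</span>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match inside parent, or None."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_posts(html):
    """Collect title, votes, comments, and link for every post container."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for post in soup.select(".Post"):
        link = post.select_one("a.post-link")
        posts.append({
            "title": text_or_none(post, ".review-title"),
            "votes": text_or_none(post, ".post-votes"),
            "comments": text_or_none(post, ".post-comments"),
            "link": link.get("href") if link is not None else None,
        })
    return posts


for post in parse_posts(SAMPLE_HTML):
    print(post)
```

Returning None for a missing field, instead of raising, keeps one oddly shaped post from stopping the whole run.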
That, when run, should print everything we need from each post.
If you want to use this in production and scale to thousands of links, you will find that Reddit blocks your IP quickly. In this scenario, using a rotating proxy service to rotate IPs is almost a must.
Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.
- With millions of high speed rotating proxies located all over the world,
- With our automatic IP rotation
- With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
- With our automatic CAPTCHA solving technology,
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
How to Scrape Reddit Using Python Scrapy
Scrapy is one of the most accessible tools that you can use to scrape and also spider a website with effortless ease.
Today let's see how we can scrape Reddit to get new posts from a subreddit like r/programming.
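First, we need to install Scrapy if you haven't already, using pip install scrapy.
Once installed, go ahead and create a project by invoking the startproject command: scrapy startproject scrapingproject.
This will output a confirmation that the new project was created, and it creates a scrapingproject folder that holds scrapy.cfg and an inner scrapingproject package with settings.py, items.py, pipelines.py, middlewares.py, and a spiders folder.
Now cd into scrapingproject. You will need to do it twice (cd scrapingproject, then cd scrapingproject again), because the project package sits in a folder of the same name.
Now we need a spider to crawl through the programming subreddit, so we use the genspider command to tell Scrapy to create one for us. We call the spider ourfirstbot and pass it the URL of the subreddit: scrapy genspider ourfirstbot https://www.reddit.com/r/programming/
This should report that the spider was created successfully.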
Great. Now open the file ourfirstbot.py in the spiders folder. It should contain a spider class with name, allowed_domains, and start_urls attributes and an empty parse method.
Let's examine this code before we proceed.
The allowed_domains list restricts all further crawling to the domains specified here.
start_urls is the list of URLs to crawl. For us, in this example, we only need one URL.
The parse(self, response) method is called by Scrapy after every successful URL crawl. This is where we can write our code to extract the data we want.
We now need to find the CSS selectors of the elements we want to extract. Go to the URL https://www.reddit.com/r/programming/, right-click on the title of one of the posts, and click Inspect. This will open the Google Chrome inspector.
You can see that the CSS class name of the title is _eYtD2XCVieq6emjKBH3m, so we are going to ask Scrapy to get us the text of the elements with this class, using a selector such as ._eYtD2XCVieq6emjKBH3m::text.
Similarly, we try to find the class names of the votes element and the number-of-comments element (note that the class names might change by the time you run this code).
If you are unfamiliar with CSS selectors, you can refer to this page by Scrapy: https://docs.scrapy.org/en/latest/topics/selectors.html
We now use the zip function to pair up the items at the same index in each list, so that the title, votes, and comment count of each post can be handled as a single entity.
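The pairing step can be sketched in plain Python. The three lists below are made-up sample values standing in for what the selectors return, and combine_fields is a helper name chosen for this example; in the spider, the same loop would sit inside parse and yield each dictionary.

```python
def combine_fields(titles, votes, comments):
    """Pair the i-th title, vote count, and comment count into one record."""
    for title, vote, comment in zip(titles, votes, comments):
        yield {"title": title, "votes": vote, "comments": comment}


# Made-up sample values standing in for the selector results.
titles = ["First post", "Second post"]
votes = ["120", "45"]
comments = ["14 comments", "3 comments"]

for item in combine_fields(titles, votes, comments):
    print(item)
```

Note that zip stops at the shortest list, so if one selector misses an element on some post, the remaining records can end up misaligned or dropped. Selecting per post container avoids that, at the cost of slightly more code.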
And now let's run this with the command scrapy crawl ourfirstbot.
And bingo, you get the results.
Now let's export the extracted data to a CSV file. All you have to do is provide the export file with the -o option, for example scrapy crawl ourfirstbot -o data.csv (data.csv is just an example file name),
or, if you want the data in JSON format, scrapy crawl ourfirstbot -o data.json.
Scaling Scrapy
The example above is OK for small-scale web crawling projects. But if you try to scrape large quantities of data at high speed from websites like Reddit, you will find that sooner or later your access will be restricted. Reddit can tell you are a bot, so one of the things you can do is run the crawler impersonating a web browser. This is done by passing a browser user agent string to the Reddit web server so it doesn't block you, for example through Scrapy's USER_AGENT setting.
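One way to do this, sketched below, is to set USER_AGENT in the project's settings.py. The string itself is only an example value; the same setting can also be supplied per spider through custom_settings.

```python
# settings.py (excerpt)
# Assumed example value: a typical desktop Chrome User-Agent string.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)
```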
In more advanced implementations, you will even need to rotate this string so Reddit can't tell it's the same browser! Welcome to web scraping.
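A minimal sketch of such rotation as a Scrapy downloader middleware, which is one common way to do it rather than the method the original article used. The class name, the list of User-Agent strings, and the settings path in the comment are all assumptions made for illustration.

```python
import random

# Assumed example values; in practice, keep this list current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header before the request is sent.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing the request


# To enable it, assuming the class lives in scrapingproject/middlewares.py,
# add this to settings.py:
# DOWNLOADER_MIDDLEWARES = {
#     "scrapingproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```

Rotating the string changes only one of the signals a site can use, so it does not help once the IP address itself is blocked.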
If we get a little bit more advanced, you will realise that Reddit can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.
Investing in a private rotating proxy service like Proxies API can often make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.
Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.
Once you have an API_KEY from Proxies API, you just have to change the URL in start_urls so that the request goes through the Proxies API endpoint.
We have only changed one line, in the start_urls array, and Proxies API now takes care of IP rotation, User-Agent string rotation, and rate limits for us.