React Web Scraper



Web scraping is the process of gathering information from the Internet. Even copy-pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation. Some websites don’t like it when automatic scrapers gather their data, while others don’t mind.

Scrapy is a popular open-source Python framework for writing scalable web scrapers. In this tutorial, we’ll take you step by step through using Scrapy to gather a list of Oscar-winning movies from Wikipedia.

Web scraping is a way to grab data from websites without needing access to APIs or the website’s database. You only need access to the site’s data — as long as your browser can access the data, you will be able to scrape it.

Realistically, you could usually go through a website manually and grab the data ‘by hand’ using copy and paste, but in many cases that would take hours of manual work and could end up costing far more than the data is worth, especially if you’ve hired someone to do the task for you. Why hire someone to work at 1–2 minutes per query when you can get a program to perform a query automatically every few seconds?

For example, let’s say that you wish to compile a list of the Oscar winners for best picture, along with their director, starring actors, release date, and run time. Using Google, you can see there are several sites that will list these movies by name, and maybe some additional information, but generally you’ll have to follow through with links to capture all the information you want.

Obviously, it would be impractical and time-consuming to go through every link from 1927 through to today and manually try to find the information through each page. With web scraping, we just need to find a website with pages that have all this information, and then point our program in the right direction with the right instructions.

In this tutorial, we will use Wikipedia as our source, since it contains all the information we need, and Scrapy, a Python framework, as our tool to scrape that information.

A few caveats before we begin:

Data scraping involves increasing the server load for the site that you’re scraping, which means a higher cost for the companies hosting the site and a lower quality experience for other users of that site. The quality of the server that is running the website, the amount of data you’re trying to obtain, and the rate at which you’re sending requests to the server all determine how much of an effect you have on the server. Keeping this in mind, we need to make sure that we stick to a few rules.

Most sites also have a file called robots.txt in their main directory. This file sets out the rules for which parts of the site the owners do not want scrapers to access. A website’s Terms & Conditions page will usually let you know what its policy on data scraping is. For example, IMDB’s conditions page has the following clause:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express-written consent as noted below.

Before we try to obtain a website’s data, we should always check the website’s terms and robots.txt to make sure we are allowed to collect it. When building our scrapers, we also need to make sure that we do not overwhelm a server with requests that it can’t handle.

Luckily, many websites recognize the need for users to obtain data, and they make the data available through APIs. If these are available, it’s usually a much easier experience to obtain data through the API than through scraping.

Wikipedia allows data scraping, as long as the bots aren’t going ‘way too fast’, as specified in their robots.txt. They also provide downloadable datasets so people can process the data on their own machines. If we go too fast, the servers will automatically block our IP, so we’ll implement timers in order to keep within their rules.

Getting Started, Installing Relevant Libraries Using Pip

First of all, let’s install Scrapy.

Windows

Install the latest version of Python from https://www.python.org/downloads/windows/

Note: Windows users will also need Microsoft Visual C++ 14.0, which you can grab from the “Microsoft Visual C++ Build Tools” download page.

You’ll also want to make sure you have the latest version of pip.

In cmd.exe, type in:
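
    rem upgrade pip, then install Scrapy and all of its dependencies
    python -m pip install --upgrade pip
    pip install scrapy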

This will install Scrapy and all the dependencies automatically.

Linux

First you’ll want to install all the dependencies:

In Terminal, enter:
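
For example, on Debian or Ubuntu (package names will differ on other distributions):

    # build dependencies for Scrapy and the libraries it relies on (Debian/Ubuntu example)
    sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev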

Once that’s all installed, just type in:
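
    # make sure pip itself is up to date
    pip3 install --upgrade pip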

To make sure pip is updated, and then:
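
    pip3 install scrapy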

And it’s all done.

Mac

First you’ll need to make sure you have a c-compiler on your system. In Terminal, enter:
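
    xcode-select --install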

After that, install homebrew from https://brew.sh/.

Update your PATH variable so that homebrew packages are used before system packages:
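
Assuming the default bash shell (use ~/.zshrc instead if you are on zsh), something like:

    # put Homebrew's install locations in front of the system paths
    echo "export PATH=/usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc
    source ~/.bashrc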

Install Python:
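
    brew install python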

And then make sure everything is updated:
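
    brew update
    brew upgrade python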

After that’s done, just install Scrapy using pip:
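
    pip install scrapy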

Overview Of Scrapy, How The Pieces Fit Together, Parsers, Spiders, Etc

You will be writing a script called a ‘Spider’ for Scrapy to run, but don’t worry, Scrapy spiders aren’t scary at all despite their name. The only similarity Scrapy spiders and real spiders have is that they like to crawl the web.

Inside the spider is a class that you define which tells Scrapy what to do: for example, where to start crawling, the types of requests it makes, how to follow links on pages, and how it parses data. You can also add custom functions to process the data before it is written out to a file.

Writing Your First Spider, Write A Simple Spider To Allow For Hands-on Learning

To start our first spider, we need to first create a Scrapy project. To do this, enter this into your command line:
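
    scrapy startproject oscars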

This will create a folder with your project.

We’ll start with a basic spider. The following code is to be entered into a Python script. Open a new Python script in /oscars/spiders and name it oscars_spider.py.

We’ll import Scrapy.
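
    import scrapy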

We then start defining our Spider class. First, we set the name and then the domains that the spider is allowed to scrape. Finally, we tell the spider where to start scraping from.
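
A sketch of that class (the class name and the exact Wikipedia URL are illustrative choices; the Academy Award for Best Picture page is the natural starting point):

    class OscarsSpider(scrapy.Spider):
        name = "oscars"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]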

Next, we need a function which will capture the information that we want. For now, we’ll just grab the page title. We use CSS to find the tag which carries the title text, and then we extract it. Finally, we return the information back to Scrapy to be logged or written to a file.
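
For example:

    # add this method inside the spider class
    def parse(self, response):
        data = {}
        # grab the text inside the page's <title> tag
        data["title"] = response.css("title::text").extract()
        yield data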

Now save the code in /oscars/spiders/oscars_spider.py

To run this spider, simply go to your command line and type:
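
    scrapy crawl oscars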

You should see Scrapy’s log output in your terminal, with the page title we grabbed among the scraped items.

Congratulations, you’ve built your first basic Scrapy scraper!

Full code:
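
Putting those pieces together, under the same naming assumptions:

    import scrapy


    class OscarsSpider(scrapy.Spider):
        name = "oscars"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

        def parse(self, response):
            data = {}
            data["title"] = response.css("title::text").extract()
            yield data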

Obviously, we want it to do a little bit more, so let’s look into how to use Scrapy to parse data.

First, let’s get familiar with the Scrapy shell. The Scrapy shell can help you test your code to make sure that Scrapy is grabbing the data you want.

To access the shell, enter this into your command line:
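
    scrapy shell "https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"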

This will basically open the page that you’ve directed it to and it will let you run single lines of code. For example, you can view the raw HTML of the page by typing in:
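
    print(response.text)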

Or open the page in your default browser by typing in:
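
    view(response)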

Our goal here is to find the code that contains the information that we want. For now, let’s try to grab the movie title names only.

The easiest way to find the code we need is by opening the page in our browser and inspecting the code. In this example, I am using Chrome DevTools. Just right-click on any movie title and select ‘inspect’:

As you can see, the Oscar winners have a yellow background while the nominees have a plain background. There’s also a link to the article about each movie, and the links for movies end in ‘film)’. Now that we know this, we can use a CSS selector to grab the data. In the Scrapy shell, type in:
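
The exact selector depends on the page’s current markup; at the time of writing, the winners’ rows were highlighted with a background colour, so something along these lines works:

    response.css("tr[style='background:#FAEB86'] a[href*='film)']").extract()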

As you can see, you now have a list of all the Oscar Best Picture Winners!

Going back to our main goal, we want a list of the Oscar winners for best picture, along with their director, starring actors, release date, and run time. To do this, we need Scrapy to grab data from each of those movie pages.

We’ll have to rewrite a few things and add a new function, but don’t worry, it’s pretty straightforward.

We’ll start by initiating the scraper the same way as before.

But this time, two things will change. First, we’ll import time along with scrapy because we want to create a timer to restrict how fast the bot scrapes. Also, when we parse the pages the first time, we want to only get a list of the links to each title, so we can grab information off those pages instead.
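
A sketch of the reworked spider, under the same naming assumptions as before and with the same caveat about the selector:

    import time

    import scrapy


    class OscarsSpider(scrapy.Spider):
        name = "oscars"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

        def parse(self, response):
            # follow only the links in the highlighted (winning) rows that end in "film)"
            for href in response.css("tr[style='background:#FAEB86'] a[href*='film)']::attr(href)").extract():
                url = response.urljoin(href)
                req = scrapy.Request(url, callback=self.parse_titles)
                # wait 5 seconds between requests to stay within Wikipedia's rules
                time.sleep(5)
                yield req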

Here we loop over every link on the page that ends in ‘film)’ and sits in a row with the yellow background, turn each one into a full URL, and send those URLs on to the parse_titles function to process further. We also slip in a timer so that it only requests pages every 5 seconds. Remember, we can use the Scrapy shell to test our response.css fields to make sure we’re getting the correct data!
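
A sketch of that function; the selectors are illustrative and, as noted earlier, may need adjusting if Wikipedia’s infobox markup has changed:

    # add this method to the OscarsSpider class above
    def parse_titles(self, response):
        data = {}
        data["title"] = response.css("h1#firstHeading i::text").extract()
        data["director"] = response.css("tr:contains('Directed by') a[href*='/wiki/']::text").extract()
        data["starring"] = response.css("tr:contains('Starring') a[href*='/wiki/']::text").extract()
        data["releasedate"] = response.css("tr:contains('Release date') li::text").extract()
        data["runtime"] = response.css("tr:contains('Running time') td::text").extract()
        yield data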

The real work gets done in our parse_titles function, where we create a dictionary called data and then fill each key with the information we want. Again, all these selectors were found using Chrome DevTools as demonstrated before and then tested with the Scrapy shell.

The final line returns the data dictionary back to Scrapy to store.

Complete code:
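
A complete sketch, under the same assumptions:

    import time

    import scrapy


    class OscarsSpider(scrapy.Spider):
        name = "oscars"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

        def parse(self, response):
            # follow each winning film's link, throttled to one request every 5 seconds
            for href in response.css("tr[style='background:#FAEB86'] a[href*='film)']::attr(href)").extract():
                url = response.urljoin(href)
                req = scrapy.Request(url, callback=self.parse_titles)
                time.sleep(5)
                yield req

        def parse_titles(self, response):
            # pull the fields we want out of each film's infobox
            data = {}
            data["title"] = response.css("h1#firstHeading i::text").extract()
            data["director"] = response.css("tr:contains('Directed by') a[href*='/wiki/']::text").extract()
            data["starring"] = response.css("tr:contains('Starring') a[href*='/wiki/']::text").extract()
            data["releasedate"] = response.css("tr:contains('Release date') li::text").extract()
            data["runtime"] = response.css("tr:contains('Running time') td::text").extract()
            yield data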

Sometimes we will want to use proxies as websites will try to block our attempts at scraping.

To do this, we only need to change a few things. Using our example, in our def parse(), we need to change it to the following:
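
Scrapy’s built-in HttpProxyMiddleware picks the proxy up from the request’s meta; the address below is just a placeholder:

    def parse(self, response):
        for href in response.css("tr[style='background:#FAEB86'] a[href*='film)']::attr(href)").extract():
            url = response.urljoin(href)
            req = scrapy.Request(url, callback=self.parse_titles)
            # route this request through a proxy server (placeholder address)
            req.meta["proxy"] = "http://yourproxy.example.com:80"
            time.sleep(5)
            yield req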

This will route the requests through your proxy server.

Deployment And Logging, Show How To Actually Manage A Spider In Production

Now it is time to run our spider. To make Scrapy start scraping and then output to a CSV file, enter the following into your command prompt:
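
    scrapy crawl oscars -o oscars.csv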

You will see a large output, and after a couple of minutes, it will complete and you will have a CSV file sitting in your project folder.

Compiling Results, Show How To Use The Results Compiled In The Previous Steps

When you open the CSV file, you will see all the information we wanted, sorted into columns with headings. It’s really that simple.

With data scraping, we can obtain almost any custom dataset that we want, as long as the information is publicly available. What you want to do with this data is up to you. This skill is extremely useful for doing market research, keeping information on a website updated, and many other things.

It’s fairly easy to set up your own web scraper to obtain custom datasets on your own, however, always remember that there might be other ways to obtain the data that you need. Businesses invest a lot into providing the data that you want, so it’s only fair that we respect their terms and conditions.

Additional Resources For Learning More About Scrapy And Web Scraping In General

  • “The 10 Best Data Scraping Tools and Web Scraping Tools,” Scraper API
  • “5 Tips For Web Scraping Without Getting Blocked or Blacklisted,” Scraper API
  • Parsel, a Python library for extracting data from HTML with CSS selectors, XPath, and regular expressions.

Web scraping relies on the HTML structure of a page, so it can never be completely stable. When the HTML structure changes, the scraper may break. Keep this in mind while reading: by the time you read this, the CSS selectors used here may already be outdated.

In the previous article, we created a scraper to parse movie data from IMDB. We also used a simple in-memory queue to avoid sending hundreds or thousands of concurrent requests and thus to avoid being blocked. But what if you are already blocked? The site that you are scraping has added your IP to its blacklist, and you don’t know whether it is a temporary block or a permanent one.

Such issues can be resolved with a proxy server. Using proxies and rotating IP addresses can prevent you from being detected as a scraper. The idea of rotating IP addresses while scraping is to make your scraper look like real users accessing the website from multiple locations. If you implement it right, you drastically reduce the chances of being blocked.

In this article, I will show you how to send concurrent HTTP requests with ReactPHP through a proxy server. We will play around with some concurrent HTTP requests and then come back to the scraper we wrote before, updating it to use a proxy server for performing requests.

How to send requests through a proxy in ReactPHP

For sending concurrent HTTP requests we will use the clue/reactphp-buzz package. To install it, run the following command:
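
    composer require clue/buzz-react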

Now, let’s write a simple asynchronous HTTP request:
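
A minimal sketch of such a request, using the classes described below:

    <?php

    require __DIR__ . '/vendor/autoload.php';

    $loop = React\EventLoop\Factory::create();
    $client = new Clue\React\Buzz\Browser($loop);

    // request the page and print its HTML once the response arrives
    $client->get('http://google.com')
        ->then(function (Psr\Http\Message\ResponseInterface $response) {
            echo (string) $response->getBody();
        });

    $loop->run();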

We create an instance of Clue\React\Buzz\Browser, which is an asynchronous HTTP client. Then we request the Google web page via the get($url) method. Method get($url) returns a promise, which resolves with an instance of Psr\Http\Message\ResponseInterface. The snippet above requests http://google.com and then prints its HTML.

For a more detailed explanation of working with this asynchronous HTTP client, check the dedicated post on it.

Class Browser is very flexible. You can specify different connection settings, like DNS resolution, TLS parameters, timeouts and, of course, proxies. All these settings are configured within an instance of React\Socket\Connector. Class Connector accepts a loop and then a configuration array. So, let’s create one and pass it to our client as a second argument.
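
For example:

    $loop = React\EventLoop\Factory::create();

    // resolve host names via Google's public DNS server
    $connector = new React\Socket\Connector($loop, [
        'dns' => '8.8.8.8',
    ]);

    $client = new Clue\React\Buzz\Browser($loop, $connector);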

This connector tells the client to use 8.8.8.8 for DNS resolution.

Before we can start using a proxy we need to install the clue/reactphp-socks package:
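
    composer require clue/socks-react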

This library provides a SOCKS4, SOCKS4a and SOCKS5 proxy client/server implementation for ReactPHP. In our case, we need the client. It will be used to connect to a proxy server; our main HTTP client will then send its connections through that proxy.
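
For example:

    $loop = React\EventLoop\Factory::create();

    // connect to a SOCKS proxy server listening on 127.0.0.1:1080
    $proxy = new Clue\React\Socks\Client('127.0.0.1:1080', new React\Socket\Connector($loop));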

Notice that this 127.0.0.1:1080 is just a dummy address. Of course, there is no proxy server running on our machine.

The constructor of the Clue\React\Socks\Client class accepts the address of the proxy server (127.0.0.1:1080) and an instance of the Connector. We have already covered the Connector above; here we create an empty one, with no configuration array.

The name Clue\React\Socks\Client may suggest that it is one more HTTP client in our code, but it is not the same thing as Clue\React\Buzz\Browser: it doesn’t send requests. Consider it a connection rather than a client. Its main purpose is to establish a connection to a proxy server; the real client then uses this connection to perform requests.

To use this proxy connection we need to update the connector and specify the tcp option:
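
    // perform all TCP connections through the proxy client
    $connector = new React\Socket\Connector($loop, [
        'tcp' => $proxy,
    ]);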

The full code now looks like this:
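
    <?php

    require __DIR__ . '/vendor/autoload.php';

    $loop = React\EventLoop\Factory::create();

    // dummy proxy address for now; replace it with a real one
    $proxy = new Clue\React\Socks\Client('127.0.0.1:1080', new React\Socket\Connector($loop));

    $connector = new React\Socket\Connector($loop, [
        'tcp' => $proxy,
    ]);

    $client = new Clue\React\Buzz\Browser($loop, $connector);

    $client->get('http://google.com')
        ->then(function (Psr\Http\Message\ResponseInterface $response) {
            echo (string) $response->getBody();
        });

    $loop->run();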

Now, the problem is: where to get a real proxy?

Let’s find a proxy

On the Internet, you can find many sites dedicated to providing free proxies. For example, you can use https://www.socks-proxy.net. Visit it and pick a proxy from the Socks Proxy list.

In this tutorial, I use 184.178.172.13:15311.

This particular proxy will probably no longer work by the time you read this article. Please pick another proxy from the site mentioned above.

Now, the working example looks like this:
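
    <?php

    require __DIR__ . '/vendor/autoload.php';

    $loop = React\EventLoop\Factory::create();

    $proxy = new Clue\React\Socks\Client('184.178.172.13:15311', new React\Socket\Connector($loop));

    $connector = new React\Socket\Connector($loop, [
        'tcp' => $proxy,
    ]);

    $client = new Clue\React\Buzz\Browser($loop, $connector);

    $client->get('http://google.com')
        ->then(
            function (Psr\Http\Message\ResponseInterface $response) {
                echo (string) $response->getBody();
            },
            function (Exception $error) {
                // a proxy (especially a free one) may be dead, so report failures
                echo 'Request failed: ' . $error->getMessage() . PHP_EOL;
            }
        );

    $loop->run();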

Notice that I have added an onRejected callback. A proxy server might not work (especially a free one), so it is useful to show an error if our request has failed. Run the code and you will see the HTML code of the Google main page.

Updating the scraper

To refresh your memory, here is the consumer code of the scraper from the previous article:
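
A rough sketch matching the description below (the Scraper constructor signature, its getMovieData() getter and the IMDB URLs here are assumptions based on that article):

    <?php

    require __DIR__ . '/vendor/autoload.php';

    $loop = React\EventLoop\Factory::create();
    $client = new Clue\React\Buzz\Browser($loop);

    // Scraper is the class built in the previous article
    $scraper = new Scraper($client, $loop);
    $scraper->scrape([
        'https://www.imdb.com/title/tt0111161/', // placeholder movie URLs
        'https://www.imdb.com/title/tt0068646/',
    ], 40);

    $loop->run();
    print_r($scraper->getMovieData());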

We create an event loop. Then we create an instance of Clue\React\Buzz\Browser. The scraper uses this instance to perform concurrent requests. We scrape two URLs with a 40-second timeout. As you can see, we don’t even need to touch the scraper’s code. All we need is to update the Browser constructor and provide a Connector configured to use a proxy server. First, create a proxy client with an empty connector:
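
    $proxy = new Clue\React\Socks\Client('184.178.172.13:15311', new React\Socket\Connector($loop));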

Then we need a new connector for the Browser with a configured tcp option, where we provide our proxy client:
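
    $connector = new React\Socket\Connector($loop, [
        'tcp' => $proxy,
    ]);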

And the last step is to update the Browser constructor by providing the connector:
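
    $client = new Clue\React\Buzz\Browser($loop, $connector);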

The updated proxy version looks like this:
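
Under the same assumptions as the sketch above:

    <?php

    require __DIR__ . '/vendor/autoload.php';

    $loop = React\EventLoop\Factory::create();

    $proxy = new Clue\React\Socks\Client('184.178.172.13:15311', new React\Socket\Connector($loop));
    $connector = new React\Socket\Connector($loop, ['tcp' => $proxy]);
    $client = new Clue\React\Buzz\Browser($loop, $connector);

    $scraper = new Scraper($client, $loop);
    $scraper->scrape([
        'https://www.imdb.com/title/tt0111161/', // placeholder movie URLs
        'https://www.imdb.com/title/tt0068646/',
    ], 40);

    $loop->run();
    print_r($scraper->getMovieData());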

But, as I mentioned before, proxies might not work, and it would be nice to know why we have scraped nothing. So it looks like we still have to update the scraper’s code and add error handling. The part of the scraper which performs HTTP requests looks like this:
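
A sketch of that part; the $client and $loop properties and the scrapeMovieData() helper are assumed from the previous article’s description:

    public function scrape(array $urls, $timeout)
    {
        foreach ($urls as $url) {
            // request the page and scrape its body once it arrives
            $promise = $this->client->get($url)->then(
                function (\Psr\Http\Message\ResponseInterface $response) {
                    $this->scrapeMovieData((string) $response->getBody());
                }
            );

            // cancel the request if it doesn't finish within $timeout seconds
            $this->loop->addTimer($timeout, function () use ($promise) {
                $promise->cancel();
            });
        }
    }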

The request logic is located inside the scrape() method. We loop through the specified URLs and perform a concurrent request for each of them. Each request returns a promise. As an onFulfilled handler, we provide a closure where the response body is scraped. Then we set a timer to cancel the promise, and thus the request, on timeout. One thing is missing here: there is no error handling for this promise. When the parsing is done, there is no way to figure out what errors have occurred. It would be nice to have a list of errors, with URLs as keys and the corresponding errors as values. So, let’s add a new $errors property and a getter for it:
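
For example (the getter name is a natural choice rather than anything prescribed):

    // a map of URL => error message for requests that have failed
    private $errors = [];

    public function getErrors()
    {
        return $this->errors;
    }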

Then we need to update method scrape() and add a rejection handler for the request promise:
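
Continuing the same sketch:

    $promise = $this->client->get($url)->then(
        function (\Psr\Http\Message\ResponseInterface $response) {
            $this->scrapeMovieData((string) $response->getBody());
        },
        // onRejected: remember which URL failed and why
        function (\Exception $reason) use ($url) {
            $this->errors[$url] = $reason->getMessage();
        }
    );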

When an error occurs, we store it inside the $errors property with the appropriate URL. Now we can keep track of all the errors during the scraping. Also, before scraping, don’t forget to reset the $errors property to an empty array. Otherwise, we will keep storing old errors. Here is an updated version of the scrape() method:
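
Still under the same assumptions:

    public function scrape(array $urls, $timeout)
    {
        // reset the errors from any previous run
        $this->errors = [];

        foreach ($urls as $url) {
            $promise = $this->client->get($url)->then(
                function (\Psr\Http\Message\ResponseInterface $response) {
                    $this->scrapeMovieData((string) $response->getBody());
                },
                function (\Exception $reason) use ($url) {
                    $this->errors[$url] = $reason->getMessage();
                }
            );

            $this->loop->addTimer($timeout, function () use ($promise) {
                $promise->cancel();
            });
        }
    }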

Now, the consumer code can be the following:
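
For example, still under the same assumptions:

    $scraper->scrape([
        'https://www.imdb.com/title/tt0111161/', // placeholder movie URLs
        'https://www.imdb.com/title/tt0068646/',
    ], 40);

    $loop->run();

    print_r($scraper->getMovieData());
    print_r($scraper->getErrors());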

At the end of this snippet, we print both scraped data and errors. A list of errors can be very useful. In addition to the fact that we can track dead proxies, we can also detect whether we are banned or not.

What if my proxy requires authentication?

All the examples above work fine for free proxies, but when you are serious about scraping, chances are high that you have private proxies. In most cases they require authentication. Providing your credentials is very simple; just update your proxy connection string like this:
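
    // username and password go in front of the proxy host
    $proxy = new Clue\React\Socks\Client(
        'username:password@184.178.172.13:15311',
        new React\Socket\Connector($loop)
    );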

But keep in mind that if your credentials contain special characters, they should be URL-encoded:
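
    $user = 'us;er';
    $pass = 'p@ss';

    // encode special characters before embedding the credentials in the URI
    $proxy = new Clue\React\Socks\Client(
        rawurlencode($user) . ':' . rawurlencode($pass) . '@184.178.172.13:15311',
        new React\Socket\Connector($loop)
    );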

You can find examples from this article on GitHub.

This article is a part of the ReactPHP Series.
