Web Scraping Using Node.js

Zain Nisar
Dec 29, 2015

Web Scraping

Web scraping is the software technique of extracting information from websites. In this article we will see how it works by building a simple web scraper that uses the DOM parsing technique. The tool I am using is Node.js.

Before we proceed, I want you to be aware of the following concepts.

Serialization and Deserialization

Serialization is the process of converting an object into a stream of bytes in order to store the object or transmit it to memory, a database, or a file. Its main purpose is to save the state of an object in order to be able to recreate it when needed. The reverse process is called Deserialization.

In our case, the scraped data is serialized into a format that can be sent over the web, and whoever receives it uses deserialization to get the data back.

JSON

JavaScript Object Notation, or JSON, is a syntax for storing and exchanging data and is an easier-to-use alternative to XML. JSON is a language-independent, lightweight data-interchange format.

We are going to use JSON in our process. Our data will be in JSON format.
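To make serialization and deserialization concrete, here is a minimal sketch using the JSON functions built into JavaScript. The item object and its field names are made up for illustration and are not taken from any real site.

```javascript
// A hypothetical scraped item; the field names are illustrative only.
const item = { title: 'Example headline', image: 'https://example.com/a.jpg' };

// Serialization: object -> JSON text that can be stored or sent over HTTP.
const serialized = JSON.stringify(item);

// Deserialization: JSON text -> a new, equivalent object.
const restored = JSON.parse(serialized);

console.log(serialized);
console.log(restored.title);
```

The restored object has the same contents as the original but is a separate copy, which is what recreating the state of an object means in practice.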

Node.js

An open-source, cross-platform runtime environment for developing server-side web applications. Node.js will be our tool during the scraping process.

Request and Cheerio

Request and Cheerio are our npm packages. Cheerio doesn’t try to emulate a full implementation of the DOM. It specifically focuses on the scenario where you want to manipulate an HTML document using jQuery-like syntax. As such, it compares to jsdom favorably in some cases, but not in every situation.

Cheerio itself doesn’t include a mechanism for making HTTP requests, and that’s something that can be tedious to handle manually. It’s a bit easier to use a module called request to facilitate requesting remote HTML documents. Request handles common tasks like caching cookies between multiple requests, setting the content length on POSTs, and generally makes life easier.

If you don’t understand any of the above concepts, simply ignore them and let us create a scraper from here.

Set up IDE

I am using the following:

Windows 10 x64.
Visual Studio 2015 (Community).
Visit Node.js and download the installer that matches your system.

After you have Node.js installed, open Visual Studio 2015 and create a new project.

Select Template

Now it's time to select your template.

Select Node.js
Select Basic Azure Node.js Express 4.
Name it, for instance, MyScrapper

Install NPM Package

Now install your npm packages, as shown in the image.

After the package list has loaded, search for request and cheerio and install each of them.

Uninstall Jade

When you are done, uninstall Jade.

Changes in app.js

Go to app.js.
Comment out the view setup as shown in the image, since we are not displaying any views.

Before

After

When you are done, make some further changes as shown in the image.

Before

After

Request and Cheerio

Go to Routes(node).
Select users.js.
Require request and cheerio at the top of the file, as shown in the image.

Website URL

Select the website you want to scrape and save its URL in a variable, as shown in the image. For instance, I chose bbc.com.

Edit Function

Simply edit your router.get function as shown in the image. The existing router.get function is shown in the preceding image, and you can edit it by writing the code shown in the following image.

DOM Parsing

Programs can retrieve the dynamic content generated by client-side scripts by embedding a web browser. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages.

The DOM is a language-independent, cross-platform convention for representing and interacting with objects in HTML, XHTML, and XML documents.

Open the website you want to scrape in a browser.
For instance, I opened bbc.com in Google Chrome.
Click Inspect.
The image is there to help you.

Code Function

Code the function now. As you can see in the above image, we are traversing the DOM. I have selected the data shown in the red circled region, and the inspect window gives me the corresponding DOM node, which you can then write code for.

scrapeDataFromHtml is our function. In it we create a variable for every item that we want to scrape from the website. The scraped data is then serialized in JSON format, and we get it back once deserialization is done. In this case, the red circled region gives me its corresponding node in the inspect window.

First, we requested the URL.
Then we traversed the DOM.
Select the nodes that hold the data we want to scrape.
Create your function, for instance scrapeDataFromHtml.
In this function, store all the data you want to scrape from the website in variables.
Write your logic; for multiple values you can use an array.
span and image are the two things we want to scrape.

Run Application