First, let us understand what web scraping is.
Web scraping (also known as screen scraping, web data extraction, web harvesting, etc.) is the process of extracting data from web pages, HTML documents, or XML documents in a structured manner and then storing it in a database, a text file, a spreadsheet, an XML file, or whatever other format we want.
Python Libraries we will be using here -
- requests
- BeautifulSoup
Let us look briefly at these two libraries.
Requests
Requests is an Apache2-licensed HTTP library written in Python. It allows us to send HTTP/1.1 requests and to attach content such as headers, form data, multipart files, and parameters to them. It also lets us access the response data directly from Python.
To learn more, look up the official Requests documentation.
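To make the description above concrete, here is a minimal sketch of attaching headers and query parameters to a GET request. It only prepares the request and does not send it, so nothing touches the network. The URL, the q parameter, and the User-Agent value are placeholders chosen for illustration, not anything a real site requires.

```python
import requests

# Placeholder URL, parameter, and header values, used only for illustration.
request = requests.Request(
    'GET',
    'http://www.example.com/search',
    params={'q': 'python'},
    headers={'User-Agent': 'web-scrap-demo'},
)

# prepare() builds the final request (URL, headers, body) without sending it.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers['User-Agent'])
```

To actually send such a request, requests.get accepts the same params and headers arguments.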
BeautifulSoup
In short, BeautifulSoup is a library written in Python for extracting data from HTML and XML files. It is the library most commonly used when scraping data from HTML and XML.
To learn more, look up the official BeautifulSoup documentation.
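As a quick illustration of what "extracting data from HTML" looks like, the sketch below parses a tiny, made-up HTML string (not taken from any real site) and pulls out text by tag name and class.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
html = """
<html>
  <body>
    <h1 class="heading">Hello</h1>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, find_all() returns every match.
heading = soup.find('h1', attrs={'class': 'heading'}).get_text()
paragraphs = [p.get_text() for p in soup.find_all('p')]

print(heading)
print(paragraphs)
```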
Commands to install these two libraries.
$ pip install requests==2.18.4
$ pip install beautifulsoup4==4.6.0
Now, let us name the Python web scraping program.
Program Name - web_scrap.py
First, let us see what the web_scrap.py program will do. We will create this Python program to scrape recent posts from the C# Corner website, i.e., http://www.c-sharpcorner.com.
Full code snippet to scrape recent posts from c-sharpcorner:
import requests
from bs4 import BeautifulSoup

# Fetch the C# Corner home page
result = requests.get('http://www.c-sharpcorner.com/')
print(result.status_code)  # 200 means the request succeeded
print(result.headers)      # response headers sent by the server

# Parse the raw HTML of the page
csharpcorner = result.content
scrap = BeautifulSoup(csharpcorner, 'html.parser')

# Each recent post sits inside a div with the class "post"
post = scrap.find_all("div", attrs={'class': 'post'})
for posts in post:
    # The post details live in divs with the class "media-body"
    postBody = posts.find_all('div', attrs={'class': 'media-body'})
    for i in range(len(postBody)):
        print(postBody[i].find('a', attrs={'class': 'title'}).string.strip())
        print('Category: ', postBody[i].find('a', '').string.strip())
        print('By: ', postBody[i].find('a', attrs={'class': 'author'}).string.strip())
        print('Date and Time: ', postBody[i].find('span', 'hidden-xs').string.strip())
        print('----------------------------------------------------------------------------------------------')
Now, let’s understand the code. It is pretty simple to follow if we break it into parts and run each part on its own. Large programs can be understood the same way: break the code into parts and try running each one.
Let us start.
- To use a library, we first need to import it. That’s why we start by importing requests and BeautifulSoup.
- We make a GET request to the C-SharpCorner website with the help of the Python requests library and store the response in the variable result.
- Using the Python print function, we check the response code to see whether we got a response from the website. Response code 200 means OK: we got a response from c-sharpcorner.com.
- Using the print function again, we print out the response headers.
What is a response header?
A response header is information, in the form of a text record, that a web server sends back to the client in response to an HTTP request. The response headers contain details such as the date, size, and type of the content that the server is sending back, along with data about the server itself. The headers are sent together with the content returned to the client.
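In Requests, result.headers behaves like a dictionary whose keys are case-insensitive. The sketch below builds such a dictionary by hand so it can run without a network connection. The header values are made up for illustration, and a real response will carry different ones.

```python
from requests.structures import CaseInsensitiveDict

# Made-up header values for illustration; a real response will differ.
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=utf-8',
    'Content-Length': '1024',
    'Server': 'ExampleServer',
})

# Header names are case-insensitive, so both lookups return the same value.
content_type = headers['content-type']
same_value = headers['Content-Type']

# get() returns None when a header is absent.
date = headers.get('Date')

print(content_type)
print(date)
```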
- We create a variable named csharpcorner and store the raw content of the page in it. You can check this by printing the csharpcorner variable.
- We create a variable named scrap, which holds the parsed content of csharpcorner. We pass 'html.parser' because BeautifulSoup can parse both HTML and XML, and here we want the HTML parser.
- We create a variable named post and use the BeautifulSoup function find_all to find every div on the page that has the class name post. The recent posts reside inside these divs.
- Our variable post now stores a list of the recent post elements from c-sharpcorner. You can see this by printing the variable post.
- We write a for loop over the recent posts so we can print them, but we don’t know in advance how many recent posts are on the page.
- To handle this, inside the loop we create a variable named postBody, which targets the divs with the class name media-body (these hold the recent post details such as title, author, category, and date and time).
- We then write another for loop over the length of postBody and target the class, id, or HTML tag where our desired data lives. In this way we print out all the recent posts from c-sharpcorner.
- This is how we do our web scraping in a nested loop.
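The steps above can be tried without hitting the live site. The sketch below runs the same nested-loop idea over a small, hypothetical HTML string that mimics the structure described: the real page markup may differ, the "category" class is an assumption, and the helper names text_of and extract_posts are made up for this example. It also returns None for a missing tag instead of failing, since calling .string.strip() on a tag that was not found raises an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above.
# The real page may differ, and the "category" class is an assumption.
SAMPLE_HTML = """
<div class="post">
  <div class="media-body">
    <a class="title" href="#"> Intro to Web Scraping </a>
    <a class="category" href="#">Python</a>
    <a class="author" href="#">Jane Doe</a>
    <span class="hidden-xs">Jan 01, 2018</span>
  </div>
</div>
"""


def text_of(parent, name, class_name):
    """Return the stripped text of the first matching tag, or None."""
    tag = parent.find(name, attrs={'class': class_name})
    return tag.get_text().strip() if tag is not None else None


def extract_posts(html):
    """Collect one dict per media-body block found inside a post div."""
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    for post in soup.find_all('div', attrs={'class': 'post'}):
        for body in post.find_all('div', attrs={'class': 'media-body'}):
            posts.append({
                'title': text_of(body, 'a', 'title'),
                'category': text_of(body, 'a', 'category'),
                'author': text_of(body, 'a', 'author'),
                'date': text_of(body, 'span', 'hidden-xs'),
            })
    return posts


for item in extract_posts(SAMPLE_HTML):
    print(item)
```

Returning a list of dictionaries rather than printing inside the loop keeps the extraction separate from the output, which makes the result easier to reuse.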
Now, let us run the program using the command:
$ python web_scrap.py
Hope you guys like this tutorial.
In this way, we can target an HTML tag, class, or id and scrape any part of the data from an HTML web page.
You can also store the scraped data in a file instead of printing it out in the terminal.
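As a rough starting point, one way to do this is with Python's built-in csv module. The sketch below is only an illustration: the rows are hypothetical sample data, the write_posts helper is made up for this example, and it writes to an in-memory buffer so that running it creates no files.

```python
import csv
import io

# Hypothetical scraped rows, shaped like the fields printed by the program.
rows = [
    {'title': 'Intro to Web Scraping', 'author': 'Jane Doe'},
    {'title': 'Parsing HTML', 'author': 'John Roe'},
]


def write_posts(file_obj, posts):
    """Write the posts as CSV with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=['title', 'author'])
    writer.writeheader()
    writer.writerows(posts)


# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_posts(buffer, rows)
print(buffer.getvalue())
```

To write a real file, pass a file object instead of the buffer, for example one opened with open('posts.csv', 'w', newline='', encoding='utf-8'), where posts.csv is just an example file name.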
Want to know more about storing data in a file using Python? Comment down below.
Thank You!!