Getting Started With HTML Agility Pack

This article shows how to get started with HTML Agility Pack and provides C# code samples that demonstrate web scraping with this package. For readers unfamiliar with it, HTML Agility Pack is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. In simple terms, it is a .NET code library that lets you parse "out of the web" files (whether they come from HTML, PHP, or ASPX pages).

Put simply, you can use this library to scrape web pages on the internet.

How to Get HTML Agility Pack in your application

You can add HTML Agility Pack to your application using NuGet. To install it in your project, run the following command in the Package Manager Console.

    Install-Package HtmlAgilityPack

After adding the reference via NuGet, add the following using directive to your source file.

    using HtmlAgilityPack;

Load a Page From the Internet

To load a page directly from the web, you can use the following code:

    HtmlWeb web = new HtmlWeb();
    HtmlDocument document = web.Load("http://www.c-sharpcorner.com");

After executing these two lines of code, the entire page at http://www.c-sharpcorner.com is held in the document object, an instance of the HtmlDocument class.

Load a Page from a Saved Document

We often need to load an HTML document from a file saved on the hard disk. To do that, write the following code.

    HtmlDocument document2 = new HtmlDocument();
    document2.Load(@"C:\Temp\sample.txt");

At this point, we have the entire HTML parsed and loaded in the document2 object.
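When the markup is already in memory as a string, it can be parsed directly without going through a file or a URL. The following is a minimal sketch that assumes the LoadHtml method of HtmlDocument; the class name LoadFromStringSketch and the helper FirstLinkText are hypothetical names used only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadFromStringSketch
{
    // Hypothetical helper: parses an HTML string and returns the text of
    // the first <a> element, or null when the markup contains no link.
    public static string FirstLinkText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // parse from a string instead of a file or URL

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode link = document.DocumentNode.SelectSingleNode("//a");
        return link?.InnerText;
    }

    public static void Main()
    {
        string html = "<html><body><a href=\"a3\">Link 3 outside all divs</a></body></html>";
        Console.WriteLine(FirstLinkText(html));
    }
}
```

This variant is also convenient in unit tests, because the input HTML can live next to the assertions.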

The sample.txt file used in the following examples contains this HTML.

    <html>
    <head>
    </head>
    <body>
        <div id="div1">
            <a href="div1-a1">Link 1 inside div1</a>
            <a href="div1-a2">Link 2 inside div1</a>
        </div>
        <a href="a3">Link 3 outside all divs</a>
        <div id="div2">
            <a href="div2-a1">Link 1 inside div2</a>
            <a href="div2-a2">Link 2 inside div2</a>
        </div>
    </body>
    </html>

Get all Hyperlinks in a page

Once we have the HTML document loaded, let us see how to get all the hyperlinks from the page.

    // ToArray() requires "using System.Linq;"
    HtmlDocument document2 = new HtmlDocument();
    document2.Load(@"C:\Temp\sample.txt");
    // "//a" matches every anchor element anywhere in the document
    HtmlNode[] nodes = document2.DocumentNode.SelectNodes("//a").ToArray();
    foreach (HtmlNode item in nodes)
    {
        Console.WriteLine(item.InnerHtml);
    }

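This prints the inner HTML of each of the five links in sample.txt:

    Link 1 inside div1
    Link 2 inside div1
    Link 3 outside all divs
    Link 1 inside div2
    Link 2 inside div2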
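SelectNodes returns null rather than an empty collection when no node matches, so calling ToArray() on its result throws on a page without links. The sketch below guards against that and also reads each link's href through GetAttributeValue. The class name LinkListingSketch and the helper DescribeLinks are hypothetical names chosen for this illustration, and the HTML is passed in as a string only to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkListingSketch
{
    // Hypothetical helper: returns "text -> href" for every anchor,
    // or an empty list when the document contains no anchors.
    public static List<string> DescribeLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a");
        if (anchors == null)
        {
            return result; // no <a> elements matched
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the default used when href is missing.
            string href = anchor.GetAttributeValue("href", "");
            result.Add(anchor.InnerText.Trim() + " -> " + href);
        }
        return result;
    }

    public static void Main()
    {
        string html = "<div><a href=\"div1-a1\">Link 1 inside div1</a></div>";
        foreach (string line in DescribeLinks(html))
        {
            Console.WriteLine(line);
        }
    }
}
```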

Select a specific div in a page

To get a specific div in a page, use the following code:

    // First() requires "using System.Linq;"
    HtmlDocument document2 = new HtmlDocument();
    document2.Load(@"C:\Temp\sample.txt");
    HtmlNode node = document2.DocumentNode.SelectNodes("//div[@id='div1']").First();

This code selects the div with the id "div1" from the page and returns it as an HtmlNode. You can then iterate over the ChildNodes property of the HtmlNode class to reach the child elements of that DOM element.
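The ChildNodes iteration mentioned above can be sketched as follows. ChildNodes also contains text nodes (such as the whitespace between tags), so the sketch keeps only nodes whose NodeType is HtmlNodeType.Element. It uses SelectSingleNode, which returns one node or null, in place of SelectNodes(...).First(). The names ChildNodesSketch and ChildElementNames are hypothetical, as is loading the HTML from a string.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ChildNodesSketch
{
    // Hypothetical helper: returns the tag names of the element children
    // of the div with id "div1", or an empty list when no such div exists.
    public static List<string> ChildElementNames(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var names = new List<string>();
        HtmlNode div = document.DocumentNode.SelectSingleNode("//div[@id='div1']");
        if (div == null)
        {
            return names;
        }

        foreach (HtmlNode child in div.ChildNodes)
        {
            // Skip text and comment nodes; keep element nodes only.
            if (child.NodeType == HtmlNodeType.Element)
            {
                names.Add(child.Name);
            }
        }
        return names;
    }

    public static void Main()
    {
        string html = "<div id=\"div1\"> <a href=\"div1-a1\">Link 1</a> <a href=\"div1-a2\">Link 2</a> </div>";
        Console.WriteLine(string.Join(", ", ChildElementNames(html)));
    }
}
```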

Select all Hyperlinks within a specific div

To select all hyperlinks within a specific div, we can use either of the following two approaches:

    HtmlDocument document2 = new HtmlDocument();
    document2.Load(@"C:\Temp\sample.txt");

    // Approach 1: select the div first, then search beneath it.
    // The leading "." makes the XPath relative to the selected node.
    HtmlNode node = document2.DocumentNode.SelectNodes("//div[@id='div1']").First();

    HtmlNode[] aNodes = node.SelectNodes(".//a").ToArray();

    // Approach 2: a single XPath expression from the document root.
    HtmlNode[] aNodes2 = document2.DocumentNode.SelectNodes("//div[@id='div1']//a").ToArray();

Both approaches select the same two anchors, the links inside div1. Printing the InnerHtml of each node gives the following output:

    Link 1 inside div1
    Link 2 inside div1

Filter hyperlinks for certain conditions

To filter nodes by a condition, you can use LINQ to query the nodes and return only the ones you need. For example, the following code returns all hyperlinks whose link text contains "div2".

    HtmlDocument document2 = new HtmlDocument();
    document2.Load(@"C:\Temp\sample.txt");

    // Keep only the anchors whose inner HTML contains "div2"
    HtmlNode[] nodes = document2.DocumentNode.SelectNodes("//a").Where(x => x.InnerHtml.Contains("div2")).ToArray();
    foreach (HtmlNode item in nodes)
    {
        Console.WriteLine(item.InnerHtml);
    }

The preceding code gives the following output:

    Link 1 inside div2
    Link 2 inside div2
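The same LINQ pattern works on attributes as well as on link text. The sketch below filters anchors by the start of their href value, which is one possible variation rather than part of the original sample. The names LinqFilterSketch and LinkTextsWithHrefPrefix are hypothetical, and the HTML is supplied as a string so that the example stands alone.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LinqFilterSketch
{
    // Hypothetical helper: returns the text of every anchor whose href
    // starts with the given prefix.
    public static string[] LinkTextsWithHrefPrefix(string html, string prefix)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a");
        if (anchors == null)
        {
            return new string[0]; // SelectNodes found no anchors
        }

        return anchors
            .Where(a => a.GetAttributeValue("href", "").StartsWith(prefix, StringComparison.Ordinal))
            .Select(a => a.InnerText)
            .ToArray();
    }

    public static void Main()
    {
        string html =
            "<a href=\"div1-a1\">Link 1 inside div1</a>" +
            "<a href=\"div2-a1\">Link 1 inside div2</a>";
        foreach (string text in LinkTextsWithHrefPrefix(html, "div2"))
        {
            Console.WriteLine(text);
        }
    }
}
```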

I hope this article gives you a head start with HTML Agility Pack. If you have any questions, please mention them in the comments section.
