read HTML tag text

darma teja

11y
Hi All,

I would like to get the following output from htm or html files:

1. MY WEBSITE (the text between the TITLE tags)
2. left (the value of the align attribute of the "p" tag)
3. boat.gif (the value of the src attribute of the "img" tag)


<!DOCTYPE html>
<html>
<HEAD>
<TITLE>MY WEBSITE</TITLE>
</HEAD>
<body>
<h1>My First Heading</h1>
<p align="left">My first paragraph.</p>
<img src="boat.gif" alt="Big Boat">
</body>
</html>

Thanks a lot in advance

Darma
Answers (8)
0
Vulpes
NA 98.3k 1.5m 11y
If you have multiple <img> tags, you can do:

 string regExp3 = @"<img src=""(.*?)"""; // as before
 MatchCollection mc = Regex.Matches(html, regExp3);
 foreach(Match m in mc) Console.WriteLine(m.Groups[1].Value);
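
One caveat with this pattern: it only matches when src is the first attribute, written in lowercase, with double quotes. Here is a rough sketch of a more tolerant version. The names ImageSources and GetImageSources are made up for illustration, and a regex like this is still only an approximation of real HTML parsing (for example, it would also pick up an attribute such as data-src).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ImageSources
{
    // Hypothetical helper (not from the thread): returns the src value of every <img> tag.
    // Allows other attributes before src, single or double quotes, and any letter case.
    public static List<string> GetImageSources(string html)
    {
        string pattern = @"<img\b[^>]*?\bsrc\s*=\s*[""']([^""']*)[""']";
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    static void Main()
    {
        // Sample markup invented for this example
        string html = "<img src=\"boat.gif\" alt=\"Big Boat\"><IMG alt='Car' SRC='car.png'>";
        foreach (string src in GetImageSources(html))
        {
            Console.WriteLine(src);
        }
    }
}
```

The helper returns a list rather than printing directly, so the caller can decide what to do with the paths.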



Accepted
0
darma teja
NA 493 194.2k 11y
Hi Vulpes,

If I have more than one image, how can I get all the images with their paths?

Thanks,

Darma


0
Raj Bandi
NA 2.5k 288.7k 11y
Hi,

I'm not sure why it wasn't working for you. HtmlAgilityPack is well tested and fast.

Here is the code 


using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load("e:\\temp\\dharma.html");

        string align = string.Empty, src = string.Empty, title = string.Empty;

        // Text between the <title> tags
        var titleNode = doc.DocumentNode.SelectSingleNode("//title");
        if (titleNode != null)
        {
            title = titleNode.InnerText;
        }

        // align attribute of the first <p> that has one
        var p = doc.DocumentNode.SelectSingleNode("//p[@align]");
        if (p != null)
        {
            align = p.Attributes["align"].Value;
        }

        // src attribute of the first <img> that has one
        var img = doc.DocumentNode.SelectSingleNode("//img[@src]");
        if (img != null)
        {
            src = img.Attributes["src"].Value;
        }

        Console.WriteLine(title);
        Console.WriteLine(align);
        Console.WriteLine(src);
    }
}
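
For the case with more than one image, the same idea should extend to SelectNodes, which returns every matching node instead of just the first. A minimal sketch, assuming the HtmlAgilityPack package is referenced; the names HapImages and GetImageSources and the sample markup are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, as in the code above

static class HapImages
{
    // Hypothetical helper (not from the thread): src of every <img> that has a src attribute.
    public static List<string> GetImageSources(HtmlDocument doc)
    {
        var result = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode img in nodes)
        {
            result.Add(img.GetAttributeValue("src", string.Empty));
        }
        return result;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // LoadHtml parses a string; use doc.Load(path) for a file on disk
        doc.LoadHtml("<body><img src=\"boat.gif\"><img src=\"car.png\"></body>");
        foreach (string src in GetImageSources(doc))
        {
            Console.WriteLine(src);
        }
    }
}
```

The null check matters because SelectNodes gives back null, not an empty collection, when the page has no matching elements.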

0
darma teja
NA 493 194.2k 11y
Hi Vulpes and Raj,

Thanks for the replies.

@Vulpes: It is working perfectly.

@Raj: I tried "HtmlAgilityPack" first, but unfortunately it was not working.


Thanks,

Darma
0
Raj Bandi
NA 2.5k 288.7k 11y
Hmm,

Sorry, my mistake. I misread the question and thought you wanted to do this on the client side, in the browser. So you want to read this in C#.

Use HtmlAgilityPack (http://htmlagilitypack.codeplex.com/) and add a reference to it in your project.

var doc = new HtmlDocument();
doc.Load("yourfile.htm");


Use XPath queries, something like this:
var p = doc.DocumentNode.SelectNodes("//p[@align]");

0
Vulpes
NA 98.3k 1.5m 11y
Here's a different approach using regular expressions.

As I don't know what type of application you're writing, I've used a console application for illustration:

using System;
using System.IO;
using System.Text.RegularExpressions;

class Test
{
   static void Main()
   {
      string html = File.ReadAllText("darma.html");

      string regExp1 = "<TITLE>(.*?)</TITLE>";
      string title = Regex.Match(html, regExp1).Groups[1].Value;
      Console.WriteLine(title);

      string regExp2 = @"<p align=""(.*?)"">";
      string align = Regex.Match(html, regExp2).Groups[1].Value;
      Console.WriteLine(align);

      string regExp3 = @"<img src=""(.*?)""";
      string img = Regex.Match(html, regExp3).Groups[1].Value;
      Console.WriteLine(img);

      Console.ReadKey();
   }
}

The output, as expected, is:

MY WEBSITE
left
boat.gif
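
Note that the patterns above are case-sensitive and that "." does not match line breaks by default, so a lowercase <title> tag or a title spread over several lines would produce an empty result. A small sketch of one way to loosen that, assuming the hypothetical names TitleReader and GetTitle (they are not part of the program above):

```csharp
using System;
using System.Text.RegularExpressions;

static class TitleReader
{
    // Hypothetical helper: text between the first <title>...</title> pair, or "" if there is none.
    public static string GetTitle(string html)
    {
        // IgnoreCase accepts <title>, <TITLE>, <Title>; Singleline lets "." match newlines too
        Match m = Regex.Match(html, @"<title\b[^>]*>(.*?)</title\s*>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    static void Main()
    {
        // Sample markup invented for this example
        Console.WriteLine(GetTitle("<head><title>\n  MY WEBSITE\n</title></head>"));
    }
}
```

The same two options can be passed to the align and src patterns if the input files are not consistently formatted.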


0
darma teja
NA 493 194.2k 11y
Hi Raj,

Thanks for the reply.

Which class should I use to load my HTML file from a file path?

Thanks

Darma
0
Raj Bandi
NA 2.5k 288.7k 11y

//MY WEBSITE

var title = document.title;  


//left. Note: without an id, this takes the first p element; if the element has an id, use document.getElementById instead
var align = document.getElementsByTagName("p")[0].getAttribute("align");

//boat.gif
var src = document.getElementsByTagName("img")[0].getAttribute("src");


Run the scripts above once the DOM is ready, i.e. either call a function on body load or place the script at the end of the page, just before the closing </body> tag (after all the DOM elements have been parsed).

Hope this helps,

Cheers,
Raj