read HTML tag text

darma teja

11y
Hi All,

I would like to get the following output from .htm or .html files:

1. MY WEBSITE (the text between the TITLE tags)
2. left (the align value of the "p" tag)
3. boat.gif (the src value of the "img" tag)


<!DOCTYPE html>
<html>
<HEAD>
<TITLE>MY WEBSITE</TITLE>
</HEAD>
<body>
<h1>My First Heading</h1>
<p align="left">My first paragraph.</p>
<img src="boat.gif" alt="Big Boat">
</body>
</html>

Thanks a lot in advance,

Darma
Answers (8)
0
Vulpes

NA 98.3k 1.5m 11y
If you have multiple <img> tags, you can do:

 string regExp3 = @"<img src=""(.*?)"""; // as before
 MatchCollection mc = Regex.Matches(html, regExp3);
 foreach(Match m in mc) Console.WriteLine(m.Groups[1].Value);



Accepted
0
darma teja

NA 493 194.2k 11y
Hi Vulpes,

If I have more than one image, how can I get all the images with their paths?

Thanks,

Darma


0
Raj Bandi

NA 2.5k 288.6k 11y
Hi,

I'm not sure why it wasn't working for you. HtmlAgilityPack is well tested and fast.

Here is the code:


using System;
using HtmlAgilityPack; // third-party library; add a reference to it in your project

class Program
{
    static void Main(string[] args)
    {
        // Load the HTML file from disk.
        var doc = new HtmlDocument();
        doc.Load("e:\\temp\\dharma.html");

        string align = string.Empty, src = string.Empty, title = string.Empty;

        // Text between the title tags.
        var titleNode = doc.DocumentNode.SelectSingleNode("//title");
        if (titleNode != null)
        {
            title = titleNode.InnerText;
        }

        // align value of the first p tag that has one.
        var p = doc.DocumentNode.SelectSingleNode("//p[@align]");
        if (p != null)
        {
            align = p.Attributes["align"].Value;
        }

        // src value of the first img tag that has one.
        var img = doc.DocumentNode.SelectSingleNode("//img[@src]");
        if (img != null)
        {
            src = img.Attributes["src"].Value;
        }

        Console.WriteLine(title);
        Console.WriteLine(align);
        Console.WriteLine(src);
    }
}

0
darma teja

NA 493 194.2k 11y
Hi Vulpes and Raj,

Thanks for the replies.

@Vulpes: It is working perfectly.

@Raj: I tried HtmlAgilityPack first; unfortunately, it was not working.


Thanks,

Darma
0
Raj Bandi

NA 2.5k 288.6k 11y
Hmm,

Sorry, my bad, I misinterpreted the question. I thought this was on the client side in the browser, but you want to read the file in C#.

Use HtmlAgilityPack (http://htmlagilitypack.codeplex.com/) and add a reference to it in your project.

var doc = new HtmlDocument();
doc.Load("yourfile.htm");


Then use XPath queries, something like this:
var p = doc.DocumentNode.SelectNodes("//p[@align]");

0
Vulpes

NA 98.3k 1.5m 11y
Here's a different approach using regular expressions.

As I don't know what type of application you're writing, I've used a console application for illustration:

using System;
using System.IO;
using System.Text.RegularExpressions;

class Test
{
   static void Main()
   {
      string html = File.ReadAllText("darma.html");

      string regExp1 = "<TITLE>(.*?)</TITLE>";
      string title = Regex.Match(html, regExp1).Groups[1].Value;
      Console.WriteLine(title);

      string regExp2 = @"<p align=""(.*?)"">";
      string align = Regex.Match(html, regExp2).Groups[1].Value;
      Console.WriteLine(align);

      string regExp3 = @"<img src=""(.*?)""";
      string img = Regex.Match(html, regExp3).Groups[1].Value;
      Console.WriteLine(img);

      Console.ReadKey();
   }
}

The output, as expected, is:

MY WEBSITE
left
boat.gif


0
darma teja

NA 493 194.2k 11y
Hi Raj,

Thanks for the reply.

Which class should I use to load my HTML file from a file path?

Thanks

Darma
0
Raj Bandi

NA 2.5k 288.6k 11y

//MY WEBSITE

var title = document.title;  


// left. Note: without an id, take the first p element; with an id, use document.getElementById
var align = document.getElementsByTagName("p")[0].getAttribute("align");

//boat.gif
var src = document.getElementsByTagName("img")[0].getAttribute("src");


Call the scripts above once the DOM is ready, i.e. either call a function on body load or place the script at the end of the page, just before the closing </body> tag (after all the DOM elements have been processed).
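
As a rough sketch of the DOM-ready advice above, the three lookups could be wrapped in a function and run only after the page has been parsed. The function name readPageInfo and the shape of the object it returns are assumptions made for illustration, not part of the original answer; the sketch also collects the src of every img rather than only the first.

```javascript
// Hypothetical helper (name and return shape are assumptions): reads the
// title, the first p tag's align value, and every img src from a Document.
function readPageInfo(doc) {
  var p = doc.getElementsByTagName("p")[0];
  var imgs = doc.getElementsByTagName("img");
  var sources = [];
  for (var i = 0; i < imgs.length; i++) {
    sources.push(imgs[i].getAttribute("src"));
  }
  return {
    title: doc.title,
    align: p ? p.getAttribute("align") : null, // null when the page has no p tag
    sources: sources
  };
}

// Logs the values for the current page.
function showPageInfo() {
  var info = readPageInfo(document);
  console.log(info.title);
  console.log(info.align);
  console.log(info.sources.join(", "));
}

// Only runs in a browser: wait for the DOM if it is still being parsed,
// otherwise read it straight away.
if (typeof document !== "undefined") {
  if (document.readyState === "loading") {
    document.addEventListener("DOMContentLoaded", showPageInfo);
  } else {
    showPageInfo();
  }
}
```

Checking document.readyState first means the script should behave the same whether it is placed in the head or at the end of the body.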

Hope this helps,

Cheers,
Raj