HtmlAgilityPack - DOM and XPath over HTML

| 1 Comment | No TrackBacks

I saw today Josh Christie post about "Better HTML parsing and validation with HtmlAgilityPack".

HtmlAgilityPack is an open source project on CodePlex.  It provides standard DOM APIs and XPath navigation -- even when the HTML is not well-formed!

Well, DOM and XPath over malformed HTML isn't new idea. I've been using XPath when screenscraping HTML for years - it seems to me way more reliable method that regular expressions. All you need in .NET is to read HTML as XML using wonderful SgmlReader from Chris Lovett. SgmlReader is an XmlReader API over any SGML document such as HTML.

But what I don't get is why would anyone (but browser vendors) want to implement DOM and XPath over HTML as is? Reimplementing not-so-simple XML specs over malformed source instead of making it wellformed and using standard API? May be I'm not agile anough but I don't think that's a good idea. I prefer standard proven XML API.

Here is Josh's sample that validates that Microsoft's home page lists Windows as the first item in the navigation sidebar implemented using SgmlReader:

SgmlReader r = new SgmlReader();
r.Href = "http://www.microsoft.com";                        
XmlDocument doc = new XmlDocument();
doc.Load(r);                
//pick the first <li> element in navigation section
XmlNode firstNavItemNode = 
  doc.SelectSingleNode("//div[@id='Nav']//li");
//validate the first list item in the Nav element says "Windows"        
Debug.Assert(firstNavItemNode.InnerText == "Windows"); 
I stay with SgmlReader.

Related Blog Posts

No TrackBacks

TrackBack URL: http://www.tkachenko.com/cgi-bin/mt-tb.cgi/649

1 Comment

There was a time (The html agility pack was born in 2002) when SgmlReader was not capable of reading "real world" HTML (read: buggy, not following SGML standards, incomplete, etc...), because it was just not designed for this.

I don't know if it is the case know.

As for the "why XPATH directyl on html" question, the answer is simple: the implementation of it avoids XML (or XHTML or any other) conversion. Its faster & consumes less memory, if you're only interested by HTML and not by standards, which is why the html agility pack was designed.

Leave a comment