December 12, 2006

HtmlAgilityPack - DOM and XPath over HTML

Today I saw Josh Christie's post, "Better HTML parsing and validation with HtmlAgilityPack".

HtmlAgilityPack is an open source project on CodePlex.  It provides standard DOM APIs and XPath navigation -- even when the HTML is not well-formed!

Well, DOM and XPath over malformed HTML isn't a new idea. I've been using XPath when screen-scraping HTML for years - it seems to me a far more reliable method than regular expressions. All you need in .NET is to read HTML as XML using the wonderful SgmlReader from Chris Lovett. SgmlReader is an XmlReader API over any SGML document such as HTML.
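To make that concrete, here is a minimal sketch of reading a deliberately malformed HTML string as XML and querying it with XPath. The markup, the class name and the XPath expression are made up for illustration. It assumes the SgmlReader assembly (namespace Sgml) is referenced and that its DocType, CaseFolding and InputStream properties behave as I remember them, so treat it as a starting point rather than a reference.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // SgmlReader's namespace (assumes the SgmlReader assembly is referenced)

class ScrapeSketch
{
    static void Main()
    {
        // Hypothetical input: unquoted attribute value and unclosed <li> elements.
        string html =
            "<html><body><div id=Nav><ul><li>Windows<li>Office</ul></div></body></html>";

        SgmlReader r = new SgmlReader();
        r.DocType = "HTML";                     // parse against the built-in HTML DTD
        r.CaseFolding = CaseFolding.ToLower;    // fold element names to lower case for XPath
        r.InputStream = new StringReader(html); // read from a string instead of a URL

        // From here on it is the standard XML API, nothing HTML-specific.
        XmlDocument doc = new XmlDocument();
        doc.Load(r);

        XmlNode first = doc.SelectSingleNode("//div[@id='Nav']//li");
        Console.WriteLine(first != null ? first.InnerText : "(no match)");
    }
}
```

The point of the design is that SgmlReader is the only HTML-aware piece; XmlDocument and XPath are the stock System.Xml classes.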

But what I don't get is why anyone (other than browser vendors) would want to implement DOM and XPath over HTML as is. Reimplementing not-so-simple XML specs over a malformed source instead of making it well-formed and using the standard API? Maybe I'm not agile enough, but I don't think that's a good idea. I prefer a standard, proven XML API.

Here is Josh's sample, reimplemented using SgmlReader. It validates that Microsoft's home page lists Windows as the first item in the navigation sidebar:

// needs: using Sgml; using System.Xml; using System.Diagnostics;
SgmlReader r = new SgmlReader();
r.Href = "http://www.microsoft.com";
XmlDocument doc = new XmlDocument();
doc.Load(r);
// pick the first <li> element in the navigation section
XmlNode firstNavItemNode =
  doc.SelectSingleNode("//div[@id='Nav']//li");
// validate that the first list item in the Nav element says "Windows"
// (the null check avoids a NullReferenceException if the XPath matches nothing)
Debug.Assert(firstNavItemNode != null &&
  firstNavItemNode.InnerText == "Windows");

I'm staying with SgmlReader. ...