June 2004 Archives

Daniel implemented the SubtreeXPathNavigator I was talking about. That's way cool stuff; I really like it. Now I'm not sure about XmlNodeNavigator - do we still need it in the Mvp.Xml library, or should we remove it so as not to confuse users with two forms of the same navigator?

I feel a bit guilty about the Mvp.Xml project and the June plans I announced. Sorry, I just did nothing - I've been way too busy with some other unexpected stuff I had to finish first.

Microsoft Research RSS Feeds


Here is another interesting puzzle to solve - how would you validate an XML document that has no DOCTYPE declaration against a DTD?
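One possible approach (a sketch only, not necessarily the intended answer to the puzzle): supply the missing DOCTYPE via an XmlParserContext, so that XmlValidatingReader picks up the DTD even though the document itself never declares it. The file names and the root element name below are made up for the example.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public class DoctypelessValidation
{
    public static void Main()
    {
        // Pretend "book.xml" has no DOCTYPE but must be valid per "book.dtd".
        // The parser context plays the role of the missing declaration:
        // <!DOCTYPE book SYSTEM "book.dtd">
        XmlParserContext context = new XmlParserContext(
            null, null,
            "book",            // document type (root element) name
            null, "book.dtd",  // public and system identifiers
            null,              // no internal subset
            "", "", XmlSpace.Default);

        string xml;
        using (StreamReader sr = new StreamReader("book.xml"))
            xml = sr.ReadToEnd();

        XmlValidatingReader reader = new XmlValidatingReader(
            xml, XmlNodeType.Document, context);
        reader.ValidationType = ValidationType.DTD;
        reader.ValidationEventHandler +=
            new ValidationEventHandler(OnValidationError);

        // Validation errors are reported through the handler as we read.
        while (reader.Read()) {}
    }

    static void OnValidationError(object sender, ValidationEventArgs e)
    {
        Console.WriteLine("Validation error: " + e.Message);
    }
}
```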

Non-Extractive XML Parsing


Well, I'm working on decreasing the size of the "Items for Read" folder in RSS Bandit. There's still a lot to catch up on, but anyway. XML.com has published "Non-Extractive Parsing for XML", an article by Jimmy Zhang. In it Jimmy proposes another approach to XML parsing - a "non-extractive" style of tokenization. In simple words, his idea is to treat an XML document as a static sequence of characters, in which each XML token can be identified by an offset:length pair of integers. That would open up lots of new possibilities, such as updating part of an XML document without serializing the unchanged content (copying only the leading and trailing text buffers instead), fast addressing by offset instead of by IDs or XPath, and creating a binary index for a document (a "parse once, use many times" approach).
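To make the offset:length idea concrete, here is a toy sketch of non-extractive tokenization (far simpler than real XML parsing - it ignores entities, CDATA, comments, attribute values containing '>' and everything else that makes XML hard). Tokens are just (offset, length) pairs into the original buffer, and no substring is extracted until a token's value is actually requested:

```csharp
using System;
using System.Collections;

// A token is a slice of the original buffer, not a copied string.
struct Token
{
    public int Offset, Length;
    public bool IsTag;
    public Token(int offset, int length, bool isTag)
    { Offset = offset; Length = length; IsTag = isTag; }
}

class NonExtractiveDemo
{
    static ArrayList Tokenize(string xml)
    {
        ArrayList tokens = new ArrayList();
        int pos = 0;
        while (pos < xml.Length)
        {
            if (xml[pos] == '<')
            {
                // Record the tag's position - don't extract its text.
                int end = xml.IndexOf('>', pos);
                tokens.Add(new Token(pos, end - pos + 1, true));
                pos = end + 1;
            }
            else
            {
                // Character data up to the next tag.
                int end = xml.IndexOf('<', pos);
                if (end == -1) end = xml.Length;
                tokens.Add(new Token(pos, end - pos, false));
                pos = end;
            }
        }
        return tokens;
    }

    static void Main()
    {
        string xml = "<root><item>42</item></root>";
        foreach (Token t in Tokenize(xml))
            // Extraction happens only here, on demand:
            Console.WriteLine("{0,2} +{1,2}  {2}",
                t.Offset, t.Length, xml.Substring(t.Offset, t.Length));
    }
}
```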

While it sounds interesting (and not really new - it's a sort of remake of the idea of parsing XML with regexps), there are lots of problems with "non-extractive" parsing. XML in general doesn't fit well into that paradigm. Entities and inclusions, encoding issues, comments, CDATA and default values in DTDs all screw up the idea. Unfortunately that happens with optimization techniques quite often - they tend to oversimplify the problem. It will probably work only with a very limited subset of XML, and its fruitfulness still needs to be proven.

Another shortcoming of "non-extractive" parsing is the necessity of keeping the entire source XML document accessible (obviously offsets are meaningless with no source buffer at hand). That would mean buffering the whole (possibly huge) XML document in a streaming scenario (e.g. when you read XML from a network stream).

Still, that was interesting reading. Indexing an XML document - how does that sound? Using IndexingXPathNavigator it's possible to index an in-memory IXPathNavigable XML store and select nodes directly by key values instead of traversing the tree. That works, but there is still lots of room for development here. What about persistent indexes? What if XslTransform were able to leverage existing indexes instead of building its own (for xsl:key) on each transformation?
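The pattern behind key-based indexing can be sketched by hand (this is not the IndexingXPathNavigator API, just the underlying "parse once, index once, then look up by key" idea, with a made-up sample document):

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml.XPath;

class KeyIndexDemo
{
    static void Main()
    {
        XPathDocument doc = new XPathDocument(new StringReader(
            "<orders><order id='1' total='10'/><order id='2' total='99'/></orders>"));
        XPathNavigator nav = doc.CreateNavigator();

        // Build the index once: key value -> navigator positioned on the node.
        Hashtable index = new Hashtable();
        XPathNodeIterator orders = nav.Select("/orders/order");
        while (orders.MoveNext())
            index[orders.Current.GetAttribute("id", "")] = orders.Current.Clone();

        // Subsequent lookups are hashtable hits instead of tree traversals:
        XPathNavigator hit = (XPathNavigator)index["2"];
        Console.WriteLine(hit.GetAttribute("total", ""));  // prints "99"
    }
}
```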

Say you've got a DataSet and you want to save its data as XML. The DataSet.WriteXml() method does it perfectly well. But what if you need the saved XML to carry a reference to an XSLT stylesheet - an xml-stylesheet processing instruction such as <?xml-stylesheet type="text/xsl" href="foo.xsl"?>? Of course you can load the saved XML into an XmlDocument, add the PI and then save it back, but don't you remember Jan Gray's Performance Pledge we took:

"I promise I will not ship slow code. Speed is a feature I care about. Every day I will pay attention to the performance of my code. I will regularly and methodically measure its speed and size. I will learn, build, or buy the tools I need to do this. It's my responsibility."
Come on, forget about XmlDocument. Think about perf and don't be too lazy to look for a performance-oriented solution in the first place. Here is one simple streaming solution to the problem - a small customized XmlWriter that adds the PI on the fly as the XML is being written.
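A minimal sketch of the trick (class name and stylesheet href are made up for the example): subclass XmlTextWriter and inject the xml-stylesheet processing instruction right before the first element is written.

```csharp
using System;
using System.IO;
using System.Xml;

public class XmlStylesheetWriter : XmlTextWriter
{
    private string href;
    private bool piWritten = false;

    public XmlStylesheetWriter(TextWriter w, string href) : base(w)
    {
        this.href = href;
    }

    public override void WriteStartElement(string prefix,
        string localName, string ns)
    {
        if (!piWritten)
        {
            // Emit the PI once, just before the document element.
            WriteProcessingInstruction("xml-stylesheet",
                String.Format("type=\"text/xsl\" href=\"{0}\"", href));
            piWritten = true;
        }
        base.WriteStartElement(prefix, localName, ns);
    }
}
```

With such a writer the PI costs nothing extra - the DataSet streams straight through it, e.g. dataSet.WriteXml(new XmlStylesheetWriter(Console.Out, "foo.xsl")).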

RSS is making its way. TechNet's security team has announced the first version of an RSS feed for its security bulletins: Microsoft Security Bulletin RSS Feed.

Reading the wonderful "Chapter 9 - Improving XML Performance":

Split Complex Transformations into Several Stages
You can incrementally transform an XML document by using multiple XSLT style sheets to generate the final required output. This process is referred to as pipelining and is particularly beneficial for complex transformations over large XML documents.

More Information

For more information about how to split complex transformations into several stages, see Microsoft Knowledge Base article 320847, "HOW TO: Pipeline XSLT Transformations in .NET Applications," at http://support.microsoft.com/default.aspx?scid=kb;en-us;320847.
Sounds great, but the referenced KB article 320847 is actually a huge fly in the ointment - it still suggests using a temporary MemoryStream to pipeline XSL transformations! What crap - didn't they hear about new XPathDocument(xslt.Transform(doc, args))? I reported that glitch more than a year ago and was told they were working on fixing it. Still not fixed. OK, probably it's time to use my MVP connections to get it fixed at last.
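For the record, here is what MemoryStream-free pipelining looks like, using the XslTransform.Transform overload that returns an XmlReader over the intermediate result (.NET 1.1 API; the stylesheet and document names are made up):

```csharp
using System;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

class PipelineDemo
{
    static void Main()
    {
        XslTransform stage1 = new XslTransform();
        stage1.Load("stage1.xsl");
        XslTransform stage2 = new XslTransform();
        stage2.Load("stage2.xsl");

        XPathDocument doc = new XPathDocument("source.xml");

        // Transform() returns an XmlReader over the intermediate result,
        // which feeds the next XPathDocument directly - no byte buffer.
        XPathDocument intermediate = new XPathDocument(
            stage1.Transform(doc, null, new XmlUrlResolver()));

        stage2.Transform(intermediate, null, Console.Out,
            new XmlUrlResolver());
    }
}
```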

Another one I've stumbled upon:

XPathNavigator. The XPathNavigator abstract base class provides random, read-only access to data through XPath queries over any data store. Data stores include the XmlDocument, DataSet, and XmlDataDocument classes.
Somehow "the XML data store" XPathDocument is forgotten once again :(

I'm back


So I'm back. That was a crazy trip: Tel-Aviv-Prague-Berlin-Amsterdam-Paris-Bavaria-Prague-Tel-Aviv. Bad weather was chasing us, but fortunately it was mostly warm enough even for us sun-accustomed Israelis.

A mailbox overflow did happen, and all incoming mail was bounced during 06/01-06/03. If you tried to send me something in those days, you may want to resend it. I've just started cleaning the mailbox (3000 unread / 70% spam). And RSS Bandit (600 unread / 0% spam :) is waiting for me too. I've already removed about 100 spam comments from the blog. Recovery is in progress...