Word 2003 lets you save a document as a "Single File Web Page" (*.mht, also known as a Web Archive file) or as an ordinary HTML document. In the latter case, all images embedded in the Word document are saved into a documentName_files directory. Here is the challenge: how to implement this with XSLT. Obviously we need an extension function to decode each (Base64-encoded) embedded image and write it to a directory. That's not a big deal, but it makes the XSLT stylesheet non-portable, which is why I need different versions for .NET, MSXML and so on. The second problem is that a WordML document doesn't store its own file name, so the question is how to name the directory where the decoded images are saved. I introduced a global stylesheet parameter called docName to deal with this. If the docName parameter isn't provided, the directory name defaults to the first 10 characters of the document title (with spaces removed).
To run the transformation I used the nxslt.exe command line utility. Download it for free if you don't have it yet.
So I created a test Word 2003 document with a couple of images:
I saved it as XML to the test.xml file and ran the transformation to HTML with the following command line:

    nxslt.exe test.xml d:\xsl\Word2HTML-.NET-script.xsl -o test.html docName=test

As a result, the XSLT transformation created the test.html document and a test_files directory containing the two decoded images. Here is how it looks in a browser:
The implementation is very simple. Here it is:

    <msxsl:script language="c#" implements-prefix="ext">
      public string decodePicture(XPathNodeIterator bindata, string dirname, string filename) {
        if (bindata.MoveNext()) {
          // Create the target directory on first use
          System.IO.DirectoryInfo di = new System.IO.DirectoryInfo(dirname);
          if (!di.Exists)
            di.Create();
          // Decode the Base64 text of w:binData and write it out as a file
          using (System.IO.FileStream fs = System.IO.File.Create(System.IO.Path.Combine(di.FullName, filename))) {
            byte[] data = Convert.FromBase64String(bindata.Current.Value);
            fs.Write(data, 0, data.Length);
          }
          return dirname + "/" + filename;
        }
        else
          return "";
      }
    </msxsl:script>

    <xsl:template match="w:pict">
      <xsl:variable name="dir">
        <xsl:choose>
          <xsl:when test="$docName != ''">
            <xsl:value-of select="$docName"/>
          </xsl:when>
          <xsl:otherwise>
            <!-- We need something unique instead of document name -->
            <!-- Let's take first 10 characters of title -->
            <xsl:value-of select="translate(substring($p.docInfo/o:Title, 1, 10), ' ', '')"/>
          </xsl:otherwise>
        </xsl:choose>
        <xsl:text>_files</xsl:text>
      </xsl:variable>
      <img src="{ext:decodePicture(w:binData, $dir, substring-after(w:binData/@w:name, 'wordml://'))}"
           alt="{v:shape/v:imagedata/@o:title}"
           style="{v:shape/@style}"
           title="{v:shape/v:imagedata/@o:title}"/>
    </xsl:template>

Not rocket science indeed. Yes, Sal, WMZ images are not supported; I have no idea how to convert them to GIF.
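For comparison, here is a rough sketch of how the same decoding step might be supplied from the host application as an extension object, so that the stylesheet itself contains no embedded C#. This is not the stylesheet from this post: the ImageExtension class, the urn:example:image-ext namespace, and the tiny pict/binData input are made up for illustration, and the result is still .NET-only. The calls used (XslCompiledTransform, XsltArgumentList.AddParam, XsltArgumentList.AddExtensionObject) are standard System.Xml.Xsl APIs.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper; the class name and namespace URI are illustrative only.
public class ImageExtension
{
    // Decodes Base64 text into dirname/filename and returns the relative path.
    public string DecodePicture(string base64, string dirname, string filename)
    {
        if (string.IsNullOrEmpty(base64)) return "";
        Directory.CreateDirectory(dirname); // does nothing if it already exists
        File.WriteAllBytes(Path.Combine(dirname, filename),
                           Convert.FromBase64String(base64));
        return dirname + "/" + filename;
    }
}

public static class Demo
{
    // Simplified stand-in stylesheet (assumption): plain pict/binData elements
    // instead of the real WordML w:pict/w:binData.
    const string Xslt = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:ext='urn:example:image-ext' exclude-result-prefixes='ext'>
  <xsl:param name='docName'/>
  <xsl:output method='html'/>
  <xsl:template match='/pict'>
    <img src=""{ext:DecodePicture(string(binData), concat($docName, '_files'), string(binData/@name))}""/>
  </xsl:template>
</xsl:stylesheet>";

    public static string Run(string inputXml, string docName)
    {
        var xslt = new XslCompiledTransform();
        using (var xr = XmlReader.Create(new StringReader(Xslt)))
            xslt.Load(xr);

        var args = new XsltArgumentList();
        args.AddParam("docName", "", docName);                 // global stylesheet parameter
        args.AddExtensionObject("urn:example:image-ext", new ImageExtension());

        using (var input = XmlReader.Create(new StringReader(inputXml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, args, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        // "aGVsbG8=" is Base64 for the ASCII bytes of "hello"
        string xml = "<pict><binData name='image1.bin'>aGVsbG8=</binData></pict>";
        Console.WriteLine(Run(xml, "test"));
    }
}
```

Running Main should create a test_files directory with image1.bin in it and print an img element whose src points at that file. The trade-off is that the caller has to register the object, so this does not work with a stock command line runner unless it supports extension objects.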
Download the stylesheet here and give it a shot. Again, this stylesheet requires the .NET XSLT engine. Any comments, requests or bug reports are welcome.
The main scenario IndexingXPathNavigator is meant for is uniform, repetitive XPath selections from a loaded in-memory XML document, such as selecting orders by orderID from an XmlDocument. Using IndexingXPathNavigator with preindexed selections drastically decreases selection time: once the index is built, each selection is a hashtable lookup (close to O(1) on average) instead of an O(n) scan of the document.
After all keys are declared, IndexingXPathNavigator is ready for indexing. The indexing process works as follows: each node in the XML document is matched against all key definitions. For each matching node, the key value is calculated and the node-value pair is added to the appropriate Hashtable. As can be seen, indexing is not a cheap operation: it involves walking the whole XML tree, matching each node multiple times and evaluating XPath expressions. That's the usual price of indexing. Indexing can be done either lazily (on first access) or eagerly (before any selections).
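To make the indexing step concrete, here is a minimal sketch of the idea, not the actual IndexingXPathNavigator source. It simplifies things by using an ordinary XPath selection instead of XSLT-style pattern matching and by handling a single key; the KeyIndexSketch class, the BuildIndex method, and the second Item in the sample data are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class KeyIndexSketch
{
    // Maps each key value to the nodes that produced it.
    // 'match' selects the nodes to index; 'use' is evaluated relative to
    // each selected node and its string value becomes the key.
    public static Dictionary<string, List<XPathNavigator>> BuildIndex(
        XPathNavigator root, string match, string use)
    {
        var index = new Dictionary<string, List<XPathNavigator>>();
        XPathNodeIterator nodes = root.Select(match);
        while (nodes.MoveNext())
        {
            // Clone, because the iterator may reuse its Current navigator
            XPathNavigator node = nodes.Current.Clone();
            string key = (string)node.Evaluate("string(" + use + ")");
            List<XPathNavigator> bucket;
            if (!index.TryGetValue(key, out bucket))
            {
                bucket = new List<XPathNavigator>();
                index[key] = bucket;
            }
            bucket.Add(node);
        }
        return index;
    }

    public static void Main()
    {
        // The second Item is invented sample data
        string xml =
            "<ROOT>" +
            "<Item><OrderID>10952</OrderID><ShipAddress>Obere Str. 57</ShipAddress></Item>" +
            "<Item><OrderID>2</OrderID><ShipAddress>Sample Road 2</ShipAddress></Item>" +
            "</ROOT>";
        var doc = new XPathDocument(new StringReader(xml));

        // One full pass over the document to build the index...
        var index = BuildIndex(doc.CreateNavigator(), "//Item", "OrderID");

        // ...after which each retrieval is a dictionary lookup
        foreach (XPathNavigator item in index["10952"])
            Console.WriteLine(item.SelectSingleNode("ShipAddress").Value);
    }
}
```

The sketch builds its index eagerly; a lazy variant would simply defer the BuildIndex call until the first lookup.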
After indexing, IndexingXPathNavigator is ready for node retrieval. IndexingXPathNavigator augments XPath with XSLT's standard key(string keyname, object keyValue) function, which retrieves nodes directly from the built indexes (Hashtables) by key value. The function is implemented as per the XSLT spec.
    <Item>
      <OrderID> 10952</OrderID>
      <OrderDate> 4/15/96</OrderDate>
      <ShipAddress> Obere Str. 57</ShipAddress>
    </Item>

The aim is to select the shipping address for an order by order ID. Here is how it can be implemented with IndexingXPathNavigator:
    XPathDocument doc = new XPathDocument("test/northwind.xml");
    IndexingXPathNavigator inav = new IndexingXPathNavigator(
        doc.CreateNavigator());
    // Declare a key named "orderKey", which matches Item elements and
    // whose key value is the value of the child OrderID element
    inav.AddKey("orderKey", "OrderIDs/Item", "OrderID");
    // Indexing
    inav.BuildIndexes();
    // Selection (must go through the indexing navigator, which provides key())
    XPathNodeIterator ni = inav.Select("key('orderKey', ' 10330')/ShipAddress");
    while (ni.MoveNext())
        Console.WriteLine(ni.Current.Value);
    Loading XML document: 167.12 ms
    Regular selection, warming...
    Regular selection: 1000 times, total time 5371.79 ms, 1000 nodes selected
    Regular selection, testing...
    Regular selection: 1000 times, total time 5181.80 ms, 1000 nodes selected
    Building IndexingXPathNavigator: 1.03 ms
    Adding keys: 5.16 ms
    Indexing: 58.21 ms
    Indexed selection, warming...
    Indexed selection: 1000 times, total time 515.90 ms, 1000 nodes selected
    Indexed selection, testing...
    Indexed selection: 1000 times, total time 476.06 ms, 1000 nodes selected

As can be seen, the average time for a regular XPath selection is about 5.18 ms, while for an indexed selection it's about 0.476 ms. That's one order of magnitude faster! Note additionally that the XML document is very simple and regular, and that I used the /ROOT/CustomerIDs/OrderIDs/Item[OrderID=' 10330']/ShipAddress XPath for the regular selection, which is almost a linear search and is probably the most efficient form from XPath's point of view. With a more complex XML structure and XPath expressions such as //Item[OrderID=' 10330']/ShipAddress the difference would be even more striking.
The full source code, along with the perf tests, is available for download. As usual, two download locations are available: a local one and GotDotNet (I will update it later). The IndexingXPathNavigator homepage is http://www.tkachenko.com/dotnet/IndexingXPathNavigator.html.
I really like this one. Probably that's what my next article is going to be about. What do you think? I'm cap in hand waiting for comments.