March 17, 2004

Transforming WordML to HTML: Support for Images

Update: this post is outdated; see "WordML2HTML with support for images stylesheet updated" for updates. Here is a new version of the WordML2HTML XSLT stylesheet, developed by Microsoft for Word 2003 Beta 2 and adapted by me for Word 2003 RTM. I called this version "1.1-.NET-script". Here is why. Along with some ...

Word 2003 lets you save a document as a "Single File Web Page" (*.mht, aka Web Archive file) or as an ordinary HTML document. In the latter case, all images embedded in the Word document are saved into a documentName_files directory. The challenge is how to do the same with XSLT. Obviously we need an extension function to decode the embedded (Base64) image and write it to a directory. That's not a big deal, but it makes the XSLT stylesheet non-portable, which is why I need different versions for .NET, MSXML, etc. The second problem is that a WordML document doesn't store its own file name, so the question is how to name the directory where the decoded images are saved. I introduced a global stylesheet parameter called docName to address this. If the docName parameter isn't provided, the directory name defaults to the first 10 characters of the document title (with spaces removed), followed by "_files".
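As a rough sketch of how such a global parameter might be declared at the top of the stylesheet: the msxsl namespace URI is the standard one for Microsoft script blocks, but the ext namespace URI below is a placeholder of mine and not necessarily the one the downloadable stylesheet uses.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt"
    xmlns:ext="urn:example:wordml2html-ext"
    exclude-result-prefixes="msxsl ext">

  <!-- Global parameter: base name for the images directory.
       It defaults to an empty string, so templates can test
       $docName != '' to detect that no value was passed in. -->
  <xsl:param name="docName" select="''"/>

  <!-- Hypothetical placeholder: the real templates would follow here. -->
</xsl:stylesheet>
```

A parameter declared this way can be set from outside the stylesheet, which is what the docName=test argument on the nxslt.exe command line does.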

To run the transformation I used the nxslt.exe command-line utility. Download it for free if you don't have it yet.

So I created a test Word 2003 document with a couple of images:
[Image: test Word document with a couple of images]
I saved it as XML to test.xml and ran the transformation to HTML with the following command line:

nxslt.exe test.xml d:\xsl\Word2HTML-.NET-script.xsl -o test.html docName=test
As a result, the XSLT transformation created the test.html document and a test_files directory containing the two decoded images. Here is how it looks in a browser:
[Image: Word document transformed into HTML]

The implementation is very simple. Here it is:

<msxsl:script language="c#" implements-prefix="ext">
  // Decodes the Base64 content of the first node in bindata, writes it to
  // dirname/filename (creating the directory if needed) and returns the
  // relative path for use in the img src attribute.
  public string decodePicture(XPathNodeIterator bindata, string dirname, string filename) {
    if (bindata.MoveNext()) {
      System.IO.DirectoryInfo di = new System.IO.DirectoryInfo(dirname);
      if (!di.Exists)
        di.Create();
      using (System.IO.FileStream fs =
        System.IO.File.Create(System.IO.Path.Combine(di.FullName, filename))) {
        byte[] data = Convert.FromBase64String(bindata.Current.Value);
        fs.Write(data, 0, data.Length);
      }
      return dirname + "/" + filename;
    }
    // No w:binData node was passed in, so there is nothing to decode.
    return "";
  }
</msxsl:script>
<xsl:template match="w:pict">
  <xsl:variable name="dir">
    <xsl:choose>
      <xsl:when test="$docName != ''">
        <xsl:value-of select="$docName"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- We need something unique instead of document name -->
        <!-- Let's take first 10 characters of title -->
        <xsl:value-of select="translate(substring($p.docInfo/o:Title, 1, 10), ' ', '')"/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>_files</xsl:text>
  </xsl:variable>
  <img
    src="{ext:decodePicture(w:binData, $dir, substring-after(w:binData/@w:name, 'wordml://'))}"
    alt="{v:shape/v:imagedata/@o:title}" style="{v:shape/@style}"
    title="{v:shape/v:imagedata/@o:title}"/>
</xsl:template>
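Since the extension function ties the stylesheet to .NET, one possible way to soften that is to guard the call with the standard XSLT 1.0 function-available() test. The fragment below is only a sketch of that idea and is not part of the downloadable stylesheet. It assumes the same w, v, o and ext namespace prefixes and the same docName parameter as the template above, simplifies the directory name to docName plus "_files", and uses a span with a made-up class name as the fallback. How a given processor treats the msxsl:script element itself is a separate matter that this sketch does not address.

```xml
<xsl:template match="w:pict">
  <xsl:choose>
    <!-- True only when the processor can actually call ext:decodePicture -->
    <xsl:when test="function-available('ext:decodePicture')">
      <img src="{ext:decodePicture(w:binData, concat($docName, '_files'),
                 substring-after(w:binData/@w:name, 'wordml://'))}"
           alt="{v:shape/v:imagedata/@o:title}"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- No extension support: emit the image title as text instead of
           failing the transformation. The class name is hypothetical. -->
      <span class="missing-image">
        <xsl:value-of select="v:shape/v:imagedata/@o:title"/>
      </span>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```

The guard relies on the XSLT 1.0 rule that a processor must not report an error for an unavailable extension function unless the function is actually called.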
Not rocket science indeed. And yes, Sal, WMZ images are not supported; I have no idea how to convert them to GIF.

Download the stylesheet here and give it a shot. Again, this stylesheet requires the .NET XSLT engine. Any comments, requests, or bug reports are welcome.