Transforming WordML to HTML: Support for Images


Update: this post is outdated; see "WordML2HTML with support for images stylesheet updated" for updates.

Here is a new version of the WordML2HTML XSLT stylesheet, developed by Microsoft for Word 2003 Beta 2 and adapted by me to Word 2003 RTM. I called this version "1.1-.NET-script", and here is why. Along with some bug fixes (a typo with w:rStyle, an empty <title> in the generated HTML, etc.) I implemented basic support for images. That required an XSLT extension function, which I implemented with .NET and <msxsl:script>. MHT and MSXML/JScript versions are coming soon.

Word 2003 can save a document as a "Single File Web Page" (*.mht, also known as a Web Archive file) or as an ordinary HTML document. In the latter case all images embedded in the Word document are saved into a documentName_files directory. The challenge is how to implement that with XSLT. Obviously we need an extension function to decode each Base64-encoded embedded image and write it to a directory. That's not a big deal, but it makes the XSLT stylesheet non-portable, which is why I need different versions for .NET, MSXML, etc. The second problem is that a WordML document doesn't store its own file name, so the question is how to name the directory for the decoded images. I introduced a global stylesheet parameter called docName to address this. If the docName parameter isn't provided, it defaults to the first 10 characters of the document title, with spaces removed.
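The directory-naming rule is easy to state outside XSLT as well. Here is a minimal Python sketch of it, not part of the stylesheet; the function name images_dir_name and the sample title are made up for illustration.

```python
def images_dir_name(doc_name: str, title: str) -> str:
    """Return the name of the directory that receives the decoded images."""
    if doc_name:
        # An explicit docName parameter always wins.
        base = doc_name
    else:
        # Fallback: first 10 characters of the title, with spaces removed.
        base = title[:10].replace(" ", "")
    return base + "_files"


print(images_dir_name("test", "My Test Document"))  # test_files
print(images_dir_name("", "My Test Document"))      # MyTestDo_files
```

Note that the fallback is only as unique as the first 10 characters of the title, so two documents with similar titles would share a directory; passing docName explicitly avoids that.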

To run the transformation I used the nxslt.exe command-line utility. Download it for free if you don't have it yet.

So I created a test Word 2003 document with a couple of images:
Test Word document with a couple of images
I saved it as XML to the test.xml file and ran the transformation to HTML with the following command line:

nxslt.exe test.xml d:\xsl\Word2HTML-.NET-script.xsl -o test.html docName=test
As a result, the XSLT transformation created the test.html document and a test_files directory containing the two decoded images. Here is how it looks in a browser:
Word document transformed into HTML.

The implementation is a very simple one. Here it is:

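<msxsl:script language="c#" implements-prefix="ext">
  // Decodes the Base64 content of the first node in bindata, writes it to
  // dirname/filename and returns that relative path (used as the img src).
  public string decodePicture(XPathNodeIterator bindata, string dirname, string filename) {
    // The iterator must be advanced before Current can be read.
    if (bindata.MoveNext()) {
      // Create the output directory on first use.
      System.IO.DirectoryInfo di = new System.IO.DirectoryInfo(dirname);
      if (!di.Exists)
        di.Create();
      using (System.IO.FileStream fs = 
        System.IO.File.Create(System.IO.Path.Combine(di.FullName, filename))) {
        byte[] data = Convert.FromBase64String(bindata.Current.Value);
        fs.Write(data, 0, data.Length);
      }
      return dirname + "/" + filename;
    }
    else {
      // No w:binData node was passed in, so there is nothing to decode.
      return "";
    }
  }
</msxsl:script>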
<xsl:template match="w:pict">
  <xsl:variable name="dir">
    <xsl:choose>
      <xsl:when test="$docName != ''">
        <xsl:value-of select="$docName"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- We need something unique instead of document name -->
        <!-- Let's take first 10 characters of title -->
        <xsl:value-of select="translate(substring($p.docInfo/o:Title, 1, 10), ' ', '')"/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>_files</xsl:text>		
  </xsl:variable>
  <img 
  src="{ext:decodePicture(w:binData, $dir, substring-after(w:binData/@w:name, 'wordml://'))}" 
  alt="{v:shape/v:imagedata/@o:title}" style="{v:shape/@style}" 
  title="{v:shape/v:imagedata/@o:title}"/>
</xsl:template>
Not rocket science, indeed. Yes, Sal, WMZ images are not supported; I have no idea how to convert them to GIF.
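For engines that have no <msxsl:script>, the same decode-and-save step can also be done outside XSLT. Below is a rough Python sketch using only the standard library. It is not one of the promised MHT or MSXML versions, and the function name decode_pictures is hypothetical. It assumes the Word 2003 WordML namespace and w:binData names of the form wordml://..., as in the template above. Unlike the stylesheet, it skips w:binData elements whose name lacks that prefix instead of producing an empty file name.

```python
import base64
import os
import xml.etree.ElementTree as ET

# Namespace of Word 2003 WordML (the w: prefix in the stylesheet).
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
PREFIX = "wordml://"


def decode_pictures(wordml_path, dirname):
    """Decode every w:binData element into dirname; return the saved paths."""
    saved = []
    root = ET.parse(wordml_path).getroot()
    for bin_data in root.iter("{%s}binData" % W_NS):
        name = bin_data.get("{%s}name" % W_NS, "")
        if not name.startswith(PREFIX):
            continue  # assumption: only wordml:// names are handled
        filename = name[len(PREFIX):]
        # Create the output directory on first use.
        os.makedirs(dirname, exist_ok=True)
        # b64decode ignores line breaks inside the encoded text by default.
        data = base64.b64decode(bin_data.text or "")
        with open(os.path.join(dirname, filename), "wb") as f:
            f.write(data)
        saved.append(dirname + "/" + filename)
    return saved
```

Calling decode_pictures("test.xml", "test_files") on the test document should leave the decoded images in test_files, which is what the stylesheet does as a side effect of building each img element.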

Download the stylesheet here and give it a shot. Again - this stylesheet requires the .NET XSLT engine. Any comments, requests or bug reports are welcome.


2 TrackBacks

TrackBack URL: http://www.tkachenko.com/cgi-bin/mt-tb.cgi/197

You have been Taken Out! Comments about your post on this link. Thanks!

Many people have asked me if there is an easy way to go from Word XML into XHTML. I've already mentioned...

20 Comments

Hi,
The HTML file generated by nxslt does not render properly in IE8.

Please give me some solution / suggestion.

Suppose that, instead of specifying the path, the image contents are in the XML file. How can I retrieve the image back into the doc file? Reply to my mail saravanan_article@yahoo.com and also tell me how to run the application. Is it required to write an XSL transformation for that?

Silly bug, Thomas. The bindata iterator must be advanced with MoveNext() before it can be used. Somehow it works without that in .NET 1.1, but not in .NET 2.0.
I fixed this post and stylesheet.

Super work!!! But why doesn't it function with nxslt2?

Firstly - great job - this is absolutely brilliant!
Is there any way to get it to convert header images?

Thanks

Fergal

Hello Oleg!
I have a WordML file and a problem using your script. In my XML I have two images, and while processing the second image I always get the following error: System.UnauthorizedAccessException: Access to the path... is denied. It always happens with the second image. I've changed your code so that the output directory changes for the next image, but I still get the same error: the directory is created, but after that I get the exception. Any ideas?
Thanks
Alexander

Nope, sorry, too busy currently.

Any new solutions on wmz images?

Sorry I made a mistake

Here is the valid XSL




alt="{v:shape/v:imagedata/@o:title}" style="{v:shape/@style}" title="{v:shape/v:imagedata/@o:title}"/>


Note that I pass the base64 image in session to avoid writing to disk.

Now the last file, "show_image.php"

This is great when you use the .NET framework on a Windows platform (or Mono on any platform).

Here is a solution to do that in PHP5.

I have 3 files: word2html.xsl, wordml_preview_html.php and show_image.php.

In word2html.xsl put:

-->


This will call the PHP function "base64IMG". Now in your PHP code for "wordml_preview_html.php":

/* Code to use the XSLT */
$xsl = new DOMDocument;
$xsl->load('xsl/word2html.xsl');
$proc = new XSLTProcessor;
$proc->registerPHPFunctions();
$proc->importStyleSheet($xsl); // attach the xsl rules
$dom = $proc->transformToDoc($dom);
// $dom is the DomDocument where I loaded the WordML XML data

/* function that is called */
function base64IMG($binData, $filename) {
    $_SESSION['image_test']['binData'] = $binData;
    return 'utils/show_image.php?type=jpg';
}

Well, it's an access denied error, which obviously has nothing to do with XSLT. How do you run the XSLT? It looks like it's trying to create an image on disk and security doesn't allow that.

Hi there,

Sorry about the delayed response. To reproduce the problem, just place an image into the header section of a Word document. It doesn't seem to matter if it's JPEG or GIF.

Stephajn, how can I reproduce that problem?

Very handy. Images don't seem to be picked up if they're in the header section of a WordML document though.

Well, sounds like a broken WordML document? That would be interesting to see; can you send it to me?

Oleg,

I'm getting this exception in your 'ext:decodePicture' function:

System.Xml.Xsl.XsltException: Function 'ext:decodePicture()' has failed. ---> System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.FormatException: Invalid character in a Base-64 string.
   at System.Convert.FromBase64String(String s)
   at Microsoft.Xslt.CompiledScripts.CSharp.ScriptClass_1.decodePicture(XPathNodeIterator bindata, String dirname, String filename)

Great work! Thank you.

very cool. Kudos
