On transforming WordML to HTML again


One of the consequences of the revolutionary XML support in Microsoft Office 2003 is the possibility of unlocking information in the Microsoft Office System using XML. Most likely it was a deliberate decision to open the Office doors to XML technology, and I'm sure that's a winning strategy.

Talking about transforming WordprocessingML (WordML) to HTML, what's the state of the art nowadays?
There are two related activities I'm aware of, both Microsoft rooted. First, there is the "WordML to HTML XSL Transformation" XSLT stylesheet, available for download at the Microsoft Download Center. It's a huge, well documented, but unsupported beta stylesheet that transforms Word 2003 Beta 2 XML documents to HTML. A final release, which will also support images, is expected, but who knows when?
Second, Don Box is experimenting with a WordML to XHTML+CSS transformation, mostly for the sake of his blogging workflow. He said his stylesheet is better (fewer global variables etc.). Apparently Don hasn't finished it yet, so the stylesheet isn't available.

So one stylesheet is only for Word 2003 Beta 2 documents and the second isn't ready yet. Sounds bad, huh? Here is my temporary solution: the original "WordML Beta 2 to HTML XSL Transformation" stylesheet, fixed by me to support Word 2003 RTM XML documents. As usual with Microsoft stuff, "beta" is most likely 99% of the RTM version. So I fixed the Beta 2 stylesheet a bit and it just works. In fact, the namespaces are the only thing I have fixed so far. I'm currently testing the stylesheet with big real documents, so chances are I'll need to modify it further.
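To give a feel for how small a namespace-only fix is, here is a minimal sketch in Python (not the way the stylesheet was actually patched, which was done by hand). The two URIs in `BETA_TO_RTM` are placeholders made up for illustration, not the real Beta 2 or RTM namespaces, and `retarget_namespaces` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def retarget_namespaces(stylesheet_text, uri_map):
    """Swap old namespace URIs for new ones wherever they appear as a
    complete quoted attribute value (e.g. in xmlns:w="...")."""
    for old_uri, new_uri in uri_map.items():
        for quote in ('"', "'"):
            stylesheet_text = stylesheet_text.replace(
                quote + old_uri + quote, quote + new_uri + quote)
    return stylesheet_text


# Placeholder URIs (assumption): substitute the real Beta 2 and RTM
# namespace URIs taken from your own documents.
BETA_TO_RTM = {
    "urn:example:wordml-beta2": "urn:example:wordml-rtm",
}

SAMPLE_STYLESHEET = (
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
    'xmlns:w="urn:example:wordml-beta2">'
    '<xsl:template match="w:p"><p><xsl:apply-templates/></p></xsl:template>'
    '</xsl:stylesheet>'
)

fixed_stylesheet = retarget_namespaces(SAMPLE_STYLESHEET, BETA_TO_RTM)

# The result must still be well-formed XML; this raises if it is not.
ET.fromstring(fixed_stylesheet)
```

Matching the whole quoted value rather than a bare substring keeps the replacement from touching a longer URI that merely starts with the old one.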

Download version 1.0 of the stylesheet here: Word2HTML-1.0.zip. Credit is due to Microsoft and personally to whoever developed the stylesheet. Any bug reports or comments are appreciated. Just post a comment on this post.

Another idea is to implement support for images. Basically, the idea is to decode the images and save them as external files from an XSLT extension function. I don't see how to do that in a portable way, so most likely I'll soon end up with two stylesheet versions, one for MSXML and one for .NET. Stay tuned.
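The decode-and-save step itself is simple; the portability problem is only about calling it from inside XSLT. As a rough sketch, here it is done outside XSLT, in plain Python, rather than as an extension function. It assumes that images are stored as base64 text in `w:binData` elements whose `w:name` attribute looks like `wordml://01000001.gif`, and that the RTM namespace is the one in `W_NS`; check both against your own documents. `extract_images` is a hypothetical helper name.

```python
import base64
import os
import xml.etree.ElementTree as ET

# Assumed WordML (Word 2003 RTM) namespace; verify against a real document.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def extract_images(wordml_text, out_dir):
    """Decode every w:binData element and save it as a file in out_dir.

    Returns the list of file paths written, in document order.
    """
    root = ET.fromstring(wordml_text)
    saved = []
    for bin_data in root.iter("{%s}binData" % W_NS):
        # w:name is assumed to look like "wordml://01000001.gif".
        name = bin_data.get("{%s}name" % W_NS, "")
        file_name = os.path.basename(name.replace("wordml://", ""))
        if not file_name:
            file_name = "image%d.bin" % len(saved)
        # Base64 text may be wrapped across lines; drop all whitespace.
        payload = base64.b64decode("".join((bin_data.text or "").split()))
        path = os.path.join(out_dir, file_name)
        with open(path, "wb") as image_file:
            image_file.write(payload)
        saved.append(path)
    return saved


# Tiny made-up document: the "image" is just a few demo bytes.
DEMO_BYTES = b"GIF89a demo bytes"
SAMPLE_WORDML = (
    '<w:wordDocument xmlns:w="%s"><w:body><w:p><w:r><w:pict>'
    '<w:binData w:name="wordml://01000001.gif">%s</w:binData>'
    '</w:pict></w:r></w:p></w:body></w:wordDocument>'
    % (W_NS, base64.b64encode(DEMO_BYTES).decode("ascii"))
)
```

Calling `extract_images(SAMPLE_WORDML, some_directory)` writes `01000001.gif` into that directory; the generated HTML would then need its `img` references pointed at the saved files.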


1 TrackBack

TrackBack URL: http://www.tkachenko.com/cgi-bin/mt-tb.cgi/155

Here is a new version of the WordML2HTML XSLT stylesheet, developed by Microsoft for Word 2003 Beta 2 and adapted by me to Word 2003 RTM. I called this version "1.1-.NET-script". Here is why. Along with some bug fixes (typo with w:rStyle, empty <title> i...

11 Comments

I'm searching for an XSL to transform HTML into WordML. Any ideas? Thanks.

Oleg, have you done a version of the XSL which can convert the WMZ into HTML? If so, I am very interested in knowing how, as I have to do this and am quite stuck now. Any ideas, anyone?

Hi Oleg,

I don't know if you've seen it, but MS has released the updated transformations. They actually package it as a viewer that will take the WordML and render it in IE. You can also see the XSL file separately after you install the download. However, you need to render through IE to see the images.

http://www.microsoft.com/downloads/details.aspx?familyid=19676b18-1bcd-4852-93ba-0b5a203ea731&displaylang=en

- Sal

WMZ is a compressed Windows Metafile. I've seen WordML where the contained image data is the base64-encoded version of a compressed WMZ.
Interestingly, when you save a Word doc as HTML, any WMZ files are also exported as GIF files.
However, the WordML does not contain the GIF data, just the WMZ.

- Sal

Not much free time these days :( I must finish the EXSLT.NET article first.

What's the WMZ format?

Hi Oleg,

Any progress on the image support in the WordML to HTML transformation? Will you be supporting a conversion of .wmz files?

Thanks,
Sal

Thanks for your comment, Sal!
Fixed that. Next week I hope to publish the next version with support for images.

Hi Oleg,

Sorry, my previous post didn't escape the <>, it should have been:


In the template, the variable rStyleId has a minor typo, it should be


(Note the /w:rStyle, not /wrStyle.)

Thanks,
Sal

Hi Oleg,

Great job.

I found one error. In the , the variable rStyleId has a minor typo, it should be

(Note the /w:rStyle, not /wrStyle.)

Thanks,
Sal

Yeah, I'm mostly interested in XML2WordML too, but many want the opposite conversion.
For instance, I've recently been implementing a system where one can search a WordML document library and then view the results in a browser with the search keywords highlighted.
Piece of cake using the mentioned XSLT stylesheet!

Hi Oleg,

Frankly, I don't understand the purpose of transforming Word to HTML. Why not simply Save As ... and choose the HTML format?

What seems very attractive to me is being able to transform an XML document (or a set of XML documents) with a known Schema/DTD to WordML, that is, to Word.

The multitude of uses for this is immediately obvious.


Cheers,
Dimitre.
