May 4, 2003

Generating Word documents using XSLT

The world is getting better, and Word too! Word 2003 Beta 2 now understands not only *.doc files, but XML as well. Everything is as it should be in the open XML world (which makes some people suspicious): there is a WordML vocabulary, and its schema (a well-documented one, btw) is available as part of the Microsoft Word XML Content Development Kit Beta 2. Given that, it's natural to go on and assume that Word documents can now be queried using XPath or XQuery, as well as transformed and generated using XSLT. Isn't it fantastic?

So here is a "Hello Word!" XSLT stylesheet, which generates a minimal, yet still valid, Word 2003 document:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Root: emit the PI that marks the output as a Word document, then a one-paragraph body -->
    <xsl:template match="/">
        <xsl:processing-instruction 
name="mso-application">progid="Word.Document"</xsl:processing-instruction>
        <w:wordDocument
xmlns:w="http://schemas.microsoft.com/office/word/2003/2/wordml">
            <w:body>
                <w:p>
                    <w:r>
                        <w:t>Hello Word!</w:t>
                    </w:r>
                </w:p>
            </w:body>
        </w:wordDocument>
    </xsl:template>
</xsl:stylesheet>
That <?mso-application progid="Word.Document"?> processing instruction is important: it's how Windows recognizes an XML document as a Word document. It seems they parse only the XML document prolog, looking for this PI. Good idea, I think.
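
For reference, the output of this stylesheet should look roughly like the document below. This is a sketch of the expected result, not captured output: the XML declaration, the encoding and the whitespace all depend on the XSLT processor, and the indentation here is added only for readability. The point to notice is that the PI lands in the prolog, before the root element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed shape of the result; declaration and indentation vary by processor -->
<?mso-application progid="Word.Document"?>
<w:wordDocument
xmlns:w="http://schemas.microsoft.com/office/word/2003/2/wordml">
    <w:body>
        <w:p>
            <w:r>
                <w:t>Hello Word!</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:wordDocument>
```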

Now let's try something more interesting: transforming an XML document into a formatted Word document containing a heading, italic text and a link. Consider the following source doc:

<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<chapter title="XSLT Programming">
    <para>It's <i>very</i> simple. Just ask <link
url="http://google.com">Google</link>.</para>
</chapter>
Then the XSLT stylesheet (quite a big one, due to the verbose element-based WordML syntax):
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="http://schemas.microsoft.com/office/word/2003/2/wordml">
    <!-- Root: emit the mso-application PI, then the w:wordDocument wrapper -->
    <xsl:template match="/">
        <xsl:processing-instruction 
name="mso-application">progid="Word.Document"</xsl:processing-instruction>
        <w:wordDocument>
            <xsl:apply-templates/>
        </w:wordDocument>
    </xsl:template>
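    <!-- chapter: document properties (title), style definitions, then the body,
         which starts with a Heading3 paragraph built from @title -->
    <xsl:template match="chapter">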
        <o:DocumentProperties>
            <o:Title>
                <xsl:value-of select="@title"/>
            </o:Title>
        </o:DocumentProperties>
        <w:styles>
            <w:style w:type="paragraph" w:styleId="Heading3">
                <w:name w:val="heading 3"/>
                <w:pPr>
                    <w:pStyle w:val="Heading3"/>
                    <w:keepNext/>
                    <w:spacing w:before="240" w:after="60"/>
                    <w:outlineLvl w:val="2"/>
                </w:pPr>
                <w:rPr>
                    <w:rFonts w:ascii="Arial" w:h-ansi="Arial"/>
                    <w:b/>
                    <w:sz w:val="26"/>
                </w:rPr>
            </w:style>
            <w:style w:type="character" w:styleId="Hyperlink">
                <w:rPr>
                    <w:color w:val="0000FF"/>
                    <w:u w:val="single"/>
                </w:rPr>
            </w:style>
        </w:styles>
        <w:body>
            <w:p>
                <w:pPr>
                    <w:pStyle w:val="Heading3"/>
                </w:pPr>
                <w:r>
                    <w:t>
                        <xsl:value-of select="@title"/>
                    </w:t>
                </w:r>
            </w:p>
            <xsl:apply-templates/>
        </w:body>
    </xsl:template>
    <!-- para: one WordML paragraph (w:p) -->
    <xsl:template match="para">
        <w:p>
            <xsl:apply-templates/>
        </w:p>
    </xsl:template>
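    <!-- i: a run with the italic property (w:i); its text children are handled
         by the text() template below -->
    <xsl:template match="i">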
        <w:r>
            <w:rPr>
                <w:i/>
            </w:rPr>
            <xsl:apply-templates/>
        </w:r>
    </xsl:template>
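    <!-- text(): every text node that reaches this template becomes its own run;
         xml:space="preserve" keeps leading and trailing whitespace -->
    <xsl:template match="text()">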
        <w:r>
            <w:t xml:space="preserve"><xsl:value-of 
select="."/></w:t>
        </w:r>
    </xsl:template>
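    <!-- link: a w:hlink whose target comes from @url, formatted with the
         Hyperlink character style defined above, plus italic -->
    <xsl:template match="link">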
        <w:hlink w:dest="{@url}">
            <w:r>
                <w:rPr>
                    <w:rStyle w:val="Hyperlink"/>
                    <w:i/>
                </w:rPr>
                <xsl:apply-templates/>
            </w:r>
        </w:hlink>
    </xsl:template>
</xsl:stylesheet>
And the resulting WordML document, opened in Word 2003:
[Image: Generated Word Document]

Not bad.
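
The other direction mentioned at the start, getting data out of a Word document, can be sketched the same way. Below is a minimal stylesheet, my own illustration rather than anything from the CDK, that pulls the plain text out of a WordML document, one line per paragraph. It assumes the Beta 2 namespace URI used above; the URI has to match whatever the document actually declares.

```xml
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.microsoft.com/office/word/2003/2/wordml">
    <!-- Plain text output, so no markup is written -->
    <xsl:output method="text"/>
    <!-- Visit every paragraph in the document body, in document order -->
    <xsl:template match="/">
        <xsl:apply-templates select="//w:body//w:p"/>
    </xsl:template>
    <!-- Concatenate the text of all w:t elements in the paragraph, then end the line -->
    <xsl:template match="w:p">
        <xsl:for-each select=".//w:t">
            <xsl:value-of select="."/>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```

Applied to the "Hello Word!" document generated by the first stylesheet, it should output the single line Hello Word!.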

May 4, 2003 3:42 PM | #Office , #XML
Comments

Ok, I'm closing comments on this page due to severe spamming.

Posted by: Oleg Tkachenko at March 1, 2004 11:13 AM

Interesting to see Microsoft playing catchup. Open Source Office alternative OpenOffice.org http://www.openoffice.org is based on xml and has been around for years.

Posted by: Jez Nicholson at January 23, 2004 6:31 PM

Thanks ! Good work :)

Posted by: Kristopher Gora at December 26, 2003 10:02 AM

Nelson, you need something like /contract/sections/section[@number='section1']/sectionTerm[ @termid='term1']/term

Posted by: Oleg Tkachenko at December 1, 2003 3:10 PM

med, see "Generating images in WordprocessingML" at http://www.tkachenko.com/blog/archives/000106.html

Posted by: Oleg Tkachenko at December 1, 2003 3:05 PM

Hello,
Does anybody know how to get a child node that has a given attribute by using the selectSingleNode method?

I'm trying to get the node "sectionTerm" with attribute termid = "term1" under the section that has attribute number="section1" from the following
XML file (I have to use [ in place of < because the tag names will not show if I use <):


......
[contract][sections]
[section number="section1"]
[sectionTerm termid = "term1"]
[term]Hello[/term]
[/sectionTerm]
[sectionTerm termid = "term2"]
[term]Goodbye[/term]
[/sectionTerm]
[/section]

[section number="section2"]
[sectionTerm termid = "term1"]
[term]Hello[/term]
[/sectionTerm]
[sectionTerm termid = "term2"]
[term]Goodbye[/term]
[/sectionTerm]
[/section]
[/sections]
[/contract]

Posted by: Nelson Xu at November 30, 2003 9:29 PM

I'm a newbie in WordML, how do you handle images?

Posted by: med at November 30, 2003 9:18 PM

This is pretty interesting. I agree with the author.

Posted by: dns at October 12, 2003 2:02 PM

Cris, afaik Word 2003 keeps images embedded within the WordML document, obviously Base64 encoded. It's the w:pict element; take a look at the WordML schema. So it also seems to be quite feasible.

Posted by: Oleg Tkachenko at July 13, 2003 7:58 PM

Yeah, sure, I've been thinking about XSL-FO2WordML and WordML2XSL-FO, but I'm still in the research phase. While I know XSL-FO well, I'm a newbie in WordML.
But that really sounds tempting...

Posted by: Oleg Tkachenko at July 13, 2003 7:20 PM

Using XSL-FO as a unified formatting language for documents: can any WordML be transformed to FO, and can any FO be transformed to WordML? In other words, is there an equivalence (semantic or functional, whatever that means in formal terms, I am not 100% sure) between the two formatting languages?

I don't know that; did you think of that already? I think a definite answer requires some time-consuming research...

Posted by: viktor gritsenko at July 13, 2003 7:05 PM

How do you handle images, and how do you make them local so users can edit them and still see them if an Internet connection is not available?

Posted by: cris at July 10, 2003 8:53 PM

Very Cool,

I will wait until more tools are available!

Thanks for the info, Oleg.

Hans Braumüller
-- + --
Mail Art Networking Visual & Virtual Poet
http://braumueller.crosses.net

Posted by: Hans Braumüller at May 20, 2003 10:47 AM

Oh, Goggle, funny typo, thanks, fixed. Btw, the goggle.com site does exist, but I don't advise browsing it due to nasty spam popup windows.
And as for Word, I'm impressed by these new possibilities too. Let's just wait for the release and for people to upgrade.

Posted by: Oleg Tkachenko at May 5, 2003 1:46 PM

Wow. I wish I understood that! It seems to be one of the holy grails, producing a valid word document *without* using word :-)

You do know you wrote Goggle, right?

Posted by: Dan F at May 5, 2003 1:25 PM
