July 22, 2004

How to get an ASCII-encoded XML document while writing arbitrary Unicode data (e.g. when transforming with the XslTransform class)

Sometimes some of us want to narrow the encoding of an output XML document while preserving data fidelity. E.g. you transform some XML source with arbitrary Unicode text into another format and you need the resulting XML document to be ASCII-encoded (don't ask me why). Here is fast and ...

One could ask: btw, can an ASCII-encoded XML document contain arbitrary Unicode characters at all? While it sounds a bit contradictory, the answer is of course yes, because the encoding of an XML document (the one you can see in the XML declaration) is a sort of "transfer encoding", while the character range of any XML document is always the whole set of legal characters of Unicode and ISO/IEC 10646 (the stricter definition). XML syntax allows any character that is legal in XML to be written as a numeric character reference, e.g. &#x5D1; (Hebrew letter BET). So the solution to the problem of narrowing the encoding is to write all characters that don't fit into the target encoding as numeric character references. Basically that's what the XSLT 1.0 spec requires from XSLT processors:

It is possible that the result tree will contain a character that cannot be represented in the encoding that the XSLT processor is using for output. In this case, if the character occurs in a context where XML recognizes character references (i.e. in the value of an attribute node or text node), then the character should be output as a character reference; otherwise (for example if the character occurs in the name of an element) the XSLT processor should signal an error.

Well, unfortunately the native .NET 1.X XSLT processor, the XslTransform class, doesn't support that yet. So let's see how we can get this done. The first solution that comes to mind is a simple custom XmlWriter that filters output text and encodes all non-ASCII characters as numeric character references. Just like a SAX filter, for the SAX-minded. Here is the implementation:

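using System.Text;
using System.Xml;

public sealed class ASCIIXmlTextWriter : XmlTextWriter 
{
  // Constructors - add more as needed
  public ASCIIXmlTextWriter(string url) :
    base(url, Encoding.ASCII) {}

  public override void WriteString(string text)
  {
    if (text == null) return;
    int start = 0; // start of the pending run of ASCII characters
    for (int i = 0; i < text.Length; i++)
    {
      char c = text[i];
      if (c <= 0x007F) continue; // ASCII (U+0000..U+007F) stays in the run

      // Flush the ASCII run through WriteString so that markup
      // characters such as '<' and '&' still get escaped.
      if (i > start) base.WriteString(text.Substring(start, i - start));

      int codePoint = c;
      // Combine a surrogate pair into one code point, since the two
      // halves cannot be referenced separately.
      if (c >= 0xD800 && c <= 0xDBFF && i + 1 < text.Length &&
          text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
      {
        codePoint = 0x10000 + ((c - 0xD800) << 10) + (text[i + 1] - 0xDC00);
        i++;
      }
      // The reference itself is markup, so it must bypass escaping.
      base.WriteRaw("&#" + codePoint + ";");
      start = i + 1;
    }
    // Flush whatever ASCII text is left after the last reference.
    if (start < text.Length) base.WriteString(text.Substring(start));
  }
}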
Let's test it. Source XML:
<?xml version="1.0" encoding="UTF-8"?>
<message>English, Русский (Russian), עברית (Hebrew)</message>
Dummy stylesheet:
<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:copy-of select="message"/>
  </xsl:template>
</xsl:stylesheet>
The code:
XPathDocument doc = new XPathDocument("foo.xml");
XslTransform xslt = new XslTransform();
xslt.Load("foo.xsl");
ASCIIXmlTextWriter writer = new ASCIIXmlTextWriter("out.xml");
xslt.Transform(doc, null, writer, null);
writer.Close();
And here is the result (note that all non-ASCII characters have been written as numeric character references):
<message>English, &#1056;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081; 
(Russian), &#1506;&#1489;&#1512;&#1497;&#1514; (Hebrew)</message>
Which looks fine in both IE and Mozilla.
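Browsers aside, the round trip can also be checked in code. The following is only a minimal sketch: the RoundTripCheck class, the shortened sample string and the us-ascii XML declaration are assumptions made for illustration, not part of the test above. It feeds pure ASCII bytes to XmlDocument and reads the text back, which should give the original Hebrew characters.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
public class RoundTripCheck
{
  public static string ReadBack()
  {
    // Every character here is ASCII; the Hebrew word is carried
    // entirely by decimal numeric character references.
    string xml = "<?xml version=\"1.0\" encoding=\"us-ascii\"?>" +
                 "<message>&#1506;&#1489;&#1512;&#1497;&#1514; (Hebrew)</message>";
    byte[] bytes = Encoding.ASCII.GetBytes(xml);

    // The parser resolves the references back into Unicode characters.
    XmlDocument doc = new XmlDocument();
    doc.Load(new MemoryStream(bytes));
    return doc.DocumentElement.InnerText;
  }

  public static void Main()
  {
    // Compare against the same word written with \u escapes.
    Console.WriteLine(ReadBack() == "\u05E2\u05D1\u05E8\u05D9\u05EA (Hebrew)");
  }
}
```

The document on the wire never leaves the ASCII range, yet the parsed text is full Unicode, which is exactly the "transfer encoding" point made earlier.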

The simple solution above handles both text and attribute values equally well, but not comments and PIs (but that's easy as well). It's neither optimized nor tested, but you get the idea.
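One caveat about comments and PIs: XML doesn't recognize character references inside them, so a non-ASCII character there can't be escaped, only rejected or replaced. That matches the "signal an error" branch of the XSLT 1.0 wording quoted earlier. Here is one possible sketch of the reject variant. The StrictAsciiXmlTextWriter class, the EnsureAscii helper and the choice of InvalidOperationException are all assumptions made for illustration, and like the writer above it is untested against XslTransform.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical writer: refuses comments and PIs that are not pure ASCII.
public class StrictAsciiXmlTextWriter : XmlTextWriter
{
  public StrictAsciiXmlTextWriter(TextWriter w) : base(w) {}

  // Throws if the text contains anything outside U+0000..U+007F.
  private static void EnsureAscii(string text, string what)
  {
    if (text == null) return;
    foreach (char c in text)
    {
      if (c > 0x007F)
        throw new InvalidOperationException(
          "Non-ASCII character in " + what + " cannot be written as a character reference.");
    }
  }

  public override void WriteComment(string text)
  {
    EnsureAscii(text, "comment");
    base.WriteComment(text);
  }

  public override void WriteProcessingInstruction(string name, string text)
  {
    EnsureAscii(name, "processing instruction target");
    EnsureAscii(text, "processing instruction");
    base.WriteProcessingInstruction(name, text);
  }
}
```

Replacing the offending characters (say, with '?') instead of throwing would be an equally small change; which one is right depends on whether silent data loss in comments is acceptable.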

Justification of XHTML

W3C has published an "HTML and XHTML FAQ" document: "Why is XHTML needed? Isn't HTML good enough?", "What are the advantages of using XHTML rather than HTML?". A rather interesting refresher WRT the recent discussion on the xml-dev list. ...

Small but cool

Isn't it cool to have a small personal page at microsoft.com? :) Every MVP got one recently. Here is mine (aka http://aspnet2.com/mvp.ashx?olegt). And here is the XML MVPs gang. ...