July 22, 2004

How to get an ASCII encoded XML document while writing arbitrary Unicode data (e.g. when transforming with the XslTransform class)

Sometimes you need to narrow the encoding of an output XML document while preserving data fidelity. E.g. you transform some XML source with arbitrary Unicode text into another format and you need the resulting XML document to be ASCII encoded (don't ask me why). Here is fast and ...

One could ask - btw, can an ASCII encoded XML document contain arbitrary Unicode characters? While it sounds a bit contradictory, the answer is of course - sure it can, because the encoding of an XML document (the one you can see in the XML declaration) is a sort of "transfer encoding", while the character range for any XML document is always the whole set of legal characters of Unicode and ISO/IEC 10646 (the more strict definition). XML syntax allows any character that is legal in XML to be written as a numeric character reference, e.g. &#1489; (Hebrew letter BET). So the solution to the given problem of narrowing the encoding is to write all characters that don't fit into the target encoding as numeric character references. Basically that's what the XSLT 1.0 spec requires from XSLT processors:

It is possible that the result tree will contain a character that cannot be represented in the encoding that the XSLT processor is using for output. In this case, if the character occurs in a context where XML recognizes character references (i.e. in the value of an attribute node or text node), then the character should be output as a character reference; otherwise (for example if the character occurs in the name of an element) the XSLT processor should signal an error.
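
To make the point about numeric character references concrete, here is a minimal sketch showing that a document whose bytes are pure ASCII can still carry the Hebrew letter BET. The class and method names (AsciiXmlDemo, SampleDocument, ReadRootText) are made up for this illustration; only XmlDocument, MemoryStream and Encoding.ASCII are real framework types.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical demo class, not part of any library
public static class AsciiXmlDemo
{
    // Builds an ASCII encoded document; BET is written as a
    // numeric character reference, so every byte is below 0x80
    public static byte[] SampleDocument()
    {
        string xml = "<?xml version=\"1.0\" encoding=\"us-ascii\"?>" +
                     "<letter>&#1489;</letter>";
        return Encoding.ASCII.GetBytes(xml);
    }

    // Parses the bytes and returns the text content of the root element;
    // the parser resolves character references back into Unicode characters
    public static string ReadRootText(byte[] asciiBytes)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(new MemoryStream(asciiBytes));
        return doc.DocumentElement.InnerText;
    }
}
```

Calling AsciiXmlDemo.ReadRootText(AsciiXmlDemo.SampleDocument()) should give back a one-character string containing U+05D1, even though the serialized document never left the ASCII range.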

Well, unfortunately the native .NET 1.X XSLT processor, the XslTransform class, doesn't support that yet. So let's see how we can get this done. The first solution that comes to my mind is a simple custom XmlWriter, which filters output text and encodes all non-ASCII characters as numeric character references. Just like a SAX filter, for those SAX minded. Here is the implementation:

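using System.Text;
using System.Xml;

public sealed class ASCIIXmlTextWriter : XmlTextWriter
{
  //Constructors - add more as needed
  public ASCIIXmlTextWriter(string url) :
    base(url, Encoding.ASCII) {}

  //Called for both text nodes and attribute values
  public override void WriteString(string text)
  {
    StringBuilder sb = new StringBuilder(text.Length);
    for (int i = 0; i < text.Length; i++)
    {
      char c = text[i];
      if (c > 0x007F)
      {
        //Non-ASCII: write as a numeric character reference,
        //merging a surrogate pair into a single code point
        int cp = c;
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < text.Length &&
            text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
          cp = 0x10000 + ((c - 0xD800) << 10) + (text[++i] - 0xDC00);
        sb.Append("&#").Append(cp).Append(';');
      }
      else if (c == '&') sb.Append("&amp;");
      else if (c == '<') sb.Append("&lt;");
      else if (c == '>') sb.Append("&gt;");
      else if (c == '"') sb.Append("&quot;");
      else sb.Append(c);
    }
    //WriteRaw bypasses XmlTextWriter's own escaping,
    //so markup characters are escaped by hand above
    base.WriteRaw(sb.ToString());
  }
}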
Let's test it. Source XML:
<?xml version="1.0" encoding="UTF-8"?>
<message>English, Русский (Russian), עברית (Hebrew)</message>
Dummy stylesheet:
<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:copy-of select="message"/>
  </xsl:template>
</xsl:stylesheet>
The code:
XPathDocument doc = new XPathDocument("foo.xml");
XslTransform xslt = new XslTransform();
xslt.Load("foo.xsl");
ASCIIXmlTextWriter writer = new ASCIIXmlTextWriter("out.xml");
xslt.Transform(doc, null, writer, null);
writer.Close();
And here is the result (note that all non-ASCII characters have been written as numeric character references):
<message>English, &#1056;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081; 
(Russian), &#1506;&#1489;&#1512;&#1497;&#1514; (Hebrew)</message>
This renders fine in both IE and Mozilla.

The simple solution above handles text and attribute values equally well, but not comments and PIs (but that's easy as well). It's neither optimized nor tested, but you get the idea.
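
As for comments and PIs, XML doesn't recognize character references inside them, so they can't be handled by the same escaping. Going by the XSLT rule quoted above, one option is to signal an error instead. Here is a rough, untested sketch of that idea; the class name StrictAsciiXmlTextWriter and the helper EnsureAscii are hypothetical, and throwing InvalidOperationException is just my assumption about a reasonable way to fail.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical writer that refuses non-ASCII data in comments and PIs
public class StrictAsciiXmlTextWriter : XmlTextWriter
{
    public StrictAsciiXmlTextWriter(Stream output) :
        base(output, Encoding.ASCII) {}

    public override void WriteComment(string text)
    {
        EnsureAscii(text, "comment");
        base.WriteComment(text);
    }

    public override void WriteProcessingInstruction(string name, string text)
    {
        EnsureAscii(text, "processing instruction");
        base.WriteProcessingInstruction(name, text);
    }

    // Throws if the text holds a character that ASCII cannot represent
    private static void EnsureAscii(string text, string context)
    {
        if (text == null) return;
        foreach (char c in text)
        {
            if (c > 0x007F)
                throw new InvalidOperationException(
                    "Non-ASCII character in " + context +
                    " cannot be written as a character reference.");
        }
    }
}
```

Whether to fail or to silently replace such characters is a design choice; failing keeps the data fidelity promise explicit rather than quietly losing characters.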

Justification of XHTML

W3C has published an "HTML and XHTML FAQ" document: "Why is XHTML needed? Isn't HTML good enough?", "What are the advantages of using XHTML rather than HTML?". A rather interesting refresher with regard to the recent discussion on the xml-dev list. ...

Small but cool

Isn't it cool to have a small personal page at microsoft.com? :) Every MVP got one recently. Here is mine (aka http://aspnet2.com/mvp.ashx?olegt). And here is the XML MVPs gang. ...

July 21, 2004

XML Schema 1.1, First Working Draft

Oh boy! 2004-07-19: The XML Schema Working Group has released the First Public Working Draft of XML Schema 1.1 in two parts: Part 1: Structures and Part 2: Datatypes. The drafts include change logs from the XML Schema 1.0 language and are based on version 1.1 requirements. XML schemas define ...

MovableType automatic IP banning system

Isn't it cool: A visitor to your weblog Signs on the Sand has automatically been banned by posting more than the allowed number of comments in the last 200 seconds. This has been done to prevent a malicious script from overwhelming your weblog with comments. The banned IP address is ...

July 20, 2004

Oracle patented CMS, what is next?

USPTO did it again. Fun is going on. Now Oracle has been granted a patent on CMS. Patent 6,745,238 says: The web site system permits a site administrator to construct the overall structure, design and style of the web site. This allows for a comprehensive design as well as a ...

SchemaCOP is coming?

Gudge writes: On my team we have a bunch of guidelines for writing XML Schema documents. For a while we've been checking schema against the guidelines. Unfortunately the implementation of the checker was in wetware, rather than software. Recently, I found an hour or two to put together a software ...

July 19, 2004

XML Schema: Component Designators WD

This is an interesting one: The XML Schema Working Group has released a revised Working Draft of XML Schema: Component Designators. The document defines a scheme for identifying the XML Schema components specified by the XML Schema Recommendation Part 1 and Part 2. The idea is to be able to ...