July 30, 2004

Another form of blog comment spam: indirect referencing

I just got several instances of what I believe is another resourceful form of blog comment spam. It looked like ordinary spam that had somehow made it through the MT-Blacklist system I've got installed, and after "Name: free government grants" I was already clicking on the "De-spam using MT-Blacklist" link, but then I ...

July 28, 2004

Tell me who are you and what are you processing

This is a small trick for newbies looking for a way to get the URI of a source XML document and of the stylesheet from within an XSLT stylesheet. ...

Unfortunately neither XPath 1.0 nor XSLT 1.0 provides a solution to the problem. The usual answer is "pass it as a parameter". That's a good one, but not always suitable (e.g. when transforming client side with the <?xml-stylesheet?> PI). The next answer is "use Saxon's saxon:system-id() extension function or write your own". The latter is what I'm going to illustrate.

Simple, ain't it:

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:ext="http://mycompany.com/mynamespace">

  <msxsl:script language="javascript" 
    implements-prefix="ext">
    function uri(nodelist) {
      return nodelist[0].url;
    }
  </msxsl:script>
  
  <xsl:template match="/">
    <p>Currently processing XML document URI is 
      <tt><xsl:value-of select="ext:uri(/)"/></tt></p>
    <p>Currently processing XSLT stylesheet URI is 
      <tt><xsl:value-of select="ext:uri(document(''))"/></tt></p>
  </xsl:template>  
</xsl:stylesheet>
The result (try http://www.tkachenko.com/samples/detect-uri.xml) is:
Currently processing XML document URI is http://www.tkachenko.com/samples/detect-uri.xml

Currently processing XSLT stylesheet URI is http://www.tkachenko.com/samples/detect-uri.xsl

PS. Of course extension functions are not portable and the above works only in IE/MSXML3+.

PPS. Of course it only works well when XML documents are loaded from a URI, not generated on the fly.

PPPS. XPath 2.0 will fix the problem by providing the fn:document-uri() and fn:base-uri() functions.

July 25, 2004

Breadth-first tree traversal in XSLT

As a matter of interest - how would you implement breadth-first tree traversal in XSLT? The traditional algorithm is based on a queue and hence isn't particularly suitable here. It's probably feasible to emulate a queue with temporary trees, but I think that's going to be quite inefficient. Being not ...

Let's say we've got the following XML:

<a>
  <b>
    <d/>
    <e/>
  </b>  
  <c>
    <f/>
    <g/>
  </c>
</a>
It's easy to see that, traversed breadth-first, the sequence of visited nodes would be a, b, c, d, e, f, g. How?

The straightforward declarative solution is to traverse the nodes level by level - just select all nodes at level i, then i+1, and so on until the maximum depth is reached:

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">  
    <xsl:call-template name="BFT">
      <xsl:with-param name="tree" select="/descendant::*"/>
    </xsl:call-template>
  </xsl:template>
  <!-- Breadth-first traversal template -->
  <xsl:template name="BFT">
    <xsl:param name="tree" select="/.."/>
    <xsl:param name="depth" select="0"/>
    <xsl:variable name="nodes-at-this-depth" 
       select="$tree[count(ancestor::*)=$depth]"/>
    <xsl:apply-templates select="$nodes-at-this-depth"/>
    <xsl:if test="count($nodes-at-this-depth)>0">
        <xsl:call-template name="BFT">
          <xsl:with-param name="tree" select="$tree"/>
          <xsl:with-param name="depth" select="$depth + 1"/>       
        </xsl:call-template>
    </xsl:if>    
  </xsl:template>
  <!-- Actual node processing -->
  <xsl:template match="*">    
    <xsl:value-of select="name()"/>    
  </xsl:template>
</xsl:stylesheet>
Not so efficient, but anyway. For each depth level we need to count all ancestors of each node in the list. It looks like the worst-case running time of the above implementation is O(maxDepth*n*maxDepth) = O(maxDepth^2*n), where n is the number of nodes.

Using keys it can be improved to O(maxDepth*n + maxDepth*O(1)) = O(maxDepth*(n+1)), where maxDepth*n is the indexing cost and maxDepth*O(1) is the time needed to retrieve nodes from the index by depth level (provided the key implementation is based on a hashtable):

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:key name="depthMap" match="*" use="count(ancestor::*)"/>
  <xsl:template match="/">
    <xsl:call-template name="BFT"/>
  </xsl:template>
  <!-- Breadth-first traversal template -->
  <xsl:template name="BFT">
    <xsl:param name="depth" select="0"/>
    <xsl:variable name="nodes-at-this-depth" 
      select="key('depthMap', $depth)"/>
    <xsl:apply-templates select="$nodes-at-this-depth"/>
    <xsl:if test="count($nodes-at-this-depth)>0">
      <xsl:call-template name="BFT">
        <xsl:with-param name="depth" select="$depth+1"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
  <!-- Actual node processing -->
  <xsl:template match="*">
    <xsl:value-of select="name()"/>
  </xsl:template>
</xsl:stylesheet>

O(maxDepth*n) is definitely better than O(maxDepth^2*n), but still worse than the procedural O(n). In the worst case (a deep, lean tree) it even becomes O(n^2). But on average XML trees, which are usually wide and rather shallow, it's close to the procedural algorithm's running time.
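
For comparison, here is a minimal sketch of the procedural queue-based algorithm mentioned above. It is written in Python rather than XSLT, purely as an illustration; the function name breadth_first_names is my own and the sample document is the one from this post. Each element is enqueued and dequeued exactly once, which is where the O(n) figure comes from.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Sample document from the post
XML = """<a>
  <b>
    <d/>
    <e/>
  </b>
  <c>
    <f/>
    <g/>
  </c>
</a>"""


def breadth_first_names(root):
    """Return element names in breadth-first order using a FIFO queue."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()   # every element is dequeued exactly once
        order.append(node.tag)
        queue.extend(node)       # enqueue child elements in document order
    return order


print(", ".join(breadth_first_names(ET.fromstring(XML))))
# prints: a, b, c, d, e, f, g
```

The queue is exactly the piece of mutable state that XSLT 1.0 lacks, which is why the stylesheets above fall back to selecting nodes level by level.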

More ideas?

SgmlReader and namespaces

It's obvious, but I didn't realize it till recently - Chris Lovett's SgmlReader doesn't support namespaces. Why? SgmlReader is an SGML reader in the first place and, you know, there are no namespaces in SGML. So whenever you want to cheat and process malformed XML with SgmlReader - beware of namespaces. ...

July 22, 2004

How to get ASCII encoded XML document while writing arbitrary unicode data (e.g. when transforming with XslTransform class)

Sometimes some of us want to narrow the encoding of an output XML document while preserving data fidelity. E.g. you transform some XML source with arbitrary Unicode text into another format and you need the resulting XML document to be ASCII encoded (don't ask me why). Here is fast and ...

One could ask - btw, can an ASCII encoded XML document contain arbitrary Unicode characters? While it sounds a bit contradictory, the answer is of course yes, it can, because the encoding of an XML document (the one you can see in the XML declaration) is a sort of "transfer encoding", while the character range for any XML document is always the whole set of legal characters of Unicode and ISO/IEC 10646 (the stricter definition). XML syntax allows any character that is legal in XML to be written as a numeric character reference, e.g. &#1489; (Hebrew letter BET). So the solution to the given problem of narrowing the encoding is to write all characters that don't fit into the target encoding as numeric character references. Basically that's what the XSLT 1.0 spec requires from XSLT processors:

It is possible that the result tree will contain a character that cannot be represented in the encoding that the XSLT processor is using for output. In this case, if the character occurs in a context where XML recognizes character references (i.e. in the value of an attribute node or text node), then the character should be output as a character reference; otherwise (for example if the character occurs in the name of an element) the XSLT processor should signal an error.

Well, unfortunately the native .NET 1.X XSLT processor - the XslTransform class - doesn't support that yet. So let's see how we can get this done. The first solution that comes to my mind is a simple custom XmlWriter, which filters output text and encodes all non-ASCII characters as numeric character references. Just like a SAX filter, for those SAX minded. Here is the implementation:

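using System.Text;
using System.Xml;

public sealed class ASCIIXmlTextWriter : XmlTextWriter 
{
  //Constructors - add more as needed
  public ASCIIXmlTextWriter(string url) :
    base(url, Encoding.ASCII) {}

  public override void WriteString(string text)
  {
    StringBuilder sb = new StringBuilder(text.Length);
    foreach (char c in text)
    {
      if (c > 0x007F)
      {
        //Not ASCII (ASCII ends at 0x7F): write a decimal character reference.
        //Characters outside the BMP arrive as surrogate pairs and are not handled here.
        sb.Append("&#");
        sb.Append((int)c);
        sb.Append(';');
      }
      //WriteRaw bypasses the writer's own escaping, so escape markup by hand
      else if (c == '&') sb.Append("&amp;");
      else if (c == '<') sb.Append("&lt;");
      else if (c == '>') sb.Append("&gt;");
      else if (c == '"') sb.Append("&quot;");
      else
      {
        sb.Append(c);
      }
    }
    base.WriteRaw(sb.ToString());
  }
}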
Let's test it. Source XML:
<?xml version="1.0" encoding="UTF-8"?>
<message>English, Русский (Russian), עברית (Hebrew)</message>
Dummy stylesheet:
<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:copy-of select="message"/>
  </xsl:template>
</xsl:stylesheet>
The code:
XPathDocument doc = new XPathDocument("foo.xml");
XslTransform xslt = new XslTransform();
xslt.Load("foo.xsl");
ASCIIXmlTextWriter writer = new ASCIIXmlTextWriter("out.xml");
xslt.Transform(doc, null, writer, null);
writer.Close();
And here is the result (note that all non-ASCII characters have been written as numeric character references):
<message>English, &#1056;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081; 
(Russian), &#1506;&#1489;&#1512;&#1497;&#1514; (Hebrew)</message>
Which looks fine in both IE and Mozilla.

The simple solution above handles both text and attribute values equally well, but not comments and PIs (though that's easy as well). It's neither optimized nor tested, but you get the idea.
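
As a quick cross-check of the expected output, the same "write whatever doesn't fit as a numeric character reference" rule can be sketched in a few lines of Python, using the standard xmlcharrefreplace error handler. This is only an illustration of the technique, not part of the .NET solution, and the helper name to_ascii_xml_text is made up for this sketch. Note that it handles only the character references; it does not escape markup characters such as < or &.

```python
def to_ascii_xml_text(text):
    """Encode text as ASCII, writing each non-ASCII character
    as a decimal numeric character reference (&#NNNN;)."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# The message text from the test document above
sample = "English, Русский (Russian), עברית (Hebrew)"
print(to_ascii_xml_text(sample))
```

Python strings are sequences of code points rather than UTF-16 code units, so characters outside the BMP come out as a single reference here.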

Justification of XHTML

W3C has published the "HTML and XHTML FAQ" document. "Why is XHTML needed? Isn't HTML good enough?", "What are the advantages of using XHTML rather than HTML?". A rather interesting refresher with respect to the recent discussion on the xml-dev list. ...

Small but cool

Isn't it cool to have a small personal page at microsoft.com? :) Every MVP got one recently. Here is mine (aka http://aspnet2.com/mvp.ashx?olegt). And here is the XML MVPs gang. ...

July 21, 2004

XML Schema 1.1, First Working Draft

Oh boy! 2004-07-19: The XML Schema Working Group has released the First Public Working Draft of XML Schema 1.1 in two parts: Part 1: Structures and Part 2: Datatypes. The drafts include change logs from the XML Schema 1.0 language and are based on version 1.1 requirements. XML schemas define ...

MovableType automatic IP banning system

Isn't it cool: A visitor to your weblog Signs on the Sand has automatically been banned by posting more than the allowed number of comments in the last 200 seconds. This has been done to prevent a malicious script from overwhelming your weblog with comments. The banned IP address is ...

July 20, 2004

Oracle patented CMS, what is next?

USPTO did it again. Fun is going on. Now Oracle has been granted a patent on CMS. Patent 6,745,238 says: The web site system permits a site administrator to construct the overall structure, design and style of the web site. This allows for a comprehensive design as well as a ...

SchemaCOP is coming?

Gudge writes: On my team we have a bunch of guidelines for writing XML Schema documents. For a while we've been checking schema against the guidelines. Unfortunately the implementation of the checker was in wetware, rather than software. Recently, I found an hour or two to put together a software ...

July 19, 2004

XML Schema: Component Designators WD

This is an interesting one: The XML Schema Working Group has released a revised Working Draft of XML Schema: Component Designators. The document defines a scheme for identifying the XML Schema components specified by the XML Schema Recommendation Part 1 and Part 2. The idea is to be able to ...

July 10, 2004

{ First 10 digit prime in consecutive digits of e }.com

Ok, this is not a new one, but just for those who somehow missed it (just like me). A cool puzzle to solve: { First 10 digit prime in consecutive digits of e }.com How much time does it take for you to crack it? My full time is about ...

July 8, 2004

Online Chat with Microsoft XML Team today

Don't miss it! ...

July 7, 2004

Antenna XSL Formatter Lite released

Antenna House released the first lite version of their famous XSL Formatter (XSL-FO to PDF). It's much cheaper than the full version (only $300 for the Windows version), but has some slightly annoying (at least for me) limitations: Total page number of the formatted pages are limited to 300. The watermark that ...

Tricky XSLT optimization

Rick Jelliffe writes: Perhaps some tricky implementation of XSLT could figure out if a stylesheet is streamable and switch to a streaming strategy. That would be a rather effective optimization indeed. But how could that be implemented in an XSLT/XQuery processor? Obviously full-blown stylesheet analysis would be feasible only having schema information ...

July 4, 2004

VSIP SDK 2005 Beta 1 released

Oh boy, what a month. Here is another juicy release I wish I had any free time to dig in: VSIP SDK 2005 Beta 1. ...

July 2, 2004

Visual Studio 2005 Beta1 at downloads

Visual Studio 2005 Beta1 is available for MSDN subscribers. And as ordinary ISO CD images, not a 2.7 GB bundle. Let's make some good traffic today! ...

July 1, 2004

Tired of spam

I'm tired of comment spam... It reached a level of 15-30 spam instances a day and finally I decided to install the MT-Blacklist plugin for my blogging engine. 5 minutes of installation, updating the blacklist, deep de-spamming and that's it, I'm clean and protected. Well done, Jay Allen! Hope it's gonna help. Anyway if you ...

New XML Editor in Visual Studio 2005 Beta 1

Cool news from the XML Editor Team (announced by Chris Lovett): Announcing: New XML Editor in Visual Studio 2005 Beta 1 Visual Studio 2005 Beta 1 contains a completely new XML Editor, built on top of the core text editor provided by Visual Studio. It is entirely written in C ...