December 19, 2004

Kurt Cagle makes a business case for XSLT 2.0

As usual, a very long post (an article, actually) by Kurt Cagle on "The Business Case for XSLT 2.0". It explains why XSLT 2.0 is good and why Microsoft should implement it. With Michael Champion's comments, it's worth reading. ...

Would you like to see XSLT1.1 + EXSLT in .NET2?

Hey, I've got another idea. XQuery and XSLT2 are surely huge undertakings (we can truly thank W3C for that), but there are still plenty of plain poor .NET devs struggling with the limitations of XSLT 1.0 and XPath 1.0. What if Microsoft implemented XSLT 1.1 + EXSLT in .NET 2.0, would ...

XSLT 1.1 is that officially frozen XSLT version which was supposed to improve XSLT in an evolutionary way, by solving only the most irritating problems in XSLT 1.0. Changes from XSLT 1.0 are small:

- that nasty result tree fragment data type is eliminated, so there's no need for the xxx:node-set() function;
- XML namespace quirks are fixed;
- support for XML Base is added;
- multiple output is supported via the new xsl:document instruction;
- xsl:apply-imports can have parameters;
- a standard way to define embedded extension functions is defined: xsl:script;
- a new "external object" data type is defined for better interop with extension languages.

No big deal to implement IMO, but what a relief for .NET devs working with XSLT. And it's quite stable: XSLT 1.1 isn't actually a Recommendation, but it's officially frozen and won't change anymore. Saxon and jd.xslt support it. And implementing EXSLT would provide a rich function library that makes XSLT developers much more productive by eliminating the boring need to reimplement from scratch, every time, such trivial tasks as tokenizing strings, formatting dates or getting a list of unique values. The EXSLT.NET project proves it's pretty implementable.
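To show just how trivial those three tasks are outside of XSLT 1.0, here is a rough sketch in Python. These are only analogues written for comparison: the helper names are made up, and the code does not claim to reproduce the exact semantics of the corresponding EXSLT functions.

```python
import re
from datetime import date


def tokenize(text, delimiters=" \t\r\n"):
    """Split text at any of the delimiter characters, dropping empty tokens.

    Hypothetical helper; only loosely modelled on string tokenizing in EXSLT.
    """
    if not delimiters:
        # No delimiters given: fall back to one token per character.
        return list(text)
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, text) if token]


def distinct(values):
    """Return the unique values, keeping the first occurrence of each."""
    return list(dict.fromkeys(values))


def format_date(day):
    """Format a date as YYYY-MM-DD."""
    return day.strftime("%Y-%m-%d")


tokens = tokenize("red, green,,blue", ", ")
unique = distinct(["xml", "xslt", "xml", "xpath", "xslt"])
stamp = format_date(date(2004, 12, 19))
```

Each of these takes a line or two here, while in plain XSLT 1.0 they typically mean recursive templates or key tricks, which is exactly the pain a built-in function library would remove.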

Yes, one can use EXSLT.NET right now, but the EXSLT.NET library has some serious limitations. I'm talking about perf and security problems. The main problem is how EXSLT.NET is implemented. The main idea behind EXSLT was that XSLT vendors would implement it, while EXSLT.NET is just an external layer on top of the XslTransform class. It's implemented as user extension functions, not system extensions like the msxsl:node-set() function. Hence an awful lot of reflection work is done on each function call and when returning a node-set, and of course there is the FullTrust security demand, which makes EXSLT.NET plain useless in any not fully trusted environment such as ASP.NET. All these problems could be fixed easily by just moving EXSLT.NET into the core of the XSLT implementation, which would make it faster, safer and more reliable.

Well, just an idea to evaluate actually.

In other .NET related XML news

Some XML news in no particular order: Irwin Dolobowsky says we should expect very interesting articles at MSDN XML Dev Center; I'm especially looking forward to this one: "Helena Kupkova will show us how to create bookmarks in XML Streams with the ResetableXmlReader." Hmmm, sweet. AFAIR we've been discussing ...

Red pill for Michael Champion

Oh, that's big news - Michael Champion is now Program Manager for XML Standards in Microsoft's XML WebData team. Wow, wow, wow - those are the only words I can say. Here is his intro on his new blog (hey, he is a Microsoft employee, so it's http://blogs.msdn.com/mikechampion, not http://weblogs.asp.net/mikechampion ...

Architecture of the World Wide Web, Volume One

W3C has at last published the "Architecture of the World Wide Web, Volume One" as a W3C Recommendation. It was cooked in long hot discussions by Web heavyweights and geeks. Here is what it's about: This document describes the properties we desire of the Web and the design choices that have been ...

It's 47 printed pages and I haven't had time to read it thoroughly yet, but I skimmed the XML-related parts. Finally there are some normative answers to some bloated questions.

Binary vs Text data formats:

The trade-offs between binary and textual data formats are complex and application-dependent. Binary formats can be substantially more compact, particularly for complex pointer-rich data structures. Also, they can be consumed more rapidly by agents in those cases where they can be loaded into memory and used with little or no conversion. Note, however, that such cases are relatively uncommon as such direct use may open the door to security issues that can only practically be addressed by examining every aspect of the data structure in detail.

Textual formats are usually more portable and interoperable. Textual formats also have the considerable advantage that they can be directly read by human beings (and understood, given sufficient documentation). This can simplify the tasks of creating and maintaining software, and allow the direct intervention of humans in the processing chain without recourse to tools more complex than the ubiquitous text editor. Finally, it simplifies the necessary human task of learning about new data formats; this is called the "view source" effect.

It is important to emphasize that intuition as to such matters as data size and processing speed is not a reliable guide in data format design; quantitative studies are essential to a correct understanding of the trade-offs. Therefore, designers of a data format specification should make a considered choice between binary and textual format design.

Oh yeah, well said.

When to use XML:

XML defines textual data formats that are naturally suited to describing data objects which are hierarchical and processed in a chosen sequence. It is widely, but not universally, applicable for data formats; an audio or video format, for example, is unlikely to be well suited to expression in XML. Design constraints that would suggest the use of XML include:

1. Requirement for a hierarchical structure.
2. Need for a wide range of tools on a variety of platforms.
3. Need for data that can outlive the applications that currently process it.
4. Ability to support internationalization in a self-describing way that makes confusion over coding options unlikely.
5. Early detection of encoding errors with no requirement to "work around" such errors.
6. A high proportion of human-readable textual content.
7. Potential composition of the data format with other XML-encoded formats.
8. Desire for data easily parsed by both humans and machines.
9. Desire for vocabularies that can be invented in a distributed manner and combined flexibly.

On linking in XML:

Designers of XML-based formats may consider using XLink and, for defining fragment identifier syntax, using the XPointer framework and XPointer element() Schemes.

Note that "may". It means "we'd like to see at least somebody using XLink, though we admit it's not that good." It's still an issue:

XLink is not the only linking design that has been proposed for XML, nor is it universally accepted as a good design.

On our favorite nightmare - XML namespaces. It's always an issue (aka it's too long), so go read it. Here is a part related to the misunderstanding Dare was writing about:

Attributes are always scoped by the element on which they appear. An attribute that is "global," that is, one that might meaningfully appear on elements of many types, including elements in other namespaces, should be explicitly placed in a namespace. Local attributes, ones associated with only a particular element type, need not be included in a namespace since their meaning will always be clear from the context provided by that element.
The type attribute from the W3C XML Schema Instance namespace "http://www.w3.org/2001/XMLSchema-instance" ([XMLSCHEMA], section 4.3.2) is an example of a global attribute. It can be used by authors of any vocabulary to make an assertion in instance data about the type of the element on which it appears. As a global attribute, it must always be qualified. The frame attribute on an HTML table is an example of a local attribute. There is no value in placing that attribute in a namespace since the attribute is unlikely to be useful on an element other than an HTML table.
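The global vs local distinction is easy to see in any namespace-aware parser. Here is a minimal sketch using Python's standard xml.etree.ElementTree; the `order` element, its `http://example.org/orders` namespace, the `RushOrder` type name and the `priority` attribute are invented for illustration and do not come from the W3C document.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical vocabulary: xsi:type is a global attribute,
# priority is a local one that only makes sense on <order>.
DOC = """\
<order xmlns="http://example.org/orders"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:type="RushOrder" priority="high"/>
"""

root = ET.fromstring(DOC)

# ElementTree spells namespaced names as "{namespace-uri}local-name".
global_attrs = {k: v for k, v in root.attrib.items() if k.startswith("{")}
local_attrs = {k: v for k, v in root.attrib.items() if not k.startswith("{")}

print(root.tag)      # the element picks up the default namespace
print(global_attrs)  # xsi:type carries its own namespace
print(local_attrs)   # priority is in no namespace at all
```

Note that the default namespace declaration applies to the element but not to the unprefixed `priority` attribute, which is the usual source of the misunderstanding.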

And here are some new definitions for a very bloated topic:

Another benefit of using URIs to build XML namespaces is that the namespace URI can be used to identify an information resource that contains useful information, machine-usable and/or human-usable, about terms in the namespace. This type of information resource is called a namespace document. When a namespace URI owner provides a namespace document, it is authoritative for the namespace.

There are many reasons to provide a namespace document. A person might want to:

- understand the purpose of the namespace,
- learn how to use the markup vocabulary in the namespace,
- find out who controls it and associated policies,
- request authority to access schemas or collateral material about it, or
- report a bug or situation that could be considered an error in some collateral material.

A processor might want to:

- retrieve a schema, for validation,
- retrieve a style sheet, for presentation, or
- retrieve ontologies, for making inferences.

In general, there is no established best practice for creating representations of a namespace document; application expectations will influence what data format or formats are used. Application expectations will also influence whether relevant information appears directly in a representation or is referenced from it.

Well, I'm not sure I fully agree with this practice, but at least it sounds reasonable and clear.

On the QNames-in-content problem:

Do not allow both QNames and URIs in attribute values or element content where they are indistinguishable.
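Why are they indistinguishable? Because lexically a prefixed QName also looks like an absolute URI whose scheme is the prefix. A small sketch of that overlap, assuming deliberately simplified ASCII-only patterns (the real QName and URI grammars are broader) and hypothetical function names:

```python
import re

# Simplified, ASCII-only approximation of a QName: optional prefix, then local name.
QNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*(:[A-Za-z_][A-Za-z0-9_.\-]*)?")

# Simplified absolute URI: a scheme, a colon, then anything without whitespace.
ABSOLUTE_URI = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:\S*")


def looks_like_qname(value):
    return QNAME.fullmatch(value) is not None


def looks_like_absolute_uri(value):
    return ABSOLUTE_URI.fullmatch(value) is not None


def is_ambiguous(value):
    """True when the value could be read either as a QName or as a URI."""
    return looks_like_qname(value) and looks_like_absolute_uri(value)


for sample in ["xs:string", "mailto:joe", "http://www.w3.org/2001/XMLSchema", "string"]:
    print(sample, is_ambiguous(sample))
```

Something like `xs:string` passes both checks, so a format that accepts both in the same slot has no way to tell which one the author meant.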

The XML ID problem is still not solved.

Media types for XML:

In general, a representation provider SHOULD NOT assign Internet media types beginning with "text/" to XML representations.

Read that again. Use what RFC 3023 says - "application/xml" and all that jazz with the "+xml" suffix (e.g. "image/svg+xml"). Also:

In general, a representation provider SHOULD NOT specify the character encoding for XML data in protocol headers since the data is self-describing.
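Both pieces of advice are easy to check mechanically. Below is a minimal sketch in Python: the function names are hypothetical, the media type check only covers "text/xml", "application/xml" and the "+xml" suffix (RFC 3023 registers a few more types), and the last part shows what "self-describing" means in practice, with the parser picking the encoding from the XML declaration alone.

```python
import xml.etree.ElementTree as ET


def essence(media_type):
    """Strip parameters: 'text/xml; charset=utf-8' -> 'text/xml'."""
    return media_type.split(";", 1)[0].strip().lower()


def is_xml_media_type(media_type):
    name = essence(media_type)
    return name in ("text/xml", "application/xml") or name.endswith("+xml")


def is_discouraged_for_xml(media_type):
    """XML served under a 'text/' type, which the quote above advises against."""
    return is_xml_media_type(media_type) and essence(media_type).startswith("text/")


def has_charset_parameter(media_type):
    params = media_type.split(";")[1:]
    return any(p.strip().lower().startswith("charset=") for p in params)


# The document names its own encoding, so no protocol header is needed.
LATIN1_DOC = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\u00e9</name>'
).encode("iso-8859-1")
parsed_text = ET.fromstring(LATIN1_DOC).text

print(is_discouraged_for_xml("text/xml; charset=us-ascii"))
print(is_discouraged_for_xml("image/svg+xml"))
print(parsed_text)
```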

So lots of cool stuff to read and follow.