August 3, 2005

Saxon 8.5, new optimizations and abilities

Michael Kay has released version 8.5 of the Saxon XSLT and XQuery processor. This release implements some very interesting optimizations (available only in the commercial version, though) and new abilities, one of which is probably worth implementing in the EXSLT.NET module. ...

Optimizations include:

* hash join optimization for both XSLT and XQuery (can give fantastic speed-up when processing large documents)
Hash join is a well-known technique, implemented by many RDBMS engines (including SQL Server), that is used to optimize set-matching operations. You can read about hash join here.
* a binary disk representation of validated source documents, reducing the document loading costs when the same document is used repeatedly by many transformations
So here comes a binary PSVI representation. It is a proprietary one for sure, but a portable binary format is an oxymoron anyway.
* a mechanism for sequential XSLT processing of input documents without reading the whole document into memory, making it feasible to process very large documents provided the transformation is serial in nature
That reminds me of what Mike said a year ago:
Firstly, there's a range of techniques that come under the heading of parallelism. Xalan, for example, has the parser and tree builder for the source document running in parallel with the transformation engine: if the stylesheet needs access to nodes that aren't there yet, the transformation engine waits until the parser has delivered them. The real saving here would come if it was also possible to discard parts of the tree once the stylesheet has finished with them. Unfortunately no-one seems close to solving this problem, even though many stylesheets do process the source document in a broadly sequential way.
Looks like he solved it. It would be interesting to measure the performance boost this optimization provides.
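
The idea of processing a document serially and discarding each part once it has been handled can be illustrated outside of XSLT. Below is a minimal Python sketch using the standard library's xml.etree.ElementTree.iterparse. It is not how Saxon implements its sequential mode; the element name "record" and the total_amount helper are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def total_amount(source, tag="record"):
    """Sum the numeric text of every <tag> element, one element at a time.

    `source` is a filename or a binary file-like object. The element name
    "record" is a hypothetical choice for this sketch.
    """
    total = 0.0
    # iterparse builds the tree incrementally and reports each element
    # as soon as its end tag has been parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            total += float(elem.text)
            # Drop the element's text and children now that we are done
            # with it. The emptied element itself stays attached to its
            # parent, which is acceptable for a sketch.
            elem.clear()
    return total


sample = b"<records><record>1.5</record><record>2.5</record></records>"
print(total_amount(io.BytesIO(sample)))  # prints 4.0
```

This only works because the computation is serial in nature: each record is needed exactly once, in document order.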

Additionally, the free Saxon version is now able to process a whole directory of files using this syntax:

document("dir?recurse=yes;select=*.html;parser=org.ccil.cowan.tagsoup.Parser")
This returns all *.html files in the "dir" directory, processed recursively and converted to XML using the TagSoup parser. It seems to be pretty useful and a piece of cake to implement at the same time. I've heard users ask for wildcards in the document() function many times, and every time the answer was: go write a custom resolver and combine the documents somehow.

If anybody needs this functionality (wildcards and recursive processing for the document() function, plus the ability to load HTML documents), speak up and I'll go and implement it for the EXSLT.NET module.
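
To show why this looks like a piece of cake, here is a rough Python sketch of the file-selection half of such a resolver. It is only an assumption about how the URI from the document() example could be interpreted, not Saxon's code. The expand_directory_uri helper is hypothetical, it handles only the "recurse" and "select" parameters, and the defaults it picks when a parameter is missing are my own.

```python
import fnmatch
import os
import tempfile


def expand_directory_uri(uri):
    """Return sorted paths selected by a 'dir?recurse=yes;select=*.html' URI.

    Hypothetical helper: only 'recurse' and 'select' are interpreted; any
    other parameter (such as 'parser') is ignored here.
    """
    directory, _, query = uri.partition("?")
    # Parameters are separated by semicolons, each of the form name=value.
    params = dict(p.split("=", 1) for p in query.split(";") if "=" in p)
    pattern = params.get("select", "*")              # assumed default
    recurse = params.get("recurse", "no") == "yes"   # assumed default

    matches = []
    for root, dirs, files in os.walk(directory):
        matches.extend(
            os.path.join(root, name)
            for name in files
            if fnmatch.fnmatch(name, pattern)
        )
        if not recurse:
            dirs.clear()  # prune the walk: do not descend into subdirectories
    return sorted(matches)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.html", "b.txt", os.path.join("sub", "c.html")):
        open(os.path.join(tmp, rel), "w").close()
    found = expand_directory_uri(tmp + "?recurse=yes;select=*.html")
    print([os.path.relpath(p, tmp) for p in found])
```

The remaining work would be loading each selected file (through an HTML-to-XML parser where requested) and returning the resulting documents as a node set.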

Gobo Eiffel XSLT - XSLT 2.0 Processor written in Eiffel

Colin Paul Adams has announced Gobo Eiffel XSLT, a free XSLT 2.0 processor written in Eiffel. Gexslt is intended to conform to a Basic-level XSLT 2.0 Processor and is currently still under development. A compiled Win32 version can be downloaded at http://www.gobosoft.com/download/gobo34.zip. ...

August 1, 2005

XML Enhances Java

If you thought it's only Microsoft who's working on integrating XML into the core of programming languages, look at what IBM is doing for Java. This is a mainstream trend now. XML Enhancements for Java are an emerging technology from IBM, which provides a set of language extensions that facilitate XML ...

What are XML Enhancements for Java™?

XML Enhancements for Java (XJ) are a set of extensions to Java 1.4 that integrate support for XML, XML Schema and XPath 1.0 into the language. The advantages of XJ over existing mechanisms for XML development are:

o Familiarity (for the XML Programmer) : XML processing in XJ is consistent with open XML standards.
o Robustness : XJ programs are strongly typed with respect to XML Schemas. The XJ compiler can detect errors in uses of XPath expressions and construction of XML data.
o Easier Maintenance: Since XJ programs are written in terms of XML and not low-level APIs such as DOM or SAX, they are easier to maintain and modify if XML Schemas change.
o Performance: Since the compiler is aware of the use of XML in a program, it can optimize the runtime representation, parsing, and XPath evaluation of XML.
In XJ, one can import XML schemas just as one does Java classes. All the element declarations in the XML schema are then available to programmers as if they were Java classes. Programmers can write inline XPath expressions on these classes, and the compiler checks them for correctness with respect to the XML schema. In addition, the compiler performs optimizations in order to improve the evaluation of XPath expressions. A programmer may construct new XML documents by writing XML directly inline. Again, the compiler ensures correctness with respect to the appropriate schema. By integrating XML and Java, XJ allows programmers to reuse existing Java libraries in the development of XML code and vice-versa.
Here are some samples of what XJ allows:
XPath integration:
int min = 70; 
Sequence<year> ys = sd[|/year[sum(.//sales) > $min]|];
Inline XML construction:
region r = new region(<region> 
    <name>NorthEast</name> 
    <sales unit='GBP'>75</sales> 
</region>);
Dynamic XML construction:
float conversion = 1.9; 
salesdata s = 
  new salesdata( 
    <salesdata> 
      <year> 
        {y} 
        <sales unit='Dollars'>{grossSales * conversion}</sales> 
        {r} 
      </year> 
    </salesdata>); 
And many more. Find XJ documentation here.
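
For contrast, here is roughly what the first XJ sample (years whose total sales exceed a threshold) looks like when written against a general-purpose tree API. This is a Python sketch using the standard library. The document shape (salesdata, year, region, sales) is inferred from the XJ snippets above, the id attribute and the numbers are invented, and none of it is taken from IBM's documentation.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the element names come from the XJ snippets.
doc = ET.fromstring(
    "<salesdata>"
    "<year id='2004'>"
    "<region><name>NorthEast</name><sales unit='GBP'>75</sales></region>"
    "<region><name>SouthWest</name><sales unit='GBP'>40</sales></region>"
    "</year>"
    "<year id='2005'>"
    "<region><name>NorthEast</name><sales unit='GBP'>30</sales></region>"
    "</year>"
    "</salesdata>"
)


def years_over(salesdata, minimum):
    """Hand-written equivalent of year[sum(.//sales) > $min]."""
    return [
        year
        for year in salesdata.findall("year")
        # year.iter("sales") walks all descendant <sales> elements.
        if sum(float(s.text) for s in year.iter("sales")) > minimum
    ]


print([y.get("id") for y in years_over(doc, 70)])  # prints ['2004']
```

Nothing here is checked against a schema: a misspelled element name would silently select nothing, which is the kind of error the XJ compiler is said to catch.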

New in .NET 2.0: Push-Based XML Validation with XmlSchemaValidator Class

This is a real hidden gem in .NET 2.0 that everybody (including me) has pretty much overlooked. The XmlSchemaValidator class from the System.Xml.Schema namespace is a push-based W3C XML Schema validation engine. Push-based means a different processing model, the opposite of the pull-based one. Think about how you work with XmlWriter (push) and ...

Scenarios that XmlSchemaValidator enables:

  1. Validation of XML in-place, without the need to reparse it by reading it via XmlReader. This is actually how the new XmlDocument.Validate() method is implemented.
  2. Validation of custom XML or even viewed-as-XML data stores
  3. Validation during XML construction - now it's possible to create a validating XmlWriter. And I wonder why it's not done yet? That's a job for XML MVPs for sure.
  4. Partial validation
  5. Access to PSVI (Post Schema Validation Information)
  6. Retrieving Expected Particles, Attributes, and Unspecified Default Attributes - this is how the smart IntelliSense in the Visual Studio 2005 XML Editor works.
Quite an impressive list and quite an impressive class.
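
To make the push model concrete, here is a toy Python sketch of a validator that is driven by the caller, one event at a time, and that can report what is allowed next (the idea behind scenario 6). It is not the .NET API. The class, its method names, and the tiny content model format (each element maps to the exact sequence of children it must contain) are all invented for illustration.

```python
class ValidationError(Exception):
    pass


class PushValidator:
    """Toy push-style validator; the caller pushes events in document order."""

    def __init__(self, content_model, root):
        self.model = content_model  # element name -> required child sequence
        self.root = root
        self.root_seen = False
        self.stack = []  # one [name, children_seen] entry per open element

    def expected(self):
        """Names of the elements that may legally come next."""
        if not self.stack:
            return [] if self.root_seen else [self.root]
        name, seen = self.stack[-1]
        return self.model.get(name, [])[seen:seen + 1]

    def validate_element(self, name):
        allowed = self.expected()
        if name not in allowed:
            raise ValidationError(f"unexpected <{name}>, expected {allowed}")
        if self.stack:
            self.stack[-1][1] += 1  # the parent has consumed one more child
        self.root_seen = True
        self.stack.append([name, 0])

    def validate_end_element(self):
        if not self.stack:
            raise ValidationError("no open element to close")
        name, seen = self.stack.pop()
        missing = self.model.get(name, [])[seen:]
        if missing:
            raise ValidationError(f"<{name}> is missing {missing}")


model = {"order": ["customer", "item"], "customer": [], "item": []}
v = PushValidator(model, root="order")
v.validate_element("order")
print(v.expected())  # prints ['customer']
v.validate_element("customer")
v.validate_end_element()
print(v.expected())  # prints ['item']
```

Because the validator never reads anything itself, the events can come from any source: an in-memory tree, a writer that is producing XML, or a data store that is merely viewed as XML.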