Streaming XML filtering in Java and .NET

| 1 Comment | No TrackBacks

XML processing is changing. In Java SAX slowly but steadily goes away or at least goes into low level and nowadays Java with StAX is not so different from .NET XmlReader. I found it pretty interesting to compare approaches to streaming filtering XML in Java and .NET. Filtering is a very useful technique for transforming XML on the fly, while XML is being read. Filtering out parts or branches application isn't interested to process is a great way to simplify XML reading code, which is especially important in streaming XML processing which usually tends to be more complicated than in-memory based (XML DOM) processing.

Let's say we have this dummy XML and we want to extract "interesting data" out of it.

<root> <ignoreme>junk</ignoreme> <data>interesting data</data> </root> StAX API has a dedicated built-in facility for filtering - StreamFilter/EventFilter (as it happens in Java world StAX is a bit overengineered and contains actually two APIs - iterator-style and cursor-based one). Here is how it looks in Java with wonderful StAX:
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader reader = xif.createXMLStreamReader(
    new StreamSource("foo.xml"));
reader = xif.createFilteredReader(reader, new StreamFilter() {
    private int ignoreDepth = 0;

    public boolean accept(XMLStreamReader reader) {
        if (reader.isStartElement()
            && reader.getLocalName().equals("ignoreme")) {
            ignoreDepth++;
            return false;
        } else if (reader.isEndElement()
           && reader.getLocalName().equals("ignoreme")) {
           ignoreDepth--;
           return false;
        }
        return (ignoreDepth == 0);
    }
});
// move to <root>
moveToNextTag(reader);
// move to <data>
moveToNextTag(reader);
// read data
System.out.println(reader.getElementText());
reader.close();
Where moveToNextTag() is an utility method doing what its name says:
do {
    reader.next();
} while (!reader.isStartElement() && !reader.isEndElement());
XmlStreamReader actually provides method nextTag(), but weirdly enough it can't skip text (even text filtered out by an underlying filter!) and throws an exception.

Now .NET code. Unlike StAX, .NET doesn't provide any facility for XML filtering so usual approach is to implement filter as a full-blown custom XmlReader and then chain it to another XmlReader instance. As I said before implementing custom XmlReader even .NET 2.0 still sucks (holy cow - 26 abstract methods or deriving from legacy nonconormant XmlTextReader). So I'm going to use XmlWrappingReader helper I was recommending to use:

public class Test
{
    private class XmlFilter : XmlWrappingReader
    {
        public XmlFilter(string uri)
            : base(XmlReader.Create(uri)) { }

        public override bool Read()
        {
            bool baseRead = base.Read();
            if (NodeType == XmlNodeType.Element &&
                LocalName == "ignoreme")
            {
                Skip();
                return base.Read();
            }
            return baseRead;
        }
    }

    static void Main(string[] args)
    {
        XmlFilter filter = new XmlFilter("../../foo.xml");
        XmlReader r = XmlReader.Create(filter, null);
        //move to <root>
        r.MoveToContent();
        //Move to <data>
        MoveToNextTag(r);
        Console.WriteLine(r.ReadString());
    }

    private static void MoveToNextTag(XmlReader r)
    {
        do
        {
            r.Read();
        } while (!(r.NodeType == XmlNodeType.Element) &&
        !(r.NodeType == XmlNodeType.EndElement));

    }
}
Amazingly similar but not so cool because of lack of anonymous classes in .NET 2.0 (expected in .NET 3.0).

In short - what I like in Java version - built-in support for XML filtering, anonymous classes. What I don't like in Java version: filter can be called more than one time on the same position, what means that real filter implementation must support such scenario; very ascetic API, too few utility methods. What I like in .NET version: lots of useful methods in XmlReader such as Skip(), ReadToXXX() etc. What I don't like - no built-in support for filters, no anonymous methods.

Besides - if you work with StAX you can readily work with .NET XmlReader and the other way. Great unification saves hours learning for developers. I wonder if streaming XML processing API should be standardized?

Related Blog Posts

No TrackBacks

TrackBack URL: http://www.tkachenko.com/cgi-bin/mt-tb.cgi/605

1 Comment

You can find a very interesting solution from Mainsoft for Cross Platform Development - Java to .NET.

Leave a comment