June 21, 2004

Validating Doctype-less documents against DTD

Here is another interesting puzzle to solve: how would you validate a Doctype-less XML document (one that has no Doctype declaration) against a DTD? ...

XmlValidatingReader needs a Doctype to validate a document; the only way to force it to validate against a DTD is to give it XML with a Doctype. Ok, my first impulse was to write a simple custom XmlReader, which exposes a synthetic Doctype while reading Doctype-less documents:
XmlTextReader -> DoctypeAppendingXmlReader -> XmlValidatingReader
That's a bit nontrivial, as PublicID and SystemID should be exposed as synthetic attributes too, but quite doable with a small state machine. Unfortunately, that doesn't work. XmlValidatingReader still doesn't validate, even when given a Doctype. A bit of reflection revealed that when encountering a Doctype, XmlValidatingReader asks XmlTextReader for the DTD via its internal property. Obviously it's null, as XmlTextReader never sees the synthetic Doctype. No way, won't work.

Ok, then what? Another approach is to modify the XML before validation by adding a Doctype. No XmlDocument or XSLT here please, that's a job for an XmlReader-XmlWriter pipeline. Here is the modifying code:

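//Requires: using System.Text; using System.Xml;
XmlReader r = new XmlTextReader("foo.xml");
XmlWriter w = new XmlTextWriter("foo2.xml", Encoding.UTF8);
bool hasDoctype = false;
//WriteNode() below already advances the reader to the next node,
//so Read() is called only once; calling it per iteration would skip nodes
r.Read();
while (!r.EOF) 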
{
    if (r.NodeType == XmlNodeType.DocumentType)
        hasDoctype = true;
    else if (r.NodeType == XmlNodeType.Element) 
    {
        if (!hasDoctype) 
        {
            //First element is about to be written - insert Doctype
            w.WriteDocType(r.Name, null, "foo.dtd", null);        
        }        
    }
    w.WriteNode(r, false);
}
r.Close();
w.Close();
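//Now let's validate the modified one; with no ValidationEventHandler
//attached, the first validation error is thrown as XmlSchemaException
XmlValidatingReader vr = new XmlValidatingReader(
    new XmlTextReader("foo2.xml"));
while (vr.Read());
vr.Close();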
Works fine, but requires a temporary buffer/file/whatever to hold the modified version of the document. Not really satisfying. Any other ideas? I think a better solution exists. E.g. if the document is going to be loaded into some in-memory XML store during validation anyway, then some sort of in-memory validation might help, but I doubt that trick will work for DTD validation.
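
For what it's worth, the "temporary buffer" flavour of the same approach might look roughly like the sketch below, with a MemoryStream standing in for foo2.xml. This is only a sketch under assumptions: the names DoctypeInjector, AddDoctype and OnValidationError are made up for illustration, the input is assumed to be small enough to buffer and to carry no encoding declaration that conflicts with UTF-8, and passing the original path as the reader's base URL is assumed to be enough for a relative DTD system identifier to resolve next to the source document.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper, not part of any library.
class DoctypeInjector
{
    // Copies the document at xmlPath into memory, inserting
    // <!DOCTYPE root SYSTEM "dtdSystemId"> before the root element
    // when the document has no Doctype of its own.
    static MemoryStream AddDoctype(string xmlPath, string dtdSystemId)
    {
        MemoryStream buffer = new MemoryStream();
        XmlTextReader r = new XmlTextReader(xmlPath);
        XmlTextWriter w = new XmlTextWriter(buffer, Encoding.UTF8);
        bool hasDoctype = false;
        r.Read();
        while (!r.EOF)
        {
            if (r.NodeType == XmlNodeType.DocumentType)
                hasDoctype = true;
            else if (r.NodeType == XmlNodeType.Element && !hasDoctype)
                w.WriteDocType(r.Name, null, dtdSystemId, null);
            // Writes the current node (with its subtree) and moves
            // the reader on to the next top-level node.
            w.WriteNode(r, false);
        }
        r.Close();
        // Flush rather than Close: closing the writer would also
        // close the underlying MemoryStream.
        w.Flush();
        buffer.Position = 0;
        return buffer;
    }

    static void OnValidationError(object sender, ValidationEventArgs e)
    {
        Console.WriteLine("{0}: {1}", e.Severity, e.Message);
    }

    static void Main(string[] args)
    {
        string xmlPath = args.Length > 0 ? args[0] : "foo.xml";
        string dtd = args.Length > 1 ? args[1] : "foo.dtd";

        MemoryStream buffer = AddDoctype(xmlPath, dtd);
        // Assumption: xmlPath as base URL lets a relative DTD
        // reference resolve against the original document.
        XmlValidatingReader vr = new XmlValidatingReader(
            new XmlTextReader(xmlPath, buffer));
        vr.ValidationType = ValidationType.DTD;
        vr.ValidationEventHandler +=
            new ValidationEventHandler(OnValidationError);
        while (vr.Read()) { }
        vr.Close();
    }
}
```

Run with the document and the DTD system identifier as arguments, validation problems are printed instead of being thrown. It only trades the disk for memory, of course: the whole modified document is still held in the buffer, so the complaint above stands.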