Implementing XML Base in .NET

XML Base is a tiny W3C Recommendation, just a couple of pages. It lets you define base URIs for parts of an XML document via the semantically predefined xml:base attribute (similar to the HTML BASE element). It's an XML Core spec, standing in line with "Namespaces in XML" and XML InfoSet, and it was published back in 2001. Small, simple, no strings attached and no mind-boggling complexity. Still, unfortunately, neither MSXML nor System.Xml in .NET supports it (Dare Obasanjo once wrote about the reasons and the plans to implement it). Instead, XmlResolver is the facility for manipulating URIs. But while XmlResolvers are a powerful technique for resolving URIs, they are a procedural facility: one has to write a custom resolver to implement the resolving per se. XML Base is a declarative facility: one just adds an xml:base attribute to some element and that's it, the base URI for this element and its subtree is changed. So now that you see how useful it is, here is a small how-to introducing an amazingly simple way to implement XML Base for .NET.

So what is XML Base all about? It introduces the xml:base attribute with predefined semantics (just like xml:space or xml:lang) for manipulating base URIs. The xml:base attribute can be placed on any element in any XML document to give that element and its descendants a base URI other than the base URI of the document or external entity. One purpose is to provide a native XML way to define base URIs. Another purpose is keeping relative URIs resolvable when documents are combined: e.g. when document A is included into document B in some different location, relative URIs in the content of A would be broken. To keep them identifying the same resources, the xml:base attribute is used. If you still don't get it, take a look at the sample in the "Preserving Base URI" section of the "Combining XML Documents with XInclude" article at the MSDN Xml Dev Center. So it's basically XML's analog of HTML's BASE tag.
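To make the resolution rule concrete, here is a minimal sketch of how nested xml:base values combine, using nothing but System.Uri. The XmlBaseDemo class, the ResolveBase method and the example.org URIs are made up for illustration; they are not part of System.Xml or of the XML Base spec.

```csharp
using System;

// Hypothetical helper for illustration only (not part of System.Xml):
// resolves a chain of xml:base values, outermost first, against the
// absolute URI of the document.
public static class XmlBaseDemo
{
    public static Uri ResolveBase(Uri documentUri, params string[] xmlBaseValues)
    {
        Uri current = documentUri;
        foreach (string value in xmlBaseValues)
        {
            // Uri(Uri, string) resolves a relative reference against the
            // current base; an absolute value simply replaces it.
            current = new Uri(current, value);
        }
        return current;
    }
}
```

Assuming a document at http://example.org/docs/catalog.xml, calling ResolveBase with the values "files/" and "2004/" should give http://example.org/docs/files/2004/, and a relative reference such as "file1.xml" found inside that subtree then resolves against this URI rather than against the document's own location.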

Basically, System.Xml supports base URIs all over the infrastructure; the only problem is that basic syntax-level facilities such as XmlTextReader and XmlTextWriter ignore the xml:base attribute when parsing and writing XML. Can we add such support in a transparent way? Sure. Let's take XmlTextReader and extend it so that each time it gets positioned on an element which bears an xml:base attribute, the BaseURI property gets updated to reflect it. Here it is:

using System;
using System.Collections;
using System.Xml;

public class XmlBaseAwareXmlTextReader : XmlTextReader
{
    private XmlBaseState _state = new XmlBaseState();
    private Stack _states = null;
    //True when the current scope was opened by an empty element,
    //which never produces an EndElement node
    private bool _popOnNextRead = false;

    //Add more constructors as needed
    public XmlBaseAwareXmlTextReader(string uri)
        : base(uri)
    {
        _state.BaseUri = new Uri(base.BaseURI);
    }

    public override string BaseURI
    {
        get
        {
            return _state.BaseUri == null ? "" : _state.BaseUri.AbsoluteUri;
        }
    }

    public override bool Read()
    {
        if (_popOnNextRead)
        {
            //Leave the scope of the empty element read previously
            _state = (XmlBaseState)_states.Pop();
            _popOnNextRead = false;
        }
        bool baseRead = base.Read();
        if (baseRead)
        {
            if (base.NodeType == XmlNodeType.Element &&
                base.HasAttributes)
            {
                string baseAttr = GetAttribute("xml:base");
                if (baseAttr == null)
                    return baseRead;
                Uri newBaseUri = null;
                if (_state.BaseUri == null)
                    newBaseUri = new Uri(baseAttr);
                else
                    newBaseUri = new Uri(_state.BaseUri, baseAttr);
                if (_states == null)
                    _states = new Stack();
                //Push current state and allocate new one
                _states.Push(_state);
                _state = new XmlBaseState(newBaseUri, base.Depth);
                _popOnNextRead = base.IsEmptyElement;
            }
            else if (base.NodeType == XmlNodeType.EndElement)
            {
                //_states stays null until the first xml:base is seen
                if (base.Depth == _state.Depth &&
                    _states != null && _states.Count > 0)
                {
                    //Pop previous state
                    _state = (XmlBaseState)_states.Pop();
                }
            }
        }
        return baseRead;
    }
}

internal class XmlBaseState 
{
    public XmlBaseState() {}
    public XmlBaseState(Uri baseUri, int depth) 
    {
        this.BaseUri = baseUri;
        this.Depth = depth;
    }
    public Uri BaseUri;
    public int Depth;
}
Simple, huh? Now let's test it. Suppose I have a collection of XML documents in the "d:/Files" directory and a catalog XML file, such as
<catalog>
  <files xml:base="file:///d:/Files/">
    <file name="file1.xml"/>
  </files>
</catalog>
As you can see, the xml:base attribute here defines the base URI for the files element subtree to be file:///d:/Files/, so file names are to be resolved relative to that folder no matter where the catalog file is actually placed. (Of course I could have used absolute URIs instead, but having absolute URIs hardcoded in every single place easily leads to a maintenance nightmare for any real system.)

When loading this document into XPathDocument via XmlBaseAwareXmlTextReader, it can be seen that base URIs are preserved as per the XML Base spec:

XmlReader r = new XmlBaseAwareXmlTextReader("foo.xml");
XPathDocument doc = new XPathDocument(r);
XPathNavigator nav = doc.CreateNavigator();
XPathNodeIterator ni = nav.Select("/catalog");
if (ni.MoveNext())
  Console.WriteLine(ni.Current.BaseURI);
ni = nav.Select("/catalog/files/file");
if (ni.MoveNext())
  Console.WriteLine(ni.Current.BaseURI);
outputs
file:///D:/projects/Test/foo.xml
file:///d:/Files/
Unfortunately, XmlDocument doesn't seem to be as smart as XPathDocument in this respect and only supports the base URI of the document and external entities. Too bad, too bad.

Ok, that was an abstract test; now consider some XSLT processing: I load files by name for some processing using the document() function. Recall that by default (with a single string argument) the document() function resolves relative URIs relative to the XSLT stylesheet's base URI (strictly speaking, relative to the base URI of the XSLT instruction which contains the document() call). To resolve URIs relative to some other base URI, the second argument is used. So I'm going to pass <file> elements to the document() function as the second argument, so that URIs are resolved relative to their base URI (which is defined via the xml:base attribute on their parent element <files>):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="files">
    <files>
      <xsl:apply-templates/>
    </files>
  </xsl:template>
  <xsl:template match="file">
    <xsl:copy-of select="document(@name, .)"/>
  </xsl:template>
</xsl:stylesheet>
The code is as simple as
XmlReader r = new XmlBaseAwareXmlTextReader("foo.xml");
XPathDocument doc = new XPathDocument(r);
XslTransform xslt = new XslTransform();
xslt.Load("foo.xsl");
xslt.Transform(doc, null, Console.Out);
The result is
<files>
  <para>File 1 content</para>
</files>
As you can see, when using XmlBaseAwareXmlTextReader with XPathDocument one can get XML Base support for XPath and XSLT.

Alternatively, I could implement XmlBaseAwareXmlTextReader as an XmlReader rather than as an XmlTextReader (if you know the difference). And in the same simple way XML Base can be implemented for XML writing, as XmlBaseAwareXmlTextWriter. Similar classes are used in XInclude.NET, and I'm also going to add XmlBaseAwareXmlTextReader and XmlBaseAwareXmlTextWriter to our collection of custom XML tools in the MVP.XML project.
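For the curious, here is one rough sketch of the same scope tracking done outside the reader, so that it works with any XmlReader. It is not the MVP.XML or XInclude.NET code: the XmlBaseTracker class and its members are invented for illustration, and it assumes the .NET 2.0 APIs (generics and XmlReader.Create) and an absolute document URI.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not the MVP.XML implementation: tracks xml:base
// scopes for any XmlReader. Call Update() once after each successful Read().
public class XmlBaseTracker
{
    private const string XmlNamespace = "http://www.w3.org/XML/1998/namespace";
    private readonly Uri _documentUri; // must be an absolute URI
    // Each entry: depth of the element that declared xml:base + resolved URI.
    private readonly Stack<KeyValuePair<int, Uri>> _scopes =
        new Stack<KeyValuePair<int, Uri>>();

    public XmlBaseTracker(Uri documentUri)
    {
        _documentUri = documentUri;
    }

    // Base URI in effect for the node the reader is positioned on.
    public Uri Current
    {
        get { return _scopes.Count > 0 ? _scopes.Peek().Value : _documentUri; }
    }

    public void Update(XmlReader reader)
    {
        // An end tag still belongs to the scope opened at its own depth;
        // any other node at that depth is a sibling, so that scope is over.
        // This also closes scopes opened by empty elements, which never
        // produce an EndElement node.
        int keep = reader.NodeType == XmlNodeType.EndElement
            ? reader.Depth
            : reader.Depth - 1;
        while (_scopes.Count > 0 && _scopes.Peek().Key > keep)
            _scopes.Pop();

        if (reader.NodeType == XmlNodeType.Element)
        {
            string value = reader.GetAttribute("base", XmlNamespace);
            if (value != null)
                _scopes.Push(new KeyValuePair<int, Uri>(
                    reader.Depth, new Uri(Current, value)));
        }
    }

    // Convenience for trying it out: maps each element name to its base URI.
    public static Dictionary<string, string> Collect(string xml, Uri documentUri)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();
        XmlBaseTracker tracker = new XmlBaseTracker(documentUri);
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                tracker.Update(reader);
                if (reader.NodeType == XmlNodeType.Element)
                    result[reader.Name] = tracker.Current.AbsoluteUri;
            }
        }
        return result;
    }
}
```

The trade-off is that the caller has to ask the tracker for the base URI, because the reader's own BaseURI property stays unchanged. That is why deriving from the reader, as in this article, is more transparent for consumers such as XPathDocument.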

Update: XmlBaseAwareXmlTextReader is now part of the Common module of the MVP.XML library.

8 Comments

There is a bug in the sample

Uri's pushed on the stack when (XmlReader.NodeType == XmlNodeType.Element && XmlReader.IsEmptyElement == true) are not popped after reading past the element.

Ok, updated the code.

Oh, yeah... My fault.

When you leave the scope of an element with xml:base, shouldn't you return to the previous BaseURI scope? Maybe you need to keep a stack of Uris...

Extending DOM is too muddy stuff...

I'm wondering whether an XmlBaseAwareDocument inheriting from XmlDocument can do the trick of supporting the xml:base...

BTW, great to see you so actively working on the project!!

Yep, I'm bad on naming as always.
Talking about object model, I like that System.Xml 2.0's one - there is a settings class, you set it up, e.g. turn validation off, something other on. And then some factory object instantiates XmlReader or XmlWriter based on that settings.
Probably we need to implement it for the Mvp-Xml Common module too.

Oleg,

This stuff is very useful!

I am worried about the proliferation of too many classes with long names such as "XmlBaseAwareXmlTextWriter"

In a well-architectured Object Model (or more generally API) this should not be the case.

It seems to me that such proliferation results from working on a large area only in a piecemeal approach, dealing with the trees when the forest is not completely in view.

We need a really more compact, coherent, logical and meaningful Object Model, not a Baroque one.

Cheers,

Dimitre
