Non-Extractive XML Parsing


Well, I'm working on decreasing the size of the "Items for Read" folder in RSS Bandit. Still many to catch up on, but anyway. XML.com has published the article "Non-Extractive Parsing for XML" by Jimmy Zhang. In it Jimmy proposes another approach to XML parsing: a "non-extractive" style of tokenization. In simple words, his idea is to treat an XML document as a static sequence of characters, where each XML token can be identified by an offset:length pair of integers. That would open up lots of new possibilities, such as updating part of an XML document without reserializing the unchanged content (by copying only the leading and trailing buffers of text instead), fast addressing by offset instead of IDs or XPath, and creating a binary index for a document (the "parse once, use many times" approach).
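To make the offset:length idea concrete, here is a minimal sketch in Python. It is not the tokenizer from Jimmy's article: the regular expression, the `Token` tuple and the helper names are all my own assumptions, and it deliberately handles only plain start tags, end tags and text.

```python
import re
from typing import List, Tuple

# Hypothetical token record: (kind, offset, length). No text is copied out.
Token = Tuple[str, int, int]

# Deliberately naive: recognizes only start tags, end tags and text runs.
_TOKEN_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*>|([^<]+)")


def tokenize(doc: str) -> List[Token]:
    """Record where each token lives in the source instead of extracting it."""
    tokens = []
    for m in _TOKEN_RE.finditer(doc):
        if m.group(3) is not None:
            kind = "text"
        elif m.group(1):
            kind = "end"
        else:
            kind = "start"
        tokens.append((kind, m.start(), m.end() - m.start()))
    return tokens


def token_text(doc: str, token: Token) -> str:
    """Materialize a token lazily by slicing the source buffer."""
    _, offset, length = token
    return doc[offset:offset + length]


def replace_token(doc: str, token: Token, new_text: str) -> str:
    """Update one token by copying the leading and trailing buffers as-is."""
    _, offset, length = token
    return doc[:offset] + new_text + doc[offset + length:]


doc = "<item><title>Old</title><price>10</price></item>"
tokens = tokenize(doc)
title_text = tokens[2]  # the text token inside <title>
updated = replace_token(doc, title_text, "New")
```

Note that after `replace_token` the offsets of all later tokens are stale if the new text has a different length, so a real implementation would have to shift or rebuild them.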

While it sounds interesting (and is not really new, being a sort of remake of the idea of parsing XML with regexps), there are lots of problems with "non-extractive" parsing. XML in general doesn't really fit into that paradigm. Entities and inclusions, encoding issues, comments, CDATA sections and default values declared in a DTD all screw up the idea, since the raw characters at a given offset no longer necessarily match what a parser would report. Unfortunately, that happens with optimization techniques quite often: they tend to oversimplify the problem. It will probably work only with a very limited subset of XML, and its fruitfulness still needs to be proven.
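A small illustration of the entity and CDATA problem, sketched in Python with the standard `xml.etree.ElementTree` module. The sample document and variable names are made up for the example; the point is only that slicing the source by offset gives different characters than the parser reports.

```python
import xml.etree.ElementTree as ET

# Sample document (invented): one entity reference and one CDATA section.
doc = "<note>AT&amp;T <![CDATA[<b>]]></note>"

# "Non-extractive" view: the element content is just a slice of the source.
start = doc.index(">") + 1   # right after the start tag
end = doc.rindex("</")       # right before the end tag
raw_slice = doc[start:end]

# What a real XML parser reports as the element's text.
parsed_value = ET.fromstring(doc).text

print(repr(raw_slice))
print(repr(parsed_value))
```

The slice still contains `&amp;` and the CDATA markup, while the parsed value is the resolved text, so any consumer of offset:length pairs has to do that resolution itself.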

Another shortcoming of "non-extractive" parsing is the need to keep the entire source XML document accessible (obviously, offsets are meaningless with no source buffer at hand). That means buffering the whole (possibly huge) XML document in a streaming scenario (e.g. when you read XML from a network stream).

Still, that was an interesting read. Indexing an XML document, how does that sound? Using IndexingXPathNavigator it's possible to index an in-memory IXPathNavigable XML store and select nodes directly by key values instead of traversing the tree. That works, but there is still lots of room for development here. What about persistent indexes? What if XslTransform were able to leverage existing indexes instead of building its own (for xsl:key) on each transformation?
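The general idea of a key index can be sketched in a few lines of Python. This is not the IndexingXPathNavigator API, just an analogy in the spirit of xsl:key: the `build_index` helper, its parameters and the sample data are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(root, match_tag, use_attr):
    """Walk the tree once and map each key value to the matching elements."""
    index = defaultdict(list)
    for elem in root.iter(match_tag):
        key = elem.get(use_attr)
        if key is not None:
            index[key].append(elem)
    return index


doc = """<orders>
  <order id="1" customer="alice"/>
  <order id="2" customer="bob"/>
  <order id="3" customer="alice"/>
</orders>"""

root = ET.fromstring(doc)
by_customer = build_index(root, "order", "customer")

# Lookup by key value is now a dictionary access, not a tree traversal.
alice_ids = [order.get("id") for order in by_customer["alice"]]
```

The index lives only as long as the in-memory tree does, which is exactly the limitation the questions about persistent and reusable indexes are getting at.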
