High-performance, structure-shy XPath technology is available from CCIS at Northeastern

An important component of high-performance XPath technology is schema-aware XPath evaluation. When the schema is known (either a DTD or an XML schema), XPath evaluation can be made more efficient because it can be statically determined that certain parts of the object/document can be ignored. Or it can be statically determined that an XPath subexpression will never return a node and therefore that subexpression can be deleted.

This technology is useful for compiling structure-shy programs in various programming and query languages. Transformation of Structure-Shy Programs [PEPM 2007]

Northeastern's role in high-performance, structure-shy XPath

Northeastern's role in high-performance XPath comes from Adaptive Programming (a part of the Demeter project) where programs are written as adaptive methods of the form:

schema.traverse(whatToTraverse, whereToGo, whatToDo)
where whatToTraverse is an object/document conforming to the schema, whereToGo is a labeled graph specifying where to traverse and whatToDo is an object specifying what kind of processing to perform. A subset of the whereToGo specification has been incoporated in XPath. Northeastern has a Patent on Compiling XPath. The patent does not use the XPath terminology because XPath did not yet exist.

The Northeastern technology covers the following XPath axis: child, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, attribute and self.

Although the Adaptive Programming technology has been mainly used for in-store documents/objects, it is also useful for XML streams. It supports skipping over entire chunks of streams that contain irrelevant information.

The topic of high-performance XPath compilation is clearly being looked at by research and industry. Evidence follows below. So Northeastern's technology is valuable in an IP sense. We would like to turn this interest in high performance XPath into a mutually beneficial research collaboration with the goal to improve the performance of XQuery and any of the many technologies that rely on XPath.

Evidence that Northeastern's technology is valuable in an IP sense

Microsoft

.NET Framework Developer Center

Also available directly from: http://msdn2.microsoft.com/en-us/library/bb308960.aspx#xlinqoverview_topic5b

Microsoft works on: Schema-Aware XML Programming

The above document states: LINQ to XML uses a generic tree type: XElement. Hence, XML trees are essentially processed in an untyped manner. This situation can be improved substantially if there is metadata that can be used to generate Common Language Runtime types that contain the knowledge of how the XML is structured and the appropriate simple types. XML Schema can be leveraged for exactly this purpose.

LINQ to XML supports the Descendents operator and this operator can be implemented efficiently in a schema-aware context using Northeastern's technology.

Oracle

Oracle's Mark Scardina describes the compilation of XPath expressions using a DTD or XML schema into index graphs. He describes a synchronous data model traversal as used in the compilation of adaptive methods.

Oracle Use of Adaptive Programming Compilation Technology

Local Copy: Oracle Use of Adaptive Programming Compilation Technology

The high performance part of the Northeastern technology comes from using the schema to speed the XPath processing as described in the Oracle document.

In the Oracle Technology Network Mark Drake writes:

"...understand that in 10gR2 XML DB you will need to use schema based storage to get high performance, and that it doesn't matter whether you use XPath or XQuery, what matters is whether or not the XPath or XQuery operations get re-written correctly so the database can optimize them."

The Northeastern technology rewrites the XPath operations based on the schema and produces a new XPath expression (called the TraversalGraph) that optimizes the data access in the sense that it never accesses data that is irrelevant to the query based on the static schema information. Oracle uses these kinds of optimizations in both XDB and XDK.

Saxonica: XSLT and XQuery Processing

http://www.saxonica.com/

Michael Kay has a product that does schema-aware XPath processing. Currently he only optimizes the most common XPath expressions.

Berkeley

Michael Franklin has directed a multi-year project on high-performance XML filtering. Although schema schema information was not used, the researchers say that using schemas is expected to lead to further speedup.
@article{DAFZF03,
author = {Yanlei Diao and Mehmet Altinel and Michael J. Franklin and Hao Zhang and Peter Fischer},
title = {Path sharing and predicate evaluation for high-performance XML filtering},
journal = {ACM TODS},
volume = {28},
number = {4},
year = {2003},
pages = {467--516},
publisher = {ACM Press},
}

University of Konstanz

@article{GKT04,
author = {Torsten Grust and Maurice Van Keulen and Jens Teubner},
title = "{Accelerating XPath evaluation in any RDBMS}",
journal = {ACM TODS},
volume = {29},
number = {1},
year = {2004},
pages = {91--131},
publisher = {ACM Press},
}
This paper focuses on schema-less XML data but says about schema awareness: "lack of schema awareness may also give away possible performance improvements. ... schema information may be useful to discover empty pre/post plane regions at query compile time."

An interesting research topic is to combine index-based processing and schema-awareness.

Recent activity:

http://wiki.eclipse.org/PsychoPathXPathProcessor/UserManual

Karl Lieberherr, March, 2010.