Regression testing of XML-based applications: A Cautionary Tale about XML

Karl Lieberherr
July 4, 2000; draft 2

1. Improving Robustness of XML documents

XML-based applications take XML documents as inputs and produce XML documents as outputs, and both must satisfy certain schemas. Testing those applications means writing numerous XML documents: some serve as inputs and others as expected results to compare against the outputs. When the schema underlying the inputs and outputs changes, updating all of those documents is a tedious task. Unfortunately, XML documents encode a lot of schema information in themselves and are therefore not robust under schema changes.

A solution to this problem is to use, as an implementation technique only, a more general notation than XML to describe the documents. XML is for mark-up languages; we propose to use a grammar formalism for LL(1) languages instead. LL(1) languages are almost as easy to parse as mark-up languages, and the Java Compiler Compiler (JavaCC) from Sun provides a good solution for generating the parsers. LL(1) languages can be made much more robust to schema changes than mark-up languages. A sentence of an LL(1) language describes an entire family of objects, each object selected by a suitable schema.

The approach to robust documents works as follows: start with an XML schema X, view it as a context-free grammar, and add sufficiently many tokens to create an easy-to-read LL(1) language with grammar GLL. When X and GLL change, the GLL documents are likely to be more robust than the X documents.

A simple example illustrates the point. (It is a simplified form of a structure that is needed in a Partner (Customer) structure.)

XML schema PartnerStructure:

    PartnerStructure = List(Adjacency).
    Adjacency = Partner List(Partner).
    Partner = Name.
    Name = String.
    List(S) ~ {S}.

The LL(1) grammar is PartnerStructure-LL1:

    PartnerStructure = List(Adjacency).
    Adjacency = Partner ":" List(Partner).
    Partner = Name.
    Name = String.
    List(S) ~ "(" {S} ")".
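To make the claim about easy parsing concrete, the grammar PartnerStructure-LL1 can be handled by a small hand-written recursive-descent parser that looks at only one token of lookahead. The following is a minimal sketch in Python rather than a JavaCC grammar. It assumes that a Name is written as a bare identifier, and the function and class names (tokenize, Parser, parse_partner_structure) are hypothetical.

```python
import re

# One token is a punctuation mark "(", ")" or ":", or a bare identifier.
# Assumption: Name = String is written as an unquoted identifier.
TOKEN = re.compile(r"\s*([():]|[A-Za-z_][A-Za-z0-9_]*)")
PUNCTUATION = {"(", ")", ":"}


def tokenize(text):
    """Split the document text into a list of tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Recursive-descent parser; every decision uses one token of lookahead."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, literal):
        if self.peek() != literal:
            raise SyntaxError(f"expected {literal!r}, found {self.peek()!r}")
        self.pos += 1

    def name(self):
        # Partner = Name.  Name = String.
        token = self.peek()
        if token is None or token in PUNCTUATION:
            raise SyntaxError(f"expected a name, found {token!r}")
        self.pos += 1
        return token

    def list_of(self, parse_item):
        # List(S) ~ "(" {S} ")".
        self.expect("(")
        items = []
        while self.peek() not in (")", None):
            items.append(parse_item())
        self.expect(")")
        return items

    def adjacency(self):
        # Adjacency = Partner ":" List(Partner).
        partner = self.name()
        self.expect(":")
        return (partner, self.list_of(self.name))

    def partner_structure(self):
        # PartnerStructure = List(Adjacency).
        result = self.list_of(self.adjacency)
        if self.peek() is not None:
            raise SyntaxError(f"trailing input: {self.peek()!r}")
        return result


def parse_partner_structure(text):
    return Parser(text).partner_structure()
```

Each grammar rule maps to one method, and the only lookahead decision is whether the next token closes a list, which is what makes the language LL(1).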
An LL(1) document Example-PartnerStructure-LL1 looks like:

    (
      Huber : (Maier Mueller Schmid)
      Maier : (Vogt Hauser Naef)
      ...
    )

while the corresponding XML document Example-PartnerStructure is much more verbose:

    Huber Maier
    // etc. for Mueller Schmid and for Maier

Now consider the schema evolution to:

    PartnerStructure = List(PartnerConnection).
    PartnerConnection = Partner ":" List(RelatedPartner).
    RelatedPartner = [RelationshipKind] Partner.
    RelationshipKind : Subsidiary | Father | Child.
    …
    Partner = Name.
    Name = String.
    List(S) ~ "(" {S} ")".

While Example-PartnerStructure-LL1 does not need any updating, Example-PartnerStructure requires a lot of updating: the LL(1) document mentions no class names, and the new RelationshipKind part is optional, whereas the XML document spells out the renamed classes in its mark-up.

(End of example.)

What does this mean for XML projects? The documents should be maintained in a more robust non-XML form, and the XML representation should be generated from it. But this raises the question: why not use LL(1) languages instead of mark-up languages? Then we would not have to maintain two schemas for the same language.

Proposed Tool: Adaptive XML

    Input: XML schema S and documents D with respect to S. New schema S'. Also SLL and SLL' are input (see Precondition).
    Output: New documents D' with respect to S'.
    Precondition: There is an LL(1) language SLL extending S and an LL(1) language SLL' extending S' so that S' includes S.

Advantage of such a tool: many XML documents can be maintained automatically.

Improvement to the Adaptive XML tool: it seems that SLL and SLL' could be computed automatically. This would make the use of LL(1) languages largely transparent to the user.

We have ideas for other ways to simplify the maintenance of XML documents using translation-by-example techniques.
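The core step of such a tool, regenerating XML from an unchanged LL(1) document, might be sketched as follows. This is only an illustration under several assumptions, not the proposed tool itself: the XML element names are taken directly from the class names of the two schemas, the relationship kinds are written as the keywords Subsidiary, Father and Child, names are bare identifiers, and the functions parse, add_partner and to_xml are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed concrete spellings of the RelationshipKind alternatives.
KINDS = {"Subsidiary", "Father", "Child"}
PUNCTUATION = {"(", ")", ":"}


def parse(text):
    """Parse a document of the evolved LL(1) grammar.

    Documents written for the original grammar are accepted unchanged,
    because the RelationshipKind part is optional.
    """
    # Sketch only: characters outside the token set are silently skipped.
    tokens = re.findall(r"[():]|[A-Za-z_]\w*", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        pos += 1
        return token

    def name():
        token = take()
        if token in PUNCTUATION:
            raise SyntaxError(f"expected a name, found {token!r}")
        return token

    def related_partner():
        # RelatedPartner = [RelationshipKind] Partner.
        kind = take() if peek() in KINDS else None
        return (kind, name())

    def list_of(parse_item):
        # List(S) ~ "(" {S} ")".
        take("(")
        items = []
        while peek() not in (")", None):
            items.append(parse_item())
        take(")")
        return items

    def connection():
        # PartnerConnection = Partner ":" List(RelatedPartner).
        partner = name()
        take(":")
        return (partner, list_of(related_partner))

    result = list_of(connection)
    if peek() is not None:
        raise SyntaxError(f"trailing input: {peek()!r}")
    return result


def add_partner(parent, name):
    partner = ET.SubElement(parent, "Partner")
    ET.SubElement(partner, "Name").text = name


def to_xml(connections, evolved):
    """Emit XML for the original schema (evolved=False) or the new one."""
    root = ET.Element("PartnerStructure")
    for partner, related in connections:
        item = ET.SubElement(root, "PartnerConnection" if evolved else "Adjacency")
        add_partner(item, partner)
        for kind, other in related:
            parent = item
            if evolved:
                parent = ET.SubElement(item, "RelatedPartner")
                if kind is not None:
                    ET.SubElement(parent, "RelationshipKind").text = kind
            # The original schema has no place for a kind, so it is dropped.
            add_partner(parent, other)
    return ET.tostring(root, encoding="unicode")
```

For example, to_xml(parse("( Huber : (Maier Mueller) )"), evolved=False) and the same call with evolved=True produce documents for the old and the new schema from one unchanged source text, which is the robustness property the note argues for.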