This file is the documentation for the JSR31 project.

Authors: Yong Shao, Caizhi Zhu
Date: Fall, 2000

1. MOTIVATION

This project provides an XML data binding facility for the Java platform. It compiles an XML schema into Java classes and extracts the constraints from the schema. The result can then be used to validate an XML document.

The only existing Java APIs for manipulating XML are the low-level SAX parser API and the somewhat higher-level DOM parse-tree API. Working with them directly is generally tedious and error-prone. The resulting code is likely to contain many redundancies and to be hard to maintain as the XML schema evolves.

2. INTENT

The goal of this project is to tackle the problem using an adaptive approach. Instead of writing code for each new schema, we write one adaptive program that can compile any XML schema without any change to the code. The generated code can then be used to check whether an XML document conforms to the schema.

3. DESIGN

The scheme is as follows. A DemeterJ project has these files: a project file, a class dictionary (cd) file, one or more behavior (beh) files, and an input file. We divide the project into two steps: parse and validate.

In the parse step, we write a class dictionary that covers a subset of the features of XML schema, and behavior files that extract the constraints from the schema. The XML schema is the parse.input. The parse step generates two new files: a new class dictionary that defines all XML documents that conform to the parsed XML schema, and a behavior file that contains the validation methods derived from the constraints.

The validate step now has a cd file and a beh file. Given a project file, it can read any XML document as its input file and validate whether the document conforms to the XML schema.

Compared with DTD, XML schema is far more expressive and has a huge collection of features. There are two principles in our design:

1). The features must be implemented correctly. The program strictly follows the XML schema definitions. We make no compromise that modifies XML schema to make our programming easier. This guarantees that more features can be built on this project. Otherwise the project would break when new features are added.

For example, the namespace concept is one of the core features of XML schema. An XML schema may use more than one namespace, which extends its expressive power. Implementing namespaces requires an LL(2) or even LL(3) parser, or a sophisticated workaround in the class dictionary. Although hardcoding the namespace would make the project much easier to implement, it does not work in the real world, and it restricts the expressive power of XML schema.

Another example is XML datatypes. It would be easy to assume that all XML datatypes are strings and that everything else can be parsed from there. That simplifies (actually over-simplifies) the project, but it does not reflect the true power of XML schema, because XML schema has many datatypes and that is exactly why it is so expressive. We implemented as many XML schema datatypes as we could under the restrictions of the cd and the timeframe of the project.

2). Because of limited time and resources, it is impossible to implement all the features. We chose a subset of features that are representative and feasible within those limits. Basic but very useful features such as min/max occurrence and range restriction could be implemented in this short time, and they are therefore part of our project.

We take a mixed approach to generating the new files. We write to a string buffer as we go, and at the end we write the buffer out. This adds some flexibility to manipulate the string and the final output, which is extremely useful when inserting attributes into the new cd file.

4. IMPLEMENTATION

We take a modified progressive approach to this project. We think it is very important to understand XML schema and the whole picture in order to write a correct and truly adaptive cd. This provides consistency and avoids rewriting the cd when new features are added. The features are all present in the cd, but they are implemented one at a time.

We made two simplifications to make it possible to finish the project in time. First, we assume the first element in the schema is the top-level element. This is reasonable in practice, and it would not be hard to enhance the program to drop this assumption, given enough time. The second simplification arises from the difficulty of expressing an XML string. We have to add quotes around XML strings so that the class dictionary can parse the input. If the cd could express space-separated strings, this assumption would be unnecessary. We tried Ident and Text for this datatype, and neither worked.

5. RESULT & DISCUSSION

1) Features Implemented

a. XML namespace: the namespace is optional. In our implementation, it is possible to have more than one namespace.

b. min/max occurrence constraint: if minOccurs = 0, the part is optional; if maxOccurs > 1, the part is a list. The generated cd reflects this constraint, and the validator checks the occurrence of an element.

c. Range restrictions: including (min/max)(Inclusive/Exclusive). This constraint is not restricted to integers in our implementation. It can handle float, double, and many other XML datatypes, although we did not have time to implement all the code.

d. Datatypes: we can handle several of the true XML datatypes, such as string, decimal, integer, positiveInteger, float, and date, and it is easy to add more XML datatypes to our project.

2) Testing Results

Steps to run the program:

a. Under the directory where the proj.tar file is extracted, type "demjava test" or "demeterj".
This will parse the XML schema in the parse.input file and generate the new cd and beh files for the validation. These files are put under the validator directory (generated at run time).

b. Type "cd validator" to go to the validator directory.

c. There are three input XML documents for testing: input1 is an XML document that conforms to the schema; input2 is an invalid XML document that violates the maxOccurs constraint (maxOccurs is 5, the document has 7, so two error messages are expected); and input3 is an invalid XML document that violates the range restriction. To run a test, copy its input file to validate.input, for example by typing "cp input1 validate.input".

d. Type "demjava test" or "demeterj" again. If the input is correct, there will be no error message. Otherwise an appropriate message will appear on the screen.

6. FUTURE DEVELOPMENT

This is a very interesting and challenging project. Because of time and resource constraints, we could implement only some of the features. There are things we planned but did not have time for, and other very interesting things we did not get to. We could also refine our implementation if we had more time. Here is a brief listing:

a. Attributes: This is a basic feature of XML schema and XML documents. We implemented attributes in the class dictionary file, but did not have time to write the code that inserts the attributes into the new cd file. This should not be hard. The only trick is that because attributes appear after elements in the schema, they have to be inserted into text that has already been written. Therefore an offset (the string buffer length) needs to be recorded in order to insert the attributes in the appropriate place.

b. More XML datatypes: Such as long, short, recurringDuration, and even user-defined datatypes. Given our class dictionary, this is not hard to do.

c. More sophisticated constraints: Constraints such as pattern match and enumeration. Our class dictionary doesn't have enumeration (neither does Java). It would be interesting to see how these could be implemented.

d. More elaborate error messages: The error messages are very primitive because of the time constraint. More elaborate error messages would help not only the users, but also the developers during testing.

Acknowledgment: We'd like to thank Prof. Lieberherr for his guidance, help, and patience during the project. We also want to thank Doug Orlean for his prompt help in the project.
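
Illustration of the occurrence and range constraints: the min/max occurrence and range checks described in section 5 can be sketched in plain Java as below. This is only a minimal sketch under our own assumptions, not the code that the parse step generates. The class name ConstraintSketch, the method names checkOccurrence and checkRange, and the message wording are all hypothetical. We also assume one message per element beyond maxOccurs, which matches the two messages expected for input2 (maxOccurs is 5, the document has 7).

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the generated validator.
final class ConstraintSketch {
    private ConstraintSketch() {}

    // Occurrence check: one message if there are too few occurrences,
    // and one message per occurrence beyond maxOccurs (assumed behavior).
    static List<String> checkOccurrence(String name, int count, int minOccurs, int maxOccurs) {
        List<String> errors = new ArrayList<>();
        if (count < minOccurs) {
            errors.add(name + ": found " + count + ", expected at least " + minOccurs);
        }
        int excess = count - maxOccurs;
        for (int i = 1; i <= excess; i++) {
            errors.add(name + ": occurrence " + (maxOccurs + i) + " exceeds maxOccurs=" + maxOccurs);
        }
        return errors;
    }

    // Range check for any comparable datatype (Integer, Double, BigDecimal, ...).
    // A null bound means that facet is not set in the schema.
    static <T extends Comparable<T>> List<String> checkRange(
            String name, T value, T minInclusive, T minExclusive, T maxInclusive, T maxExclusive) {
        List<String> errors = new ArrayList<>();
        if (minInclusive != null && value.compareTo(minInclusive) < 0) {
            errors.add(name + ": " + value + " is below minInclusive " + minInclusive);
        }
        if (minExclusive != null && value.compareTo(minExclusive) <= 0) {
            errors.add(name + ": " + value + " is not above minExclusive " + minExclusive);
        }
        if (maxInclusive != null && value.compareTo(maxInclusive) > 0) {
            errors.add(name + ": " + value + " is above maxInclusive " + maxInclusive);
        }
        if (maxExclusive != null && value.compareTo(maxExclusive) >= 0) {
            errors.add(name + ": " + value + " is not below maxExclusive " + maxExclusive);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Seven occurrences against maxOccurs = 5, as in the input2 test.
        checkOccurrence("item", 7, 0, 5).forEach(System.out::println);
        // An integer outside an inclusive range, similar in spirit to input3.
        checkRange("quantity", 120, 1, null, 100, null).forEach(System.out::println);
        // The same check works for a decimal with an exclusive lower bound.
        checkRange("price", new BigDecimal("0.00"), null, BigDecimal.ZERO, null, null)
                .forEach(System.out::println);
    }
}
```

Because checkRange is written against Comparable, the same method covers integer, float, and decimal values, which mirrors the remark in section 5 that the range constraint is not restricted to integers. An empty list means the value or element count satisfies the constraint.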