------------------------------------------------- XML and Databases Fall 2000 Final Project COM 3360 Adaptive Object Oriented Software Design Prasenjit Adak (adakp@ccs) Stylianos Paparizos (stelios@ccs) ------------------------------------------------- Project Directory ================== /course/com3360/f00/project-dir/XMLandDB/ or /proj/adaptive3/lieber/com3360-2000/XMLandDB/ Description =========== The project involves an adaptive application allowing user interaction with a database using XML input and output. The user can submit information into the database or request information to be retrieved from it. We are assuming that there exists a relational database and the data publisher wants to maximize access to the information lying in it. So the database administrator has given us the necessary information concerning the data structure. The communication with the database is established using JDBC. So a data publisher can adapt our project to his database by just providing the necessary JDBC driver and the data structure. The use of JDBC allows the same behavior for other data collection types like spreadsheets, text files and other structured formats. The user's input is converted to a SQL statement on the fly, and then applied to the database through a JDBC connection. This allows a user-friendly input form so the user does not need to know specific SQL commands. For a query request, the output produced by JDBC is converted to an XML format and returned to the user. The application allows for almost every SQL query to be executed. It has only to be adapted to the user friendly XML input format. But it will also allow the user to request data using transitive closure. So the application will respond to a request like "show me all employees and the employees that he/she manages" in a hierarchical form. This can return a tree of infinite depth and there is no standard SQL command available for achieving this. Out of the well known database management vendors only Oracle has a specific command that can retrieve information in this format. For testing our implementation, we used a MS Access database with two tables: employee : emp_id, emp_name, dept_id, mgr_id, salary department : dept_id, dept_name The file xmldb.mdb can be found in the project directory. The input ========= Structure of the input request ------------------------------ The outermost tags bounding an input request must be and . The complete request consists of two main parts, as shown below: Connection-parameters Action The Connection-parameters section has to be bounded by the tags and , and must contain all of the following 4 data items, in the given order. The values have to be specified as strings (i.e. enclosed within double-quotes). i. the vendor name ii. the URL for the database iii. the userid for connecting to the database, and iv. the password for connecting to the database The vendor name must already be included in the class-dictionary. The JDBC driver and SQL statement formats also must be specified in the class-dictionary. A sample entry is given below: "generic" "jdbc:odbc:xmldb" "anonymous" "guest" In this example, the name of the vendor is "generic", the URL for the database is "jdbc:odbc:xmldb", and the userid/password to be used for connecting to the database is "anonymous" and "guest". The vendor name is just a logical name for grouping a set of SQL statement formats and associating a JDBC driver to the group. It doesn't have to correspond to real-world vendor names. The Action part may specify a Query or an Insert. A Query has several parts and has the following structure: TableName1
ColumnName1,1 : ColumnDataType1,1 ... ColumnName1,m : ColumnDataType1,m [ TableName2
ColumnName2,1 : ColumnDataType2,1 ... ColumnName2,n : ColumnDataType2,n ] [ Condition ] [ [ Tablenamex
] Columnx ... [ Tablenamey
] Columny
] [ [ ParentTable
] ParentColumn
[ ChildTable
] ChildColumn
RelationColumn
]
The only mandatory data item in a query specification is the first table name and its associated columns. Everything else (shown enclosed within [ and ] above) is optional. At most two tables, and any number of columns belonging to the tables may be specified in the query request. The currently recognized column data types are Integer, Float and Text. These values must be specified exactly like this, and must not be enclosed within double-quotes. All other data items in this section must be enclosed within double quotes. The table names in the Order section, and the Parent, Child sections under ShowHierarchy are optional and may be omitted if the corresponding column name is unambiguous. The Condition section is used to specify a condition string which may actually consist of any number of atomic conditions connected by and/or and parentheses. Any double quote in the condition string should be replaced by a single quote, for correct parsing of the input. The Order section can be used to mention a list of columns by which the resultset is to be ordered. The ShowHierarchy section can be used to transform a linear resultset obtained directly from a database into a hierarchical form. It can be used to achieve transitive closure using two columns (e.g. employee-id and manager-id in the employee table) in the same table, or to create an inner join using two columns belonging to two different tables. The column corresponding to the enclosing entity (e.g. manager) is to be specified as the Parent, while the other column (e.g. a subordinate employee) must be specified as the Child. The name of the virtual column (obviously belonging to the parent entity) that represents the one-to-many relationship between the parent and the child (e.g. manages) is specified as the RelationColumn. An Insert specification is simpler and consists of the following parts: TableName
ColumnName1 : ColumnDataType1 ColumnValue1 ... ColumnNamen : ColumnDataTypen ColumnValuen
Only one table (but any number of columns belonging to that table) can be specified in an insert request. A value corresponding to each column must also be specified. As in the case of query requests, all data items (except for the column data type) must be given as a string (i.e. enclosed within double quotes). Even if a column value is intended to be of a numeric data type, it must be specified as a string, but the column data type specification should reflect the appropriate type (Integer or Float for numeric values, and Text for textual values including dates). The table and column names must all be already defined in the class-dictionary, otherwise the request will not be accepted. As in XML, there is no restriction about including white space around the tags and data values. Input Files ----------- There are 4 input files and the corresponding output files in the base directory using which the program can be tested: Query1.input - a simple select from the "employee" table; Query2.input - a select from the same table, but one which requests for a transitive closure using the columns mgr_id and emp_id; Query3.input - a query requesting for a join between the tables "employee" and "department" with the "dept_id" column (present in both tables) acting as the bridge, Insert.input - a simple insert statement into the employee table. Limitations =========== The current implementation of our project has certain limitations. The project idea is something very general and an application such as this might require a big team of people working for several months to make a commercially viable product. 1. It does not validate the input request from the user. It only checks its syntax. We assume that the user knows the database structure that is provided by the data publisher. We considered XML validation to be out of the scope of this project. 2. The current support for relationships is limited to only one-to-one or one-to-many relationships. There is no support for many-to-many relationships. The main reason for that would be that the XML output needs special consideration in the case of many-to-many relationships. An XML document is very similar to a hierahical database, and many-to-many relationships seem almost impossible to implement. 3. Also the application currently only allows relationships between either one or two entities. A relationship between an entity and itself gives rise to a transitive closure, while a relationship involving two entities is viewed as a join. 4. Currently, the parent (or outer) table in a join query request has to be specified first in a query request to produce correct XML output. 5. It only supports relationships defined using a single attribute. 6. The application does not support computed columns. Possible extensions: ==================== There are some proposed extensions for the project. 1. It might be useful to overcome some of the limitations mentioned in the above section. 2. The best interface for an adaptive application is the web. The application allows a java servlet to communicate with it. The input can be an InputStream or a String. Developping a corresponding java servlet was part of our initial goal but we had to leave it out because of time constraints. 3. The use of Metadata provided by JDBC might allow for on the fly evaluation of the underlying database connection structure. This would minimize the information needed to be given by the data publisher to the database url and driver only. This suggestion is highly optimistic since JDBC does not work all the times as promised by Sun. 4. The output of the program does not produce a DTD or XML schema definition. A possible XML schema would be ideal with XML validation software. 5. It would be nice to have an option for printing a singly-rooted hierarchy of objects starting with a specified entity as the root. All siblings of this entity (which currently appear as roots of parallel trees) would be dropped from the display. This would work well in transitive closure cases to submit a request like "show the organization chart for all employees under Jesse Jackson". 6. The JDBC connections currently allow only queries that return some output or insert queries. A possible extension would be to allow update queries and other not so often cases. We did not implement that because usually a user wants information out of the database. The updates and changes of structure are being done by the data publisher. Also more reasonable messages concering the SQL failures would help the user in a real-world application.