-------------------------------------------------
XML and Databases
Fall 2000 Final Project
COM 3360 Adaptive Object Oriented Software Design
Prasenjit Adak (adakp@ccs)
Stylianos Paparizos (stelios@ccs)
-------------------------------------------------
Project Directory
==================
/course/com3360/f00/project-dir/XMLandDB/
or
/proj/adaptive3/lieber/com3360-2000/XMLandDB/
Description
===========
The project involves an adaptive application allowing user interaction
with a database using XML input and output.
The user can submit information into the database or
request information to be retrieved from it.
We are assuming that there exists a relational database
and the data publisher wants to maximize access
to the information lying in it. So the database administrator
has given us the necessary information
concerning the data structure.
The communication with the database is established using JDBC.
So a data publisher can adapt our project
to his database by just providing the necessary JDBC driver
and the data structure. The use of JDBC allows
the same behavior for other data collection types
like spreadsheets, text files and other structured formats.
The user's input is converted to a SQL statement on the fly, and
then applied to the database through a JDBC connection.
This allows a user-friendly input form so the user does
not need to know specific SQL commands.
For a query request, the output produced by JDBC is converted
to an XML format and returned to the user.
The application allows for almost every SQL query to be
executed. It has only to be adapted to the user friendly
XML input format. But it will also allow the user
to request data using transitive closure.
So the application will respond to a request
like "show me all employees and the employees that he/she manages"
in a hierarchical form.
This can return a tree of infinite depth and there
is no standard SQL command available for achieving this.
Out of the well known database management vendors only Oracle has a
specific command that can retrieve information in this format.
For testing our implementation, we used a MS Access database with two tables:
employee : emp_id, emp_name, dept_id, mgr_id, salary
department : dept_id, dept_name
The file xmldb.mdb can be found in the project directory.
The input
=========
Structure of the input request
------------------------------
The outermost tags bounding an input request must be
and . The complete request
consists of two main parts, as shown below:
Connection-parameters
Action
The Connection-parameters section has to be bounded
by the tags and , and must contain
all of the following 4 data items, in the given order.
The values have to be specified as strings
(i.e. enclosed within double-quotes).
i. the vendor name
ii. the URL for the database
iii. the userid for connecting to the database, and
iv. the password for connecting to the database
The vendor name must already be included in
the class-dictionary. The JDBC driver and SQL statement
formats also must be specified in the class-dictionary.
A sample entry is given below:
"generic" "jdbc:odbc:xmldb" "anonymous" "guest"
In this example, the name of the vendor is "generic",
the URL for the database is "jdbc:odbc:xmldb", and
the userid/password to be used for connecting
to the database is "anonymous" and "guest".
The vendor name is just a logical name for grouping a
set of SQL statement formats and associating a JDBC driver
to the group. It doesn't have to correspond to real-world
vendor names.
The Action part may specify a Query or an Insert.
A Query has several parts and has the following structure:
ColumnName1,1 : ColumnDataType1,1
...
ColumnName1,m : ColumnDataType1,m
[
ColumnName2,1 : ColumnDataType2,1
...
ColumnName2,n : ColumnDataType2,n ]
[ Condition ]
[
[] Columnx
...
[] Columny
]
[
[] ParentColumn
[] ChildColumn
RelationColumn
]
The only mandatory data item in a query specification
is the first table name and its associated columns.
Everything else (shown enclosed within [ and ] above)
is optional. At most two tables, and any number of
columns belonging to the tables may be specified in
the query request. The currently recognized column
data types are Integer, Float and Text. These values
must be specified exactly like this, and must not be
enclosed within double-quotes. All other data items
in this section must be enclosed within double quotes.
The table names in the Order section, and the Parent,
Child sections under ShowHierarchy are optional and
may be omitted if the corresponding column name is
unambiguous.
The Condition section is used to specify a condition
string which may actually consist of any number
of atomic conditions connected by and/or and parentheses.
Any double quote in the condition string should be
replaced by a single quote, for correct parsing of the input.
The Order section can be used to mention a list of
columns by which the resultset is to be ordered.
The ShowHierarchy section can be used to transform
a linear resultset obtained directly from a database
into a hierarchical form. It can be used to achieve
transitive closure using two columns (e.g. employee-id
and manager-id in the employee table) in the same table,
or to create an inner join using two columns belonging
to two different tables. The column corresponding to
the enclosing entity (e.g. manager) is to be specified
as the Parent, while the other column (e.g. a subordinate
employee) must be specified as the Child. The name of the
virtual column (obviously belonging to the parent entity)
that represents the one-to-many relationship between
the parent and the child (e.g. manages) is specified
as the RelationColumn.
An Insert specification is simpler and consists of the following parts:
ColumnName1 : ColumnDataType1
ColumnValue1
...
ColumnNamen : ColumnDataTypen
ColumnValuen
Only one table (but any number of columns belonging
to that table) can be specified in an insert request.
A value corresponding to each column must also be
specified. As in the case of query requests, all
data items (except for the column data type) must
be given as a string (i.e. enclosed within double quotes).
Even if a column value is intended to be of a numeric
data type, it must be specified as a string, but
the column data type specification should reflect
the appropriate type (Integer or Float for numeric
values, and Text for textual values including dates).
The table and column names must all be already
defined in the class-dictionary, otherwise the
request will not be accepted.
As in XML, there is no restriction about
including white space around the tags and data values.
Input Files
-----------
There are 4 input files and the corresponding output files
in the base directory using which the program can
be tested:
Query1.input - a simple select from the "employee" table;
Query2.input - a select from the same table, but one which requests
for a transitive closure using the columns mgr_id and emp_id;
Query3.input - a query requesting for a join between the tables
"employee" and "department" with the "dept_id" column (present in both
tables) acting as the bridge,
Insert.input - a simple insert statement into the employee table.
Limitations
===========
The current implementation of our project has certain limitations.
The project idea is something very general and an application
such as this might require a big team of people working for
several months to make a commercially viable product.
1. It does not validate the input request from the user. It only
checks its syntax. We assume that the user knows the database structure
that is provided by the data publisher. We considered XML validation to
be out of the scope of this project.
2. The current support for relationships is limited to only
one-to-one or one-to-many relationships. There is no support
for many-to-many relationships. The main reason for that
would be that the XML output needs special consideration
in the case of many-to-many relationships. An XML document
is very similar to a hierahical database, and
many-to-many relationships seem almost impossible to implement.
3. Also the application currently only allows relationships between
either one or two entities. A relationship between an entity
and itself gives rise to a transitive closure, while a
relationship involving two entities is viewed as a join.
4. Currently, the parent (or outer) table in a join query request has to
be specified first in a query request to produce correct XML output.
5. It only supports relationships defined using a single attribute.
6. The application does not support computed columns.
Possible extensions:
====================
There are some proposed extensions for the project.
1. It might be useful to overcome some of the limitations mentioned
in the above section.
2. The best interface for an adaptive application is the web.
The application allows a java servlet to communicate with it.
The input can be an InputStream or a String. Developping a
corresponding java servlet was part of our initial goal but we
had to leave it out because of time constraints.
3. The use of Metadata provided by JDBC might allow for
on the fly evaluation of the underlying database connection structure.
This would minimize the information needed to be given
by the data publisher to the database url and driver only.
This suggestion is highly optimistic since JDBC does not
work all the times as promised by Sun.
4. The output of the program does not produce a DTD or XML schema
definition. A possible XML schema would be ideal with
XML validation software.
5. It would be nice to have an option for printing a singly-rooted
hierarchy of objects starting with a specified entity as the root. All
siblings of this entity (which currently appear as roots of parallel
trees) would be dropped from the display. This would work well
in transitive closure cases to submit a request like "show the
organization chart for all employees under Jesse Jackson".
6. The JDBC connections currently allow only queries that return
some output or insert queries. A possible extension would be to allow
update queries and other not so often cases. We did not implement that
because usually a user wants information out of the database.
The updates and changes of structure are being done by the data publisher.
Also more reasonable messages concering the SQL failures would
help the user in a real-world application.