Next: Information Retrieval. Up: KEYNET: Fast indexing for Previous: KEYNET: Fast indexing for

Introduction.

With the expansion of the Internet and the development of new ``Information Highways,'' computer-based communication is becoming the defining technology of this decade. A number of proposals have been made to build a coherent structure over these new high-bandwidth networks to convert them into a National Information Infrastructure (NII). For example, the proposed I-95 Information Market [T+93] proposes an infrastructure that facilitates the free-market purchase, sale and exchange of services. The amount of information that will be available on the I-95 or any other NII is immense: on the order of billions of objects and hundreds of terabytes of data. Finding information in such an environment is a monumental task but essential to the success of the infrastructure. Any solution to this problem must satisfy two important requirements: it must be scalable to handle the large number of information objects and it must be semantically rich enough to support effective information retrieval (IR).

We propose an architecture and indexing strategy for a search engine that would satisfy these requirements. The KEYNET system would support information retrieval for a collection of information objects in a single subject area, such as a collection of biological research articles, a set of court cases, files containing remote geophysical sensor data, or even collections of software programs and modules. KEYNET assumes that the information objects in its collection have been annotated using small semantic networks of key concepts (keynets) that are consistent with a subject-specific concept ontology. Although there are certainly important distinctions between semantic networks, knowledge frames and objects in an object-oriented database, none of these distinctions are important for the purposes of KEYNET, so they will be used interchangeably within the context of this article.

A keynet is similar to an abstract or review of a document both in size and in being separately accessible from its corresponding document. If good tools are available, it could be generated by the author of the document with no more effort than is now taken to write the abstract or to select the keywords. It may eventually be possible to use natural language processing techniques to generate keynets, but this is only possible for textual information objects. A third possibility is to have professional annotators construct the keynets. This is less costly than one might expect. The biological research literature consists of some 600,000 articles per year. It would cost less than $30million per year to annotate this literature with keynets, a tiny amount compared to the cost of generating this literature in the first place[BF93].

To give a simple example of a keynet consider an imaginary paper entitled, ``POOQ: A parallel, object-oriented query system.'' Let's say that the paper uses some known dynamic programming algorithms to optimize queries for use on parallel machine architectures. A keyword-based approach would classify this paper by using phrases such as ``parallel algorithms,'' ``object-oriented databases,'' ``dynamic programming,'' and ``query optimization.''

Keywords have the advantage that they may include topically relevant words and phrases that do not appear in the documents themselves. Furthermore, some information objects (like images, scientific data and software) have no text that can reasonably be used by traditional IR technology. Keynets enrich the possible semantics as compared with keywords based on simple subject classification schemes by adding relationships between keywords. In addition to being more expressive than sets of keywords, keynets would also generally be larger, though still much smaller than the entire information object.

To describe a keynet for the imaginary POOQ paper above, we must first describe a hypothetical ontology for Computer Science. Suppose that one of the classes in this ontology is concerned with the concept of translation or transformation. The keynet for the POOQ paper could begin with an instance of the transformation class which is linked via an attribute edge labeled ``input'' to an object having the subtype ``declarative language,'' and so on. Using semantic networks the keynet would look something like that shown in Figure 1. In this figure, the label of each node consists of its type followed by its value (if any) separated by a colon. Since the output language is well-known, no further elaboration is needed. However, the input language, POOQ, is not well-known, so it must be specified further with attributes like its name and other attributes not shown.

The I-95 infrastructure proposal specifically mentions the need for ``content labels [on information objects] to permit users to learn about available resources.'' These content labels are designed to deal with the problem of scaling up to a full-scale NII. Keynets would provide a solution to the design of such content labels. In spite of supporting the rich semantic content possible with semantic networks, the KEYNET system uses efficient indexing techniques based on hashing to make it a fast and scalable search engine.

Since the purpose of the KEYNET system is information retrieval, the paper begins with some background on IR in section 2. Despite the perceived inefficiency of semantic network techniques in IR, there are systems that use such methods, and we discuss such related work in section 3. We then describe the overall system architecture of KEYNET.

The KEYNET system is an information retrieval system for a subject area as specified in a database called the ontology. The ontology is discussed in section 5. The objects in the collection will be called documents even though, as noted earlier, the KEYNET system works equally well with information objects that are not documents in the usual sense of this term. The most common use of KEYNET begins with a user query. The query may be expressed using natural language text which is parsed to obtain a semantic network, or else a graphic interface may be used to express the query directly in terms of semantic networks. The same ontology is used for defining the structure of semantic networks of queries as that used for keynets. For example, the query ``What systems use dynamic programming to generate C* code?'' would be translated into a semantic network like that in Figure 2.

Having converted the original query into a semantic network, the next step is to break the network into small semantic networks having a bounded size. We call these fragments probes. Fragmentation is discussed in section 6. These fragments are indexed using a hash index structure defined in section 7. The keynets are also fragmented and the fragments inserted into the index as described in subsection 7.3. The search step consists of hashing the probes and looking for matches with fragments of keynets of documents. In the example, the POOQ document would have fragments that match every probe of the query. The search algorithm is described in more detail in subsection 7.2.

We are in the process of developing a prototype KEYNET system for information retrieval on two document collections. The first is a small document collection and ontology prepared by the Biological Knowledge Laboratory [BFH+93] at Northeastern University. The second will be a full-size, randomly-generated document collection. The first collection will be used for testing recall and precision, while the second collection will be used for testing database performance. For a discussion of techniques for rapid generation of random databases see [GEBW93].



Next: Information Retrieval. Up: KEYNET: Fast indexing for Previous: KEYNET: Fast indexing for


kenb@ccs.neu.edu
Fri Jan 20 22:04:00 EST 1995