Next: References Up: KEYNET: Fast indexing for Previous: Deletion.

Summary.

KEYNET is a graph-oriented method for document indexing and retrieval. Documents must be annotated with small semantic networks that represent their key concepts. A larger semantic network (part of a subject-specific ontology) determines which node and link types (basic concepts and relations) are considered pertinent to a subject during document retrieval.

The query graph actually used for retrieval may substitute more general or specific concepts for those specified by the user. Retrieval does not match large components of the query graph against whole keynets of documents. Instead, the query graph is fragmented into small probes of bounded size. These fragments are matched against document keynets, and resulting retrieval sets of potentially relevant documents are combined using fragment-oriented weights.

The graph representations, their fragmentation, and post-retrieval merging of document sets associated with distinct fragments naturally introduce a "fuzziness" appropriate to information retrieval notions of relevance, and facilitate use of parallel processing resources (at appropriate stages). Although it was not the inspiration for KEYNET, it is interesting to compare it with human memory. Like human memory it is associative and semantically rich. It is also fault-tolerant: loss of a few probes during retrieval or loss of some of the buckets would make it somewhat harder to find a document but would not prevent it. It accomplishes this fault-tolerance by randomly distributing many index terms for each document throughout the entire index. Human memory is believed to have a similar feature. Human memory is highly parallel, and in the parallel version of the KEYNET system, the index has a fixed size with memories fading rather than abruptly disappearing.

The KEYNET system employs a number of optimizations to ensure that it scales up to large document collections and so that it has high performance. Fragmentation combined with pattern associativity of graph structures and linear hashing techniques produces tractable complexity of communications and computation, despite necessary isomorphism testing and index manipulation. The system is therefore compatible with the requirements for search engines needed in proposals for an NII.

The proposed retrieval mechanism can be applied to instances of itself, producing a quasi-encyclopedic classification of documents. The following diagram suggests how one could organize search engines in the NII:

The top-level search engine appears to be a ``hotspot'' in this scheme, but it would not have to be consulted for every query. Most individuals would only use a few search engines and would not need to consult the top-level engine once the locations of these engines were known.

Using currently available technology, each search engine could support collections having one million or more documents. The top-level search engine could, in turn, support one million search engines. The entire structure would therefore index documents having an aggregate storage size of bytes, i.e., 10,000 Terabytes.



Next: References Up: KEYNET: Fast indexing for Previous: Deletion.


kenb@ccs.neu.edu
Fri Jan 20 22:04:00 EST 1995