Next: Indexing Strategy Up: A unified approach to Previous: The KEYNET Architecture

The KEYNET Model

Both queries and content labels are represented using a data structure called a keynet. Intuitively, a keynet is semantically intermediate between a keyword and a semantic network. We build up the formal definition of a keynet in stages. The mathematical basis for the KEYNET model is first presented. Then the notion of a keynet ontology is defined, and it is used as the basis for defining the concept of a keynet. The keynet structure is used as the data structure for content labels, queries, index terms, etc.

The underlying mathematical structure on which keynets are based is the directed graph. See [Section 5.4] for the basic definitions. A directed graph consists of a set of vertices and a set of (directed) edges that link one vertex with another (possibly the same) vertex. Vertices and edges will generally have additional structure such as textual labels, informal explanations, etc. Edges, in particular, will be labeled with type information. When drawing a directed graph, one uses boxes for vertices and solid arrows for edges, each of which is labeled with some of the additional structure belonging to it. The dimension of a directed graph is the number of edges it has. A directed graph is connected if every pair of vertices can be linked by a sequence of edges (not necessarily all going in the same direction).

As we mentioned earlier, a KEYNET system requires a subject-specific knowledge model or ontology. The word ontology literally means ``a branch of metaphysics relating to the nature and relations of being.'' Our use of the word is much more restrictive, dealing only with the nature of, and relationships among, concepts within a narrow subject area. Attempts to specify ontologies for scientific disciplines are very common, with most disciplines having some kind of subject classification scheme by this time.

The KEYNET system depends on having a background ontology that defines the structure and behavior of keynets. A keynet ontology consists of three parts:

  1. A directed graph called the schema.
    1. The vertices can be regarded as the set of of the ontology. They consist primarily of the subject classification categories for the subject covered by the corpus. However, a vertex can also be a property or attribute of an information object, such as an author, its year of publication, etc.
    2. The links of the schema join conceptual categories. They are grouped into link types. The links in one link type share a common semantics. The best known example of a link type is the ISA link type, which is used to define the concept hierarchy for a subject classification. Other link types include ``part of,'' ``elaboration of,'' ``cause of,'' etc.
  2. A set of terms or concepts, called the lexicon or thesaurus. One think of the terms either as lexical terms, and hence the keywords of the ontology, or on higher semantic level as concepts (each of which can be expressed in many ways, by words or phrases). While a lexical term is often a word or phrase, it can also be a person, a number, or a range of names or numbers.
  3. A many-to-many relation from the lexicon to the set of conceptual categories that specifies which category (or categories) a lexical term specializes or instantiates. Each lexical term is required to be an instance of at least one conceptual category.
The Unified Medical Language System (UMLS) [HL93] of the National Library of Medicine is an example of a subject-specific ontology. The UMLS ontology is more complex than a keynet ontology; for example, it supports lexical relationships like synonymy as well as relationships on the concept level. However, the core of the UMLS ontology can be regarded as a keynet ontology.

The reason for splitting concepts into the schema and lexicon levels is pragmatic. While there will generally be only about 100 conceptual categories and even fewer link types, there will typically be on the order of 100,000 lexical terms. By defining links only at the schema level, where there are fewer conceptual categories to deal with, it is easier to understand the ontology.

One commonly used IR mechanism is the use of attributes to identify documents. For example, the names of authors or words from the title. These are represented in the ontology with ``Author'' and ``Title'' concept categories. The lexicon would have the names of authors and (key)words from titles (or ranges of these if the number of authors or title words was too large). Ranges can also be used to express ``wild card'' matches.

The notion of a keynet ontology is the basis for the concept of a keynet. A keynet conforming to a keynet ontology consists of the following:

  1. A directed graph.
    1. The vertices of the directed graph are copies of vertices of the ontology. There can be several copies in a keynet of one vertex in the ontology.
    2. The edges of the directed graph are copies of edges in the ontology. Each edge of the keynet must link copies of the source and destination vertices of the corresponding edge in the ontology: the keynet must faithfully reflect the corresponding structure in the ontology.
  2. A set of lexical terms that are copies of lexical terms in the ontology.
  3. A one-to-one function from lexical terms to vertices. This function specifies the keynet vertex that is being instantiated by each lexical term. Unlike the ontology where the relationship between lexical terms and categories is many-to-many, each category vertex in a keynet may be instantiated at most once by a lexical term, and every lexical term instantiates exactly one vertex.

For example, the following keynet would be used for a content label of the document ``High-performance, distributed information retrieval,'' by John Smith and Jane Doe:

The examples given so far have not made use of any edges to link conceptual categories. In Figure 2, there is an example of a keynet conforming to the UMLS ontology.

This keynet can be expressed in English in several ways. Its expression as a query using English automatically generated using stock phrases is ``Find all documents in which the plant potato has an acquired abnormality bunion.'' Its expression as a statement in a content label is ``The potato can exhibit an abnormality functionally similar to bunions.'' Finally a more succinct colloquial expression would be: ``Can potatoes get bunions?'' While this may seem to be a somewhat whimsical query, it makes perfectly good sense, and we will later discuss in section 6 how a KEYNET system would understand and respond to such a query.

Keynets are used not only for content labels and queries but also for index terms. However, we do not use arbitrary keynets as index terms because that would result in a ``combinatorial explosion'' of possibilities. Rather we restrict attention to a special kind of keynet. A fragment of a keynet is a connected subset of the keynet. In other words, a fragment consists of a choice of vertices, edges and lexical terms from a keynet such that the set of vertices and edges so chosen forms a connected directed graph. For example, this is a fragment of the keynet in Figure 2:

and here is another:

However, the keynet

is not a fragment because it is not connected.

The main result for fragments is the fact that the number of fragments having small dimension grows linearly with the size of the keynet from which they are taken.

For a proof of the theorem, see [BS94]. The bound in the theorem is actually a worst-case bound, and in practice one does much better, typically just 4 times the number of vertices of the keynet. A fragment having exactly one edge is called a clasp, while a fragment having exactly two edges is called a double clasp.

The reason for limiting fragments to be at most double clasps is not just pragmatic. It arises from a well-known result from Cognitive Science known as the rule[Mil56]. This rule states that the maximum number of conceptual ``chunks'' that can be manipulated by a person at one time is in the range 5 to 9 chunks. A clasp consists of 3 to 5 parts (the two vertices, the edge and up to two instances of the vertices), and a double clasp consists of 5 to 8 parts. Therefore limiting to just one vertex or a single clasp would be too limiting, while a double clasp almost exactly matches the rule.

Intuitively, retrieval in the KEYNET system proceeds by matching fragments of a query (called probes) with fragments of content labels (called index terms or index fragments). The degree of relevance is then determined by a similarity measure between the vector of probes and the vector of index terms. For this to scale up to large corpora, the number of index terms must not be too large. In particular, if we allowed arbitrary subsets of a content label to be index terms, then the number of index terms would grow exponentially in the size of the content label. The theorem assures us that the size of the index will be manageable.



Next: Indexing Strategy Up: A unified approach to Previous: The KEYNET Architecture


kenb@ccs.neu.edu
Fri Jan 20 21:43:28 EST 1995