Both queries and content labels are represented using a data structure called a keynet. Intuitively, a keynet is semantically intermediate between a keyword and a semantic network. We build up the formal definition of a keynet in stages. The mathematical basis for the KEYNET model is first presented. Then the notion of a keynet ontology is defined, and it is used as the basis for defining the concept of a keynet. The keynet structure is used as the data structure for content labels, queries, index terms, etc.
The underlying mathematical structure on which keynets are based
is the directed graph.
See [Section 5.4]
for the basic
definitions.
A directed graph consists of a set of vertices and a set of (directed) edges
that link one vertex with another (possibly the same) vertex.
Vertices and edges will generally have additional structure such as textual
labels, informal explanations, etc.
Edges, in particular, will be labeled with type information.
When drawing a directed graph, one uses boxes for vertices and solid
arrows for edges, each of which is labeled with some of the
additional structure belonging to it.
The dimension of a directed graph is the number of edges it has.
A directed graph is connected if every pair of vertices can be
linked by a sequence of edges (not necessarily all going in the same
direction).
As we mentioned earlier, a KEYNET system requires a subject-specific knowledge model or ontology. The word ontology literally means ``a branch of metaphysics relating to the nature and relations of being.'' Our use of the word is much more restrictive, dealing only with the nature of, and relationships among, concepts within a narrow subject area. Attempts to specify ontologies for scientific disciplines are very common, with most disciplines having some kind of subject classification scheme by this time.
The KEYNET system depends on having a background ontology that defines the structure and behavior of keynets. A keynet ontology consists of three parts:
The reason for splitting concepts into the schema and lexicon levels is pragmatic. While there will generally be only about 100 conceptual categories and even fewer link types, there will typically be on the order of 100,000 lexical terms. By defining links only at the schema level, where there are fewer conceptual categories to deal with, it is easier to understand the ontology.
One commonly used IR mechanism is the use of attributes to identify documents. For example, the names of authors or words from the title. These are represented in the ontology with ``Author'' and ``Title'' concept categories. The lexicon would have the names of authors and (key)words from titles (or ranges of these if the number of authors or title words was too large). Ranges can also be used to express ``wild card'' matches.
The notion of a keynet ontology is the basis for the concept of a keynet. A keynet conforming to a keynet ontology consists of the following:
For example, the following keynet would be used for a content label of the document ``High-performance, distributed information retrieval,'' by John Smith and Jane Doe:
The examples given so far have not made use of any edges to link
conceptual categories.
In Figure 2, there is an example
of a keynet conforming to the UMLS ontology.
This keynet can be expressed in English in several ways. Its expression as a query using English automatically generated using stock phrases is ``Find all documents in which the plant potato has an acquired abnormality bunion.'' Its expression as a statement in a content label is ``The potato can exhibit an abnormality functionally similar to bunions.'' Finally a more succinct colloquial expression would be: ``Can potatoes get bunions?'' While this may seem to be a somewhat whimsical query, it makes perfectly good sense, and we will later discuss in section 6 how a KEYNET system would understand and respond to such a query.
Keynets are used not only for content labels and queries but also for index terms. However, we do not use arbitrary keynets as index terms because that would result in a ``combinatorial explosion'' of possibilities. Rather we restrict attention to a special kind of keynet. A fragment of a keynet is a connected subset of the keynet. In other words, a fragment consists of a choice of vertices, edges and lexical terms from a keynet such that the set of vertices and edges so chosen forms a connected directed graph. For example, this is a fragment of the keynet in Figure 2:
and here is another:
However, the keynet
is not a fragment because it is not connected.
The main result for fragments is the fact that the number of fragments having small dimension grows linearly with the size of the keynet from which they are taken.
For a proof of the theorem, see [BS94]. The bound in the theorem is actually a worst-case bound, and in practice one does much better, typically just 4 times the number of vertices of the keynet. A fragment having exactly one edge is called a clasp, while a fragment having exactly two edges is called a double clasp.
The reason for limiting fragments to be at most double clasps is not
just pragmatic.
It arises from a well-known result from Cognitive Science known as the
rule[Mil56].
This rule states that the maximum number of conceptual ``chunks'' that can
be manipulated by a person at one time is in the range 5 to 9 chunks.
A clasp consists of 3 to 5 parts (the two vertices, the edge and up
to two instances of the vertices), and a double clasp consists of 5 to 8
parts. Therefore limiting to just one vertex or a single clasp would
be too limiting, while a double clasp almost exactly matches the
rule.
Intuitively, retrieval in the KEYNET system proceeds by matching fragments of a query (called probes) with fragments of content labels (called index terms or index fragments). The degree of relevance is then determined by a similarity measure between the vector of probes and the vector of index terms. For this to scale up to large corpora, the number of index terms must not be too large. In particular, if we allowed arbitrary subsets of a content label to be index terms, then the number of index terms would grow exponentially in the size of the content label. The theorem assures us that the size of the index will be manageable.