Both queries and content labels are represented using a data structure
called a *keynet*.
Intuitively, a keynet is semantically intermediate between a keyword
and a semantic network.
We build up the formal definition of a keynet in stages.
The mathematical basis for the `KEYNET` model is first presented.
Then the notion of a keynet ontology is defined, and it is used as the basis
for defining the concept of a keynet.
The keynet structure is used as the data structure for content labels, queries,
index terms, etc.

The underlying mathematical structure on which keynets are based
is the *directed graph*.
See [Section 5.4]
for the basic
definitions.
A directed graph consists of a set of vertices and a set of (directed) edges
that link one vertex with another (possibly the same) vertex.
Vertices and edges will generally have additional structure such as textual
labels, informal explanations, etc.
Edges, in particular, will be labeled with *type* information.
When drawing a directed graph, one uses boxes for vertices and solid
arrows for edges, each of which is labeled with some of the
additional structure belonging to it.
The *dimension* of a directed graph is the number of edges it has.
A directed graph is *connected* if every pair of vertices can be
linked by a sequence of edges (not necessarily all going in the same
direction).
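These definitions can be sketched directly in code. The following is a minimal illustration, not part of the `KEYNET` system itself; the class and method names are invented:

```python
# Illustrative sketch of a directed graph with typed edges, together
# with the "dimension" and "connected" notions defined above.
from collections import defaultdict

class Graph:
    def __init__(self):
        self.vertices = set()
        self.edges = []  # (source, link_type, destination) triples

    def add_edge(self, src, link_type, dst):
        self.vertices.update((src, dst))
        self.edges.append((src, link_type, dst))

    def dimension(self):
        # The dimension of a directed graph is its number of edges.
        return len(self.edges)

    def is_connected(self):
        # Every pair of vertices must be linked by a sequence of edges,
        # not necessarily all going in the same direction, so we ignore
        # edge direction when traversing.
        if not self.vertices:
            return True
        neighbors = defaultdict(set)
        for src, _, dst in self.edges:
            neighbors[src].add(dst)
            neighbors[dst].add(src)
        seen, stack = set(), [next(iter(self.vertices))]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(neighbors[v] - seen)
        return seen == self.vertices
```

For example, a graph with the single edge Plant--ISA-->Organism has dimension 1 and is connected.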

As we mentioned earlier, a `KEYNET` system requires a subject-specific
knowledge model or ontology.
The word ontology literally means ``a branch of metaphysics relating to
the nature and relations of being.''
Our use of the word is much more restrictive, dealing only with the
nature of, and relationships among, concepts within a narrow subject area.
Attempts to specify ontologies for scientific disciplines are very common,
with most disciplines by now having some kind of subject classification
scheme.

The `KEYNET` system depends on having a background ontology that defines
the structure and behavior of keynets.
A *keynet ontology* consists of three parts:

- A directed graph called the *schema*.
  - The vertices can be regarded as the conceptual categories of the ontology. They consist primarily of the subject classification categories for the subject covered by the corpus. However, a vertex can also be a property or attribute of an information object, such as its author, its year of publication, etc.
  - The links of the schema join conceptual categories. They are grouped into *link types*. The links in one link type share a common semantics. The best-known example of a link type is the ISA link type, which is used to define the concept hierarchy for a subject classification. Other link types include ``part of,'' ``elaboration of,'' ``cause of,'' etc.

- A set of terms or concepts, called the *lexicon* or *thesaurus*. One can think of the terms either as lexical terms, and hence the *keywords* of the ontology, or on a higher semantic level as concepts (each of which can be expressed in many ways, by words or phrases). While a lexical term is often a word or phrase, it can also be a person's name, a number, or a range of names or numbers.
- A many-to-many relation from the lexicon to the set of conceptual categories that specifies which category (or categories) a lexical term specializes or instantiates. Each lexical term is required to be an instance of at least one conceptual category.

The reason for splitting concepts into the schema and lexicon levels is pragmatic. While there will generally be only about 100 conceptual categories and even fewer link types, there will typically be on the order of 100,000 lexical terms. By defining links only at the schema level, where there are fewer conceptual categories to deal with, it is easier to understand the ontology.
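The three parts of a keynet ontology can be sketched concretely as follows. The categories, link types, and terms here are invented for illustration (in particular the ``exhibits'' link type is an assumption, not taken from any actual ontology):

```python
# Illustrative sketch of a keynet ontology (all names invented).

# Schema: a directed graph whose vertices are conceptual categories
# and whose typed links join categories.
schema_vertices = {"Plant", "Abnormality", "Author", "Title"}
schema_links = {("Plant", "exhibits", "Abnormality")}  # invented link type

# Lexicon: the terms (keywords or concepts) of the ontology.
lexicon = {"potato", "bunion", "John Smith"}

# Many-to-many relation from lexical terms to the conceptual
# categories they specialize or instantiate.
instantiates = {
    "potato": {"Plant"},
    "bunion": {"Abnormality"},
    "John Smith": {"Author"},
}

# Every lexical term must instantiate at least one conceptual category,
# and only categories that appear in the schema.
assert all(instantiates.get(term) for term in lexicon)
assert all(cats <= schema_vertices for cats in instantiates.values())
```

Note how the split shows up in the sizes: the schema stays small (here four categories, one link type) while the lexicon is the part that would grow to the order of 100,000 terms.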

One commonly used IR mechanism is the use of attributes, such as the names of authors or words from the title, to identify documents. These are represented in the ontology with ``Author'' and ``Title'' conceptual categories. The lexicon would contain the names of authors and (key)words from titles (or ranges of these if the number of authors or title words is too large). Ranges can also be used to express ``wild card'' matches.

The notion of a keynet ontology is the basis for the concept of a keynet.
A keynet *conforming* to a keynet ontology consists of the
following:

- A directed graph.
  - The vertices of the directed graph are copies of vertices of the ontology. There can be several copies in a keynet of one vertex in the ontology.
  - The edges of the directed graph are copies of edges in the ontology. Each edge of the keynet must link copies of the source and destination vertices of the corresponding edge in the ontology: the keynet must faithfully reflect the corresponding structure in the ontology.

- A set of lexical terms that are copies of lexical terms in the ontology.
- A one-to-one function from lexical terms to vertices. This function specifies the keynet vertex that is being instantiated by each lexical term. Unlike the ontology, where the relationship between lexical terms and categories is many-to-many, each category vertex in a keynet may be instantiated at most once by a lexical term, and every lexical term instantiates exactly one vertex.
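The conformance conditions can be checked mechanically. Here is a sketch with invented names, in which a keynet vertex is modelled as a (category, copy) pair so that one ontology vertex can be copied several times within a keynet:

```python
# Illustrative sketch of checking that a keynet conforms to an ontology.
ontology_edges = {("Plant", "exhibits", "Abnormality")}  # invented link type

keynet_vertices = {("Plant", 0), ("Abnormality", 0)}
keynet_edges = {(("Plant", 0), "exhibits", ("Abnormality", 0))}

# One-to-one function from lexical terms to the vertices they instantiate.
vertex_of_term = {"potato": ("Plant", 0), "bunion": ("Abnormality", 0)}

def conforms():
    # Every keynet edge must copy an ontology edge joining the matching
    # source and destination categories.
    edges_ok = all(
        (src[0], link, dst[0]) in ontology_edges
        for (src, link, dst) in keynet_edges
    )
    # The instantiation function must be one-to-one: each category
    # vertex is instantiated by at most one lexical term.
    targets = list(vertex_of_term.values())
    one_to_one = len(targets) == len(set(targets))
    in_range = set(targets) <= keynet_vertices
    return edges_ok and one_to_one and in_range

assert conforms()
```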

For example, the following keynet would be used for a content label of the document ``High-performance, distributed information retrieval,'' by John Smith and Jane Doe:
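The figure itself is not reproduced here, but the structure it depicts can be sketched in data form. The category names and copy indices below are assumptions for illustration; note that this simple content label instantiates attribute categories without linking any conceptual categories by edges:

```python
# Illustrative sketch of a content-label keynet for the example
# document (category names invented; no edges in this simple label).
keynet = {
    "vertices": [("Author", 0), ("Author", 1), ("Title", 0)],
    "edges": [],  # no conceptual categories are linked here
    # One-to-one function from lexical terms to the vertices they
    # instantiate; note the two copies of the "Author" category.
    "term_of": {
        "John Smith": ("Author", 0),
        "Jane Doe": ("Author", 1),
        "High-performance, distributed information retrieval": ("Title", 0),
    },
}
```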

The examples given so far have not made use of any edges to link conceptual categories. In Figure 2, there is an example of a keynet conforming to the UMLS ontology.

This keynet can be expressed in English in several ways.
Its expression as a query, using English generated automatically
from stock phrases, is
``Find all documents in which the plant potato has an
acquired abnormality bunion.''
Its expression as a statement in a content label is
``The potato can exhibit an abnormality functionally similar
to bunions.'' Finally, a more succinct colloquial expression would be:
``Can potatoes get bunions?''
While this may seem to be a somewhat whimsical query,
it makes perfectly good sense, and we will later discuss
in Section 6
how a `KEYNET` system would understand and respond to such a query.

Keynets are used not only for content labels and queries but also for
index terms. However, we do not use arbitrary keynets as index terms
because that would result in a ``combinatorial explosion'' of possibilities.
Rather we restrict attention to a special kind of keynet.
A *fragment* of a keynet is a connected subset of the keynet.
In other words, a fragment consists of a choice of vertices, edges and
lexical terms from a keynet such that the set of vertices and edges so chosen
forms a connected directed graph. For example, this is a
fragment of the keynet in Figure 2:

and here is another:

However, the keynet

is not a fragment because it is not connected.
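The connectedness test underlying the fragment definition can be sketched as follows; the vertex names are invented and link types are omitted for brevity:

```python
# Illustrative sketch: a chosen set of vertices and edges is a
# fragment only if it forms a connected subgraph (edge direction
# is ignored when testing connectivity).
from collections import defaultdict

def is_fragment(vertices, edges):
    if not vertices:
        return False
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v in vertices and v not in seen:
            seen.add(v)
            stack.extend(neighbors[v])
    return seen == set(vertices)

# Two edges sharing a vertex form a fragment ...
assert is_fragment({"a", "b", "c"}, {("a", "b"), ("b", "c")})
# ... but two disjoint edges do not.
assert not is_fragment({"a", "b", "c", "d"}, {("a", "b"), ("c", "d")})
```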

The main result for fragments is a theorem stating that the number of fragments of small dimension grows linearly with the size of the keynet from which they are taken.

For a proof of the theorem, see [BS94].
The bound in the theorem is actually a worst-case bound, and in
practice one does much better, typically just 4 times the number
of vertices of the keynet.
A fragment having exactly one edge is called a *clasp*,
while a fragment having exactly two edges is called a *double clasp*.
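Under these definitions, clasps and double clasps can be enumerated directly from a keynet's edge set. A sketch with invented edges (link types omitted):

```python
# Illustrative sketch: every single edge is a clasp; a double clasp is
# a pair of edges sharing a vertex (so that the pair is connected).
from itertools import combinations

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]

clasps = list(edges)

double_clasps = [
    (e1, e2)
    for e1, e2 in combinations(edges, 2)
    if set(e1) & set(e2)  # the two edges must share a vertex
]
```

On this small example there are 4 clasps and 5 double clasps, consistent with the claim that the counts stay close to linear in the keynet's size rather than exploding combinatorially.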

The reason for limiting fragments to be at most double clasps is not just pragmatic. It arises from a well-known result from cognitive science known as the ``seven, plus or minus two'' rule [Mil56]. This rule states that the maximum number of conceptual ``chunks'' that can be manipulated by a person at one time is in the range of 5 to 9 chunks. A clasp consists of 3 to 5 parts (the two vertices, the edge, and up to two instances of the vertices), and a double clasp consists of 5 to 8 parts. Therefore restricting attention to a single vertex or a single clasp would be too limiting, while a double clasp almost exactly matches the rule.

Intuitively, retrieval in the `KEYNET` system proceeds by matching fragments
of a query (called *probes*) with fragments of content labels
(called *index terms* or *index fragments*).
The degree of relevance is then determined by a similarity measure
between the vector of probes and the vector of index terms.
For this to scale up to large corpora,
the number of index terms must not be too large.
In particular, if we allowed arbitrary subsets of a content label to be
index terms, then the number of index terms would grow exponentially in
the size of the content label.
The theorem assures us that the size of the index will be manageable.
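As an illustrative sketch only (this section does not specify the actual `KEYNET` similarity measure, so a cosine-style measure over binary fragment vectors is assumed here, and the fragment identifiers are invented):

```python
# Illustrative sketch: relevance as cosine similarity between the
# binary vector of query probes and the binary vector of a content
# label's index fragments.
import math

def similarity(probes, index_terms):
    probes, index_terms = set(probes), set(index_terms)
    if not probes or not index_terms:
        return 0.0
    overlap = len(probes & index_terms)
    return overlap / math.sqrt(len(probes) * len(index_terms))

# A document whose content label shares one of the query's two probes
# scores 1/sqrt(2); a perfect match scores 1.0.
score = similarity({"potato-exhibits-bunion", "potato-ISA-plant"},
                   {"potato-exhibits-bunion"})
```

Because only clasps and double clasps serve as probes and index terms, both vectors stay small, which is what keeps this matching step tractable for large corpora.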
