The overall structure of the index is a hash table, with each component being called a bucket. The buckets, in turn, have the structure of a cache, a structure not often used in an index. For this reason, we call our index structure a hash-cache index.
Each probe is converted to a hash index using a standard hashing algorithm
as in Knuth[section 6.4]Knuth73.
For a probe p, let h(p) be its hash value,
a string of exactly n bits.
We write f(k,p) for the first k bits of h(p),
where .
Similarly, write l(k,p) for the last k bits of h(p).
Write w(p) for the weight of the probe p.
Depending on the machine architecture, there will
either be a fixed number of buckets in the table
or the number of buckets can increase as needed.
When the number of buckets can vary, we employ the dynamic hashing scheme
due to Litwin[Lit80] which he has recently generalized to
the case of distributed systems[LNS93].
When the number of buckets is fixed, the number is a power of two,
say
.
The bucket for a probe p is then determined by f(m,p).
The rest of the hash value, l(n-m,p),
is then used for searching within the bucket.
The same technique holds for the case of a variable number of buckets
except that m is not necessarily the same for every bucket.
Within each bucket data is arranged as a tree, where l(n-m,p) is the index. The tree must be balanced for a parallel machine architecture. Note that probes are not represented in the tree structure. One only has (parts of) hash values and document identifiers. For this to work, it is necessary for the number of bits n in the hash value to be large enough so that collisions will be very unlikely. Unlike traditional hashing algorithms, KEYNET can tolerate occasional collisions because of the redundancy inherent in using large numbers of probes. This optimization results in a significant reduction in the overall size of the index since (partial) hash values are much smaller than probes in general. It also improves the speed of indexing by replacing relatively costly string comparisons with fixed-size integer comparisons. For more detail about the probability model underlying this optimization, see the Appendix.