Next: Searching. Up: Index Structure. Previous: Index Structure.

Data Structure.

The overall structure of the index is a hash table, with each component being called a bucket. The buckets, in turn, have the structure of a cache, a structure not often used in an index. For this reason, we call our index structure a hash-cache index.

Each probe is converted to a hash index using a standard hashing algorithm as in Knuth[section 6.4]Knuth73. For a probe p, let h(p) be its hash value, a string of exactly n bits. We write f(k,p) for the first k bits of h(p), where . Similarly, write l(k,p) for the last k bits of h(p). Write w(p) for the weight of the probe p. Depending on the machine architecture, there will either be a fixed number of buckets in the table or the number of buckets can increase as needed. When the number of buckets can vary, we employ the dynamic hashing scheme due to Litwin[Lit80] which he has recently generalized to the case of distributed systems[LNS93]. When the number of buckets is fixed, the number is a power of two, say . The bucket for a probe p is then determined by f(m,p). The rest of the hash value, l(n-m,p), is then used for searching within the bucket. The same technique holds for the case of a variable number of buckets except that m is not necessarily the same for every bucket.

Within each bucket data is arranged as a tree, where l(n-m,p) is the index. The tree must be balanced for a parallel machine architecture. Note that probes are not represented in the tree structure. One only has (parts of) hash values and document identifiers. For this to work, it is necessary for the number of bits n in the hash value to be large enough so that collisions will be very unlikely. Unlike traditional hashing algorithms, KEYNET can tolerate occasional collisions because of the redundancy inherent in using large numbers of probes. This optimization results in a significant reduction in the overall size of the index since (partial) hash values are much smaller than probes in general. It also improves the speed of indexing by replacing relatively costly string comparisons with fixed-size integer comparisons. For more detail about the probability model underlying this optimization, see the Appendix.



Next: Searching. Up: Index Structure. Previous: Index Structure.


kenb@ccs.neu.edu
Fri Jan 20 22:04:00 EST 1995