Notes by Gene Cooperman, © 2009 (may be freely copied as long as this copyright notice remains)
For an introduction to hashing and its applications, see http://en.wikipedia.org/wiki/Hash_function. Unfortunately, the textbook does not provide a thorough treatment of hashing. This note includes an algorithmic overview that includes some material not found in the textbook.
Hash array
Hash index
Hash function
Key-Value pair
Hash occupancy ratio
Hash collision
Suppose one has an interpreted language:
> x = 3
> y = x + 1
> foo = x + y
> Print y
4
> bar = x + y
> baz = 2*foo
> Print foo
7
Naive way to store variable-value pairs (key-value pairs):
Variable | Value |
---|---|
x | 3 |
y | 4 |
foo | 7 |
bar | 7 |
baz | 14 |
How can one find the value of the variable "foo" in almost constant time (i.e., without having to search through the entire table for a match with the name "foo")?
Answer: Hash arrays.
The key is the variable name. (The key is a representation of the variable as a possibly very large integer corresponding to the bits of its ASCII representation.) The hash function is a function from the key to a hash index. (For example, if the hash array has N entries, and if the key is k, then we might choose a hash function h(k) = k mod N. The function h(k) = k^2 mod N is better. Wikipedia has a discussion of good hash functions.)
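As a small illustration (a sketch, not code from the notes; the helper names are invented), a variable name can be turned into a key integer from its ASCII bytes and then hashed with h(k) = k mod N:

```python
def key_as_int(name):
    """Treat the ASCII bytes of `name` as the digits of one big integer."""
    k = 0
    for ch in name:
        k = k * 256 + ord(ch)
    return k

def h(k, N):
    """The simple hash function h(k) = k mod N from the notes."""
    return k % N

N = 11  # a small hash array size, chosen only for illustration
print(key_as_int("foo"))     # → 6713199
print(h(key_as_int("foo"), N))  # → 9
```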
Given a hash array, we can do three things (and optionally a fourth):
Each time one encounters a variable corresponding to key k, one looks to see if the hash array at hash index h(k) is occupied. Let H[] be the hash array. If H[h(k)] is occupied, we can look at the location in the hash array and either retrieve the old value (example: Print foo ⇒ Print H[h(k)]) or set a new value (example: foo = 17 ⇒ H[h(k)] = 17).
The hash function is chosen as a pseudo-random function, h(). Pseudo-random means that h() is a deterministic function, but from all tests, it appears to be random. A good pseudo-random function will hash two keys k1 and k2 to hash indexes that are far apart.
But sometimes, we are unlucky. If we are unlucky, then h(k1) = h(k2). This is called a hash collision. In this case, we have a problem: at H[h(k1)], do we store the value for the key k1 or the value for the key k2?
We need a solution. Luckily, there are two traditional solutions.
At each hash array slot, we store a linked list. Each element of the linked list contains a key-value pair. More precisely, it contains three fields: 1. Key (for example, variable "foo", or the ASCII representation of the string) 2. Value (current value of the variable "foo") 3. Link pointer to next link entry
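The chaining scheme just described can be sketched as follows (a minimal illustration, not code from the notes; the class name is invented, and Python's built-in hash stands in for h(k)):

```python
class ChainedHash:
    """Hashing with chaining: each slot holds a linked list of link entries."""

    class Node:
        def __init__(self, key, value, link):
            self.key = key      # 1. Key (e.g. the variable name "foo")
            self.value = value  # 2. Value (current value of that variable)
            self.link = link    # 3. Link pointer to the next link entry

    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def _index(self, key):
        return hash(key) % len(self.slots)  # built-in hash as a stand-in for h(k)

    def set(self, key, value):
        i = self._index(key)
        node = self.slots[i]
        while node is not None:          # search the chain for an old entry
            if node.key == key:
                node.value = value
                return
            node = node.link
        # New key: prepend a link entry to the chain at this slot.
        self.slots[i] = self.Node(key, value, self.slots[i])

    def get(self, key):
        node = self.slots[self._index(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.link
        raise KeyError(key)
```

Even with collisions (several keys chained at one slot), `get` and `set` remain correct; only the chain length affects their speed.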
We can easily keep track of the number of values in the hash array. Increment the number of elements each time we add a new key to the hash array. The hash occupancy ratio is the number of occupied entries (keys) divided by the number of hash array slots (i.e., the number of hash indices).
As long as the hash occupancy ratio is small enough (and assuming we use a good pseudo-random hash function that doesn't hash many elements to the same hash index), the time to add a new key or access an old key is almost constant. This is a key property:
A well-designed hash array will not use more than nearly constant time on average to enter a new key-value pair, access an old key-value, or remove an old key-value pair.
If the hash occupancy ratio becomes too large, the time to enter or access a key grows. This will happen if we did not know in advance the total number of keys that we will encounter at run-time. The solution at this point is to create a new and larger hash array.
Set a policy that if the hash occupancy ratio grows too large, then we will create a new hash array that's twice as large. The new hash array will have its own new hash function. We then re-hash all old key-value pairs into the new hash array.
For example, if the hash occupancy ratio grows to 1.0 or higher, then we stop and create a new hash array that's twice as large. We re-hash all key-value pairs into the new hash array, and then delete the old hash array.
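The doubling policy can be sketched as follows (a sketch under the assumption of chaining, with Python lists standing in for the linked lists; the function name is invented):

```python
def maybe_grow(table, num_keys):
    """Doubling policy sketch: if the occupancy ratio reaches 1.0, re-hash
    every (key, value) pair into a new table twice as large.
    `table` is a list of slots; each slot is a list of (key, value) pairs."""
    if num_keys / len(table) < 1.0:      # occupancy ratio still acceptable
        return table
    new_table = [[] for _ in range(2 * len(table))]
    for slot in table:                   # re-hash all old key-value pairs
        for key, value in slot:
            new_table[hash(key) % len(new_table)].append((key, value))
    return new_table                     # the old table can now be deleted
```

Note that the new, larger table implicitly gets a new hash function, since the modulus changes from N to 2N.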
What is the cost of all this re-hashing? Is it too expensive? Suppose we double the hash size 100 times (and re-hash each time).
To get a handle on this, look at the latest hash array, with 2^100 N elements. Half of those elements, (1/2)·2^100 N, were added as new elements in Stage 100. The other half were added in an earlier stage. So, half of all the elements have time cost 1, 1/4 of all the elements have time cost 2, 1/8 of all the elements have time cost 3, and so on.
We could estimate the cost of an average element this way, but there is a better way to analyze it. It requires amortized analysis.
Amortized analysis means that instead of charging the time cost of the current operation to the current element, we charge the time cost to some other object or operation.
In this case, we charge the cost of hashing not to the element that was hashed, but to the particular hash array. The stage 100 hash array has 2^100 N elements when it reaches hash occupancy ratio 1.0. The previous stage 99 hash array had 2^99 N elements when it reached hash occupancy ratio 1.0. And so on.
So, the total cost of hashing (or re-hashing) new keys is:
2^100 N + 2^99 N + 2^98 N + … + N
The total is:
2^100 N · (1 + 1/2 + 1/4 + 1/8 + … + 1/2^100) ≈ 2 · 2^100 N
Since there are 2^100 N keys and the time cost of hashing them all is approximately 2 · 2^100 N, the average cost of hashing a new key is only twice as much as if we had magically guessed the final size of the hash array that we needed and had never done any re-hashing.
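The geometric sum above can be checked numerically (a quick sanity check, not part of the original notes), measuring cost in units of N:

```python
# Total re-hashing cost in units of N: 2^100 + 2^99 + ... + 1.
total = sum(2**s for s in range(101))

# The amortized argument says this is at most twice the final size 2^100.
bound = 2 * 2**100
print(total == 2**101 - 1)   # → True (a finite geometric series)
print(total < bound)         # → True
```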
We can avoid the space for the linked list pointer in the previous solution if we store key-value pairs in the hash array slot itself. This is also a little easier to implement.
In this case, if we have a hash collision, then we need a secondary hash function to find a different, alternative and unoccupied hash index. More generally, we need a probe sequence, or sequence of hash indices for probe attempts 0, 1, 2, ….
Let h(k,i) be the hash function depending on key k and probe attempt number i. There are three common suggestions for probe sequences:
1. h(k,i) = (h1(k) + i) mod N (linear probing)
2. h(k,i) = (h1(k) + c1·i + c2·i^2) mod N (quadratic probing)
3. h(k,i) = (h1(k) + i·h2(k)) mod N (double hashing)
The sequence of hash indexes h(k,0), h(k,1), …, h(k,r) is known as a hash probe sequence of length r.
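The three probe sequences can be written down directly (a sketch; the auxiliary functions h1 and h2 and the constants c1, c2 are arbitrary choices for illustration, not taken from the notes):

```python
def h1(k):
    return k % 1000003          # an arbitrary primary hash function

def h2(k):
    return 1 + (k % 999999)     # never 0, so double hashing visits every slot

def linear_probe(k, i, N):
    """h(k,i) = (h1(k) + i) mod N"""
    return (h1(k) + i) % N

def quadratic_probe(k, i, N, c1=1, c2=3):
    """h(k,i) = (h1(k) + c1*i + c2*i^2) mod N"""
    return (h1(k) + c1 * i + c2 * i * i) % N

def double_hash_probe(k, i, N):
    """h(k,i) = (h1(k) + i*h2(k)) mod N"""
    return (h1(k) + i * h2(k)) % N
```

For a fixed key k, each function generates the hash probe sequence h(k,0), h(k,1), … as i runs through 0, 1, 2, ….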
The first probing technique is easiest to implement. Each successive technique is more efficient. The problem with linear probing is that if there is a long run (perhaps 100 occupied slots in a row), then upon adding a new key, a hash collision with any of those 100 slots will cause the 101-st slot to become occupied. So, a run of 100 occupied hash slots has ten times the probability that a new hash key will occupy the 101-st slot, as compared to occupying the 11-th slot of a run of length 10.
In open addressing, the hash occupancy ratio can never be more than 1.0. It is most efficient when the hash occupancy ratio is much less than 1.0. When the hash occupancy ratio becomes too large, one can re-hash into a larger hash array, as discussed earlier.
Unlike hashing with chaining, when using open addressing, there is no simple way to remove keys from a hash array. Instead, one marks such a key as UNUSED. If one needs to remove many keys, then one is better off with hashing with chaining.
If we need to delete key-value pairs, then we create a special key, UNUSED. In order to delete a key-value pair, we replace the key by UNUSED. A slot with the key UNUSED is still occupied. So, once a slot becomes occupied, it will remain occupied forever. Think about why this last statement is necessary for correct operation of hash arrays.
Suppose we discover that a key k is new for a hash array (perhaps by probing until we found an unoccupied slot with no prior match on the hash key). Suppose we encountered an UNUSED hash slot during the probe sequence for k. Then we can re-use the UNUSED hash slot for the key k.
Theorem: The algorithm for removing key-value pairs in open addressing is correct.
Proof:
As noted before, once a hash slot becomes occupied, it can never later become unoccupied.
Hence, the length of a probe sequence for a key k before encountering an unoccupied slot can only grow. The length can never decrease.
If a hash key k was first entered at the i-th probe (i.e., if key k was entered at hash index h(k,i)), then the length of the probe sequence for k will continue to be at least i. So, k will continue to be found as long as it is present.
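The removal scheme with UNUSED slots can be sketched as follows (a simplified illustration using linear probing; the class and marker names are invented, and Python's built-in hash stands in for h1):

```python
EMPTY = None        # a slot that has never been occupied
UNUSED = object()   # marker for a deleted key; the slot stays occupied

class OpenAddressHash:
    """Open addressing with linear probing and UNUSED markers (a sketch)."""

    def __init__(self, N):
        self.keys = [EMPTY] * N
        self.values = [None] * N

    def _probe(self, key, i):
        return (hash(key) + i) % len(self.keys)   # linear probing for simplicity

    def set(self, key, value):
        first_unused = None
        for i in range(len(self.keys)):
            j = self._probe(key, i)
            if self.keys[j] == key:               # old key: overwrite its value
                self.values[j] = value
                return
            if self.keys[j] is UNUSED and first_unused is None:
                first_unused = j                  # remember, but keep probing
            if self.keys[j] is EMPTY:             # key is new: insert it,
                j = first_unused if first_unused is not None else j
                self.keys[j], self.values[j] = key, value  # re-using UNUSED
                return
        if first_unused is not None:              # no EMPTY slot, but an
            self.keys[first_unused] = key         # UNUSED one can be re-used
            self.values[first_unused] = value
            return
        raise RuntimeError("hash array is full")

    def get(self, key):
        for i in range(len(self.keys)):
            j = self._probe(key, i)
            if self.keys[j] == key:
                return self.values[j]
            if self.keys[j] is EMPTY:             # unoccupied: key not present
                break
        raise KeyError(key)

    def remove(self, key):
        for i in range(len(self.keys)):
            j = self._probe(key, i)
            if self.keys[j] == key:
                self.keys[j] = UNUSED             # the slot remains occupied
                self.values[j] = None
                return
            if self.keys[j] is EMPTY:
                break
        raise KeyError(key)
```

Note that `get` stops probing only at an EMPTY slot, never at an UNUSED one; this is exactly why a slot, once occupied, must remain occupied forever.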
A perfect hash function is one that has no hash collisions. If we have a perfect hash function, we need only to store the value in the key-value pair. Not storing the key saves on storage.
In some applications, we do not have a perfect hash function, but we still want to save storage by saving only the value. Often we care only if the key is present, but we don't care about the value. So, a value of 1 indicates that the key is present, and the special value of 0 indicates that the entire key-value pair is unoccupied (not present).
In this case, we pretend that the hash function is perfect (even though it's not). When there is a hash collision, we make an error. We accept those errors, if they are not too frequent. So, if the hash array says that a key-value pair is unoccupied (not present), then we know that it really is not present. If the hash array says that a key-value pair is occupied (present), then probably it is occupied, but it might be unoccupied, in which case, we have a hash-collision with a different occupied hash key.
For an interesting application of this idea, which is often used in formal verification, see Bloom filters on Wikipedia.
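The presence-only scheme can be sketched in a few lines (a simplified, single-hash-function cousin of a Bloom filter, which would use several hash functions; the class name is invented):

```python
class PresenceTable:
    """Store only one bit per slot: 1 = probably present, 0 = definitely absent."""

    def __init__(self, N):
        self.bits = [0] * N

    def add(self, key):
        self.bits[hash(key) % len(self.bits)] = 1

    def maybe_contains(self, key):
        # A 0 bit means the key is definitely not present.  A 1 bit means
        # it is probably present, but a hash collision with a different
        # occupied key can cause a false positive.
        return self.bits[hash(key) % len(self.bits)] == 1
```

Nothing but the bit is stored, so neither the key nor the value can be recovered; the structure answers only (approximate) membership queries.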