Hash Arrays (from Chapter 1.5.0 - 1.5.1)

Notes by Gene Cooperman, © 2009 (may be freely copied as long as this copyright notice remains)

For an introduction to hashing and its applications, see http://en.wikipedia.org/wiki/Hash_function. Unfortunately, the textbook does not provide a thorough treatment of hashing. This note provides an algorithmic overview, including some material not found in the textbook.

Terminology

Hash array
Hash index
Hash function
Key-Value pair
Hash occupancy ratio
Hash collision

Example

One has an interpreted language:
> x = 3
> y = x + 1
> foo = x + y
> Print y
    4
> bar = x + y
> baz = 2*foo
> Print foo
    7

Naive way to store variable-value pairs (key-value pairs):

Variable   Value
x          3
y          4
foo        7
bar        7
baz        14

How can one find the value of the variable "foo" in almost constant time (i.e., without having to search through the entire table for a match with the name "foo")?

Answer: Hash arrays.
The key is the variable name. (The key is represented as a possibly very large integer corresponding to the bits of the variable name's ASCII representation.)
The hash function is a function from the key to a hash index. (For example, if the hash array has N entries, and if the key is k, then we might choose the hash function h(k) = k mod N. The function h(k) = k^2 mod N is better. Wikipedia has a discussion of good hash functions.)
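
As a concrete sketch of these definitions in Python (the helper name key_to_int and the array size N = 101 are our illustrative choices, not fixed by the notes):

    # Convert a variable name to a (possibly very large) integer,
    # using the bits of its ASCII representation.
    def key_to_int(name):
        k = 0
        for ch in name:
            k = k * 256 + ord(ch)    # append 8 bits per ASCII character
        return k

    # The hash function: map an integer key to a hash index in [0, N).
    N = 101                          # number of entries in the hash array
    def h(k):
        return (k * k) % N           # h(k) = k^2 mod N, as suggested above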

Given a hash array, we can do three things (and optionally a fourth):

  1. Enter a new key in the hash table
  2. Read the value of an old key in the hash table
  3. Modify the value of an old key in the hash table
  4. [Optional] Remove a key from the hash table

Each time one encounters a variable corresponding to key k, one looks to see if the hash array at hash index h(k) is occupied. Let H[] be the hash array. If H[h(k)] is occupied, we can look at the location in the hash array and either retrieve the old value (example: Print foo becomes Print H[h(k)]) or set a new value (example: foo = 17 becomes H[h(k)] = 17).
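
As a minimal sketch (in Python, ignoring hash collisions until the next section), these operations might look as follows; set_var and get_var are our illustrative names, and the sketch reuses key_to_int, h, and N from the sketch above:

    H = [None] * N                   # the hash array; None marks an unoccupied slot

    def set_var(name, value):        # foo = 17   becomes  H[h(k)] = 17
        H[h(key_to_int(name))] = value

    def get_var(name):               # Print foo  becomes  Print H[h(k)]
        return H[h(key_to_int(name))]

    set_var("foo", 17)
    print(get_var("foo"))            # prints 17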

Hash collisions

The hash function is chosen as a pseudo-random function, h(). Pseudo-random means that h() is a deterministic function, but from all tests, it appears to be random. A good pseudo-random function will hash two keys k1 and k2 to hash indexes that are far apart.

But sometimes, we are unlucky. If we are unlucky, then h(k1) = h(k2). This is called a hash collision. In this case, we have a problem. At H[h(k1)], do we store the value for the key k1 or the value for the key k2?

We need a solution. Luckily, there are two traditional solutions.

Solution 1: Linked Lists (Hashing with Chaining)

At each hash array slot, we store a linked list. Each element of the linked list contains a key-value pair. More precisely, it contains three fields:

  1. Key (for example, the variable "foo", or the ASCII representation of the string)
  2. Value (the current value of the variable "foo")
  3. Link pointer to the next link entry
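
Here is a minimal Python sketch of hashing with chaining; the class and method names are ours, and Python's built-in hash() stands in for the hash function h(k):

    class Node:                      # one element of the linked list
        def __init__(self, key, value, link):
            self.key = key           # field 1: the key (e.g., "foo")
            self.value = value       # field 2: current value of the variable
            self.link = link         # field 3: pointer to the next link entry

    class ChainedHash:
        def __init__(self, n_slots):
            self.slots = [None] * n_slots    # one linked list per hash index
            self.n_keys = 0                  # for the hash occupancy ratio

        def _index(self, key):
            return hash(key) % len(self.slots)

        def set(self, key, value):   # enter a new key, or modify an old one
            node = self.slots[self._index(key)]
            while node is not None:
                if node.key == key:          # old key: modify its value
                    node.value = value
                    return
                node = node.link
            i = self._index(key)             # new key: prepend a new node
            self.slots[i] = Node(key, value, self.slots[i])
            self.n_keys += 1

        def get(self, key):          # read the value of an old key
            node = self.slots[self._index(key)]
            while node is not None:
                if node.key == key:
                    return node.value
                node = node.link
            raise KeyError(key)

    table = ChainedHash(8)
    table.set("foo", 7)
    print(table.get("foo"))          # prints 7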

Hash occupancy ratio:

We can easily keep track of the number of keys in the hash array: increment a counter each time we add a new key to the hash array. The hash occupancy ratio is the number of keys stored divided by the number of hash array entries (i.e., the number of hash indices).

As long as the hash occupancy ratio is small enough (and assuming we use a good pseudo-random hash function that doesn't hash many elements to the same hash index), the time to add a new key or access an old key is almost constant. This is a key property:

A well-designed hash array will not use more than nearly constant time on average to enter a new key-value pair, access an old key-value pair, or remove an old key-value pair.

If the hash occupancy ratio becomes too large, the time to enter or access a key grows. This will happen if we did not know in advance the total number of keys that we will encounter at run-time. The solution at this point is to create a new and larger hash array.

Example of re-hashing

Set a policy that if the hash occupancy ratio grows too large, then we will create a new hash array that's twice as large. The new hash array will have its own new hash function. We then re-hash all old key-value pairs into the new hash array.

For example, if the hash occupancy ratio grows to 1.0 or higher, then we stop and create a new hash array that's twice as large. We re-hash all key-value pairs into the new hash array, and then delete the old hash array.
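
Continuing the chaining sketch above, this policy might look as follows in Python (the 1.0 threshold and the doubling factor come from the text; the function names are ours):

    def occupancy_ratio(table):
        return table.n_keys / len(table.slots)

    def maybe_grow(table):           # call after each insertion
        if occupancy_ratio(table) >= 1.0:
            # New array, twice as large.  Since the index is computed as
            # hash(key) % len(slots), the new array effectively has its
            # own new hash function.
            new_table = ChainedHash(2 * len(table.slots))
            for node in table.slots:         # re-hash all old key-value pairs
                while node is not None:
                    new_table.set(node.key, node.value)
                    node = node.link
            table.slots = new_table.slots    # adopt the new, larger array
            table.n_keys = new_table.n_keys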

Amortized Analysis

What is the cost of all this re-hashing? Is it too expensive? Suppose we double the hash array size 100 times (and re-hash each time).

To get a handle on this, look at the latest hash array, with 2^100 N elements. Half of those elements, (1/2) 2^100 N, were added as new elements in Stage 100. The other half, (1/2) 2^100 N, were added in an earlier stage and re-hashed into the Stage 100 array. So, half of all the elements have time cost 1, 1/4 of all the elements have time cost 2, 1/8 of all the elements have time cost 3, and so on.

We could estimate the cost of an average element this way, but there is a better way to analyze it. It requires amortized analysis.

Amortized analysis means that instead of charging the time cost of the current operation to the current element, we charge the time cost to some other object or operation.

In this case, we charge the cost of hashing not to the element that was hashed, but to the particular hash array. The Stage 100 hash array has 2^100 N elements when it reaches hash occupancy ratio 1.0. The previous Stage 99 hash array had 2^99 N elements when it reached hash occupancy ratio 1.0. And so on.

So, the total cost of hashing (or re-hashing) new keys is:
    2^100 N + 2^99 N + 2^98 N + … + N
The total is:
    2^100 N (1 + 1/2 + 1/4 + 1/8 + … + 1/2^100)
    ≈ 2 · 2^100 N

Since there are 2^100 N keys and the time cost of hashing them all is about 2 · 2^100 N, the average cost of hashing a new key is only twice as much as if we had magically guessed the final size of the hash array that we needed and had never done any re-hashing.
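
This arithmetic is easy to check numerically; here is a small Python sketch (with the 100 doublings scaled down to 20 so it runs instantly):

    # Total insertions, counting re-hashes: each Stage-s array holds 2^s * N
    # keys when it reaches occupancy ratio 1.0 and must be re-hashed.
    N, stages = 4, 20
    total_work = sum((2 ** s) * N for s in range(stages + 1))
    final_keys = (2 ** stages) * N
    print(total_work / final_keys)   # prints about 2.0: only twice the cost
                                     # of hashing each key exactly once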

Solution 2: In-Line Hashing (Hashing with Open Addressing)

We can avoid the space for the linked list pointer in the previous solution if we store key-value pairs in the hash array slot itself. This is also a little easier to implement.

In this case, if we have a hash collision, then we need a secondary hash function to find a different, unoccupied hash index. More generally, we need a probe sequence: a sequence of hash indices for probe attempts 0, 1, 2, ….

Let h(k,i) be the hash function depending on key k and probe attempt number i. There are three common suggestions for probe sequences:

  1. Linear probing: h(k,i) = (h1(k) + i) mod N
  2. Quadratic probing: h(k,i) = (h1(k) + c1 i + c2 i^2) mod N
  3. Double hashing: h(k,i) = (h1(k) + i h2(k)) mod N

The sequence of hash indexes h(k,0), h(k,1), …, h(k,r) is known as a hash probe sequence of length r.
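
As a sketch, the three probe sequences can be written in Python as follows; h1 and h2 stand for two ordinary hash functions, and the constants c1 and c2 are arbitrary illustrative choices:

    def linear_probe(h1, k, i, N):
        return (h1(k) + i) % N

    def quadratic_probe(h1, k, i, N, c1=1, c2=1):
        return (h1(k) + c1 * i + c2 * i * i) % N

    def double_hash_probe(h1, h2, k, i, N):
        # h2(k) should be nonzero (and ideally relatively prime to N),
        # so that the probe sequence visits many distinct slots.
        return (h1(k) + i * h2(k)) % N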

The first probing technique is the easiest to implement. Each successive technique is more efficient. The problem with linear probing is that if there is a long run (perhaps 100 occupied slots in a row), then upon adding a new key, a hash collision with any of those 100 slots will cause the 101st slot to become occupied. So, a run of 100 occupied hash slots is ten times as likely to capture a new hash key in its 101st slot as a run of length 10 is to capture one in its 11th slot.

In open addressing, the hash occupancy ratio can never be more than 1.0. It is most efficient when the hash occupancy ratio is much less than 1.0. When the hash occupancy ratio becomes too large, one can re-hash into a larger hash array, as discussed earlier.

Unlike hashing with chaining, when using open addressing there is no simple way to remove keys from a hash array. Instead, one marks such a key with a special value, UNUSED (described below). If one needs to remove many keys, then one is better off with hashing with chaining.

Deleting key-value pairs using Open Addressing

If we need to delete key-value pairs, then we create a special key, UNUSED. In order to delete a key-value pair, we replace the key by UNUSED. A slot with the key UNUSED is still occupied. So, once a slot becomes occupied, it will remain occupied forever. Think about why this last statement is necessary for correct operation of hash arrays.

Suppose we discover that a key k is new for a hash array (perhaps by probing until we find an unoccupied slot, with no prior match on the hash key). Suppose we encountered an UNUSED hash slot during the probe sequence for k. Then we can re-use that UNUSED hash slot for the key k.
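
Here is a sketch of open addressing with this deletion scheme, in Python with linear probing; the names OpenHash and EMPTY are ours (EMPTY marks a slot that was never occupied, while UNUSED is the special deleted key from the text):

    EMPTY, UNUSED = object(), object()

    class OpenHash:
        def __init__(self, n_slots):
            self.keys = [EMPTY] * n_slots
            self.values = [None] * n_slots

        def _probe(self, key):       # linear probing: h(k,i) = (h1(k) + i) mod N
            N = len(self.keys)
            start = hash(key) % N
            for i in range(N):
                yield (start + i) % N

        def set(self, key, value):
            first_unused = None
            for j in self._probe(key):
                slot = self.keys[j]
                if slot is EMPTY:            # key is new to the hash array
                    if first_unused is not None:
                        j = first_unused     # re-use an UNUSED slot seen earlier
                    self.keys[j], self.values[j] = key, value
                    return
                if slot is UNUSED:
                    if first_unused is None:
                        first_unused = j     # remember it, but keep probing
                elif slot == key:            # old key: modify its value
                    self.values[j] = value
                    return
            if first_unused is not None:     # full, except for UNUSED slots
                self.keys[first_unused] = key
                self.values[first_unused] = value
                return
            raise RuntimeError("hash array is full")

        def get(self, key):
            for j in self._probe(key):
                slot = self.keys[j]
                if slot is EMPTY:            # unoccupied slot: key not present
                    raise KeyError(key)
                if slot is not UNUSED and slot == key:
                    return self.values[j]
            raise KeyError(key)

        def delete(self, key):       # the slot stays occupied forever
            for j in self._probe(key):
                slot = self.keys[j]
                if slot is EMPTY:
                    raise KeyError(key)
                if slot is not UNUSED and slot == key:
                    self.keys[j] = UNUSED
                    self.values[j] = None
                    return
            raise KeyError(key)

    t = OpenHash(8)
    t.set("foo", 7)
    t.delete("foo")
    t.set("bar", 1)                  # may re-use foo's UNUSED slot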

Theorem: The algorithm for removing key-value pairs in open addressing is correct.

Proof:

As noted before, once a hash slot becomes occupied, it can never later become unoccupied.
Hence, the length of a probe sequence for a key k before encountering an unoccupied slot can only grow. The length can never decrease.
If a hash key k was first entered at the i-th probe (if key k was entered at hash index h(k,i)), then the length of the probe sequence for k will continue to be at least i. So, k will continue to be found as long as it is present.

Perfect Hash Functions

A perfect hash function is one that has no hash collisions. If we have a perfect hash function, we need only to store the value in the key-value pair. Not storing the key saves on storage.

In some applications, we do not have a perfect hash function, but we still want to save storage by storing only the value. Often we care only whether the key is present, and we don't care about the value. So, a value of 1 indicates that the key is present, and the special value of 0 indicates that the entire key-value pair is unoccupied (not present).

In this case, we pretend that the hash function is perfect (even though it's not). When there is a hash collision, we make an error. We accept those errors if they are not too frequent. So, if the hash array says that a key-value pair is unoccupied (not present), then we know that it really is not present. If the hash array says that a key-value pair is occupied (present), then probably it is occupied, but it might be unoccupied, in which case we have a hash collision with a different occupied hash key.
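
A sketch of this membership test in Python (a one-hash special case of the Bloom filters mentioned below; the bit-array size and the function names are ours):

    # A 1 bit means "key probably present"; a 0 bit means "definitely absent".
    N = 1 << 20
    bits = bytearray(N)              # N slots, each holding 0 or 1; no keys stored

    def add(key):
        bits[hash(key) % N] = 1

    def maybe_present(key):
        # A False answer is always correct.  A True answer may rarely be
        # wrong, due to a hash collision with a different, present key.
        return bits[hash(key) % N] == 1

    add("foo")
    print(maybe_present("foo"))      # True
    print(maybe_present("qux"))      # almost certainly False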

For an interesting application of this idea, which is often used in formal verification, see Bloom filters on Wikipedia.