Notes by Gene Cooperman, © 2009 (may be freely copied as long as this copyright notice remains)
For an introduction to hashing and its applications, see http://en.wikipedia.org/wiki/Hash_function. Unfortunately, the textbook does not provide a thorough treatment of hashing. This note includes an algorithmic overview that includes some material not found in the textbook.
Hash array
Hash index
Hash function
Key-Value pair
Hash occupancy ratio
Hash collision
Suppose one has an interpreted language:
> x = 3
> y = x + 1
> foo = x + y
> Print y
4
> bar = x + y
> baz = 2*foo
> Print foo
7
Naive way to store variable-value pairs (key-value pairs):
Variable | Value |
---|---|
x | 3 |
y | 4 |
foo | 7 |
bar | 7 |
baz | 14 |
How can one find the value of the variable "foo" in almost constant time (i.e., without having to search through the entire table for a match with the name "foo")?
Answer: Hash arrays.
The key is the variable name. (The key is a representation of the variable as a possibly very large integer corresponding to the bits of its ASCII representation.) The hash function is a function from the key to a hash index. (For example, if the hash array has N entries, and if the key is k, then we might choose a hash function h(k) = k mod N. The function h(k) = k^2 mod N is better. Wikipedia has a discussion of good hash functions.)
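As a small illustration (a sketch, not code from the notes; the helper names are invented), a variable name can be turned into a key integer from its ASCII bytes and then hashed with h(k) = k mod N:

```python
def key_as_int(name):
    """Treat the ASCII bytes of `name` as the digits of one big integer."""
    k = 0
    for ch in name:
        k = k * 256 + ord(ch)
    return k

def h(k, N):
    """The simple hash function h(k) = k mod N from the notes."""
    return k % N

N = 11  # a small hash array size, chosen only for illustration
print(key_as_int("foo"))     # → 6713199
print(h(key_as_int("foo"), N))  # → 9
```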
Given a hash array, we can do three things (and optionally a fourth):
Each time one encounters a variable corresponding to key k, one looks to see if the hash array at hash index h(k) is occupied. Let H[] be the hash array. If H[h(k)] is occupied, we can look at the location in the hash array and either retrieve the old value (example: Print foo ⇒ Print H[h(k)]) or set a new value (example: foo = 17 ⇒ H[h(k)] = 17).
The hash function is chosen as a pseudo-random function, h(). Pseudo-random means that h() is a deterministic function, but from all tests, it appears to be random. A good pseudo-random function will hash two keys k1 and k2 to hash indexes that are far apart.
But sometimes, we are unlucky. If we are unlucky, then h(k1) = h(k2). This is called a hash collision. In this case, we have a problem: at H[h(k1)], do we store the value for the key k1 or the value for the key k2?
We need a solution. Luckily, there are two traditional solutions.
At each hash array slot, we store a linked list. Each element of the linked list contains a key-value pair. More precisely, it contains three fields: 1. Key (for example, variable "foo", or the ASCII representation of the string) 2. Value (current value of the variable "foo") 3. Link pointer to next link entry
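The chaining scheme just described can be sketched as follows (a minimal illustration, not code from the notes; the class name is invented, and Python's built-in hash stands in for h(k)):

```python
class ChainedHash:
    """Hashing with chaining: each slot holds a linked list of link entries."""

    class Node:
        def __init__(self, key, value, link):
            self.key = key      # 1. Key (e.g. the variable name "foo")
            self.value = value  # 2. Value (current value of that variable)
            self.link = link    # 3. Link pointer to the next link entry

    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def _index(self, key):
        return hash(key) % len(self.slots)  # built-in hash as a stand-in for h(k)

    def set(self, key, value):
        i = self._index(key)
        node = self.slots[i]
        while node is not None:          # search the chain for an old entry
            if node.key == key:
                node.value = value
                return
            node = node.link
        # New key: prepend a link entry to the chain at this slot.
        self.slots[i] = self.Node(key, value, self.slots[i])

    def get(self, key):
        node = self.slots[self._index(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.link
        raise KeyError(key)
```

Even with collisions (several keys chained at one slot), `get` and `set` remain correct; only the chain length affects their speed.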
We can easily keep track of the number of values in the hash array. Increment the number of elements each time we add a new key to the hash array. The hash occupancy ratio is the number of occupied entries (keys) divided by the number of hash array slots (i.e., the number of hash indices).
As long as the hash occupancy ratio is small enough (and assuming we use a good pseudo-random hash function that doesn't hash many elements to the same hash index), the time to add a new key or access an old key is almost constant. This is a key property:
A well-designed hash array will not use more than nearly constant time on average to enter a new key-value pair, access an old key-value, or remove an old key-value pair.
If the hash occupancy ratio becomes too large, the time to enter or access a key grows. This will happen if we did not know in advance the total number of keys that we will encounter at run-time. The solution at this point is to create a new and larger hash array.
Set a policy that if the hash occupancy ratio grows too large, then we will create a new hash array that's twice as large. The new hash array will have its own new hash function. We then re-hash all old key-value pairs into the new hash array.
For example, if the hash occupancy ratio grows to 1.0 or higher, then we stop and create a new hash array that's twice as large. We re-hash all key-value pairs into the new hash array, and then delete the old hash array.
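The doubling policy can be sketched as follows (a sketch under the assumption of chaining, with Python lists standing in for the linked lists; the function name is invented):

```python
def maybe_grow(table, num_keys):
    """Doubling policy sketch: if the occupancy ratio reaches 1.0, re-hash
    every (key, value) pair into a new table twice as large.
    `table` is a list of slots; each slot is a list of (key, value) pairs."""
    if num_keys / len(table) < 1.0:      # occupancy ratio still acceptable
        return table
    new_table = [[] for _ in range(2 * len(table))]
    for slot in table:                   # re-hash all old key-value pairs
        for key, value in slot:
            new_table[hash(key) % len(new_table)].append((key, value))
    return new_table                     # the old table can now be deleted
```

Note that the new, larger table implicitly gets a new hash function, since the modulus changes from N to 2N.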
What is the cost of all this re-hashing? Is it too expensive? Suppose we double the hash size 100 times (and re-hash each time).
To get a handle on this, look at the latest hash array, with 2^100 N elements. Half of those elements, (1/2)·2^100 N, were added as new elements in Stage 100. The other half were added in an earlier stage. So, half of all the elements have time cost 1, 1/4 of all the elements have time cost 2, 1/8 of all the elements have time cost 3, and so on.
We could estimate the cost of an average element this way, but there is a better way to analyze it. It requires amortized analysis.
Amortized analysis means that instead of charging the time cost of the current operation to the current element, we charge the time cost to some other object or operation.
In this case, we charge the cost of hashing not to the element that was hashed, but to the particular hash array. The stage 100 hash array has 2^100 N elements when it reaches hash occupancy ratio 1.0. The previous stage 99 hash array had 2^99 N elements when it reached hash occupancy ratio 1.0. And so on.
So, the total cost of hashing (or re-hashing) new keys is:
2^100 N + 2^99 N + 2^98 N + … + N
The total is:
2^100 N · (1 + 1/2 + 1/4 + 1/8 + … + 1/2^100) ≈ 2 · 2^100 N
Since there are 2^100 N keys and the time cost of hashing them all is approximately 2 · 2^100 N, the average cost of hashing a new key is only twice as much as if we had magically guessed the final size of the hash array that we needed and had never done any re-hashing.
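The geometric sum above can be checked numerically (a quick sanity check, not part of the original notes), measuring cost in units of N:

```python
# Total re-hashing cost in units of N: 2^100 + 2^99 + ... + 1.
total = sum(2**s for s in range(101))

# The amortized argument says this is at most twice the final size 2^100.
bound = 2 * 2**100
print(total == 2**101 - 1)   # → True (a finite geometric series)
print(total < bound)         # → True
```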
We can avoid the space for the linked list pointer in the previous solution if we store key-value pairs in the hash array slot itself. This is also a little easier to implement.
In this case, if we have a hash collision, then we need a secondary hash function to find a different, alternative and unoccupied hash index. More generally, we need a probe sequence, or sequence of hash indices for probe attempts 0, 1, 2, ….
Let h(k,i) be the hash function depending on key k and probe attempt number i. There are three common suggestions for probe sequences:
1. h(k,i) = (h1(k) + i) mod N (linear probing)
2. h(k,i) = (h1(k) + c1·i + c2·i^2) mod N (quadratic probing)
3. h(k,i) = (h1(k) + i·h2(k)) mod N (double hashing)
The sequence of hash indexes h(k,0), h(k,1), …, h(k,r) is known as a hash probe sequence of length r.
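The three probe sequences can be written down directly (a sketch; the auxiliary functions h1 and h2 and the constants c1, c2 are arbitrary choices for illustration, not taken from the notes):

```python
def h1(k):
    return k % 1000003          # an arbitrary primary hash function

def h2(k):
    return 1 + (k % 999999)     # never 0, so double hashing visits every slot

def linear_probe(k, i, N):
    """h(k,i) = (h1(k) + i) mod N"""
    return (h1(k) + i) % N

def quadratic_probe(k, i, N, c1=1, c2=3):
    """h(k,i) = (h1(k) + c1*i + c2*i^2) mod N"""
    return (h1(k) + c1 * i + c2 * i * i) % N

def double_hash_probe(k, i, N):
    """h(k,i) = (h1(k) + i*h2(k)) mod N"""
    return (h1(k) + i * h2(k)) % N
```

For a fixed key k, each function generates the hash probe sequence h(k,0), h(k,1), … as i runs through 0, 1, 2, ….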
The first probing technique is easiest to implement. Each successive technique is more efficient. The problem with linear probing is that if there is a long run (perhaps 100 occupied slots in a row), then upon adding a new key, a hash collision with any of those 100 slots will cause the 101-st slot to become occupied. So, a run of 100 occupied hash slots has ten times the probability that a new hash key will occupy the 101-st slot, as compared to occupying the 11-th slot of a run of length 10.
In open addressing, the hash occupancy ratio can never be more than 1.0. It is most efficient when the hash occupancy ratio is much less than 1.0. When the hash occupancy ratio becomes too large, one can re-hash into a larger hash array, as discussed earlier.
Unlike hashing with chaining, when using open addressing, there is no simple way to remove keys from a hash array. Instead, one marks such a key as UNUSED. If one needs to remove many keys, then one is better off with hashing with chaining.
If we need to delete key-value pairs, then we create a special key, UNUSED. In order to delete a key-value pair, we replace the key by UNUSED. A slot with the key UNUSED is still occupied. So, once a slot becomes occupied, it will remain occupied forever. Think about why this last statement is necessary for correct operation of hash arrays.
Suppose we discover that a key k is new for a hash array (perhaps by probing until we found an unoccupied slot with no prior match on the hash key). Suppose we encountered an UNUSED hash slot during the probe sequence for k. Then we can re-use the UNUSED hash slot for the key k.
Theorem: The algorithm for removing key-value pairs in open addressing is correct.
Proof:
As noted before, once a hash slot becomes occupied, it can never later become unoccupied.
Hence, the length of a probe sequence for a key k before encountering an unoccupied slot can only grow. The length can never decrease.
If a hash key k was first entered at the i-th probe (i.e., if key k was entered at hash index h(k,i)), then the length of the probe sequence for k will continue to be at least i. So, k will continue to be found as long as it is present.
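The removal scheme with UNUSED slots can be sketched as follows (a simplified illustration using linear probing; the class and marker names are invented, and Python's built-in hash stands in for h1):

```python
EMPTY = None        # a slot that has never been occupied
UNUSED = object()   # marker for a deleted key; the slot stays occupied

class OpenAddressHash:
    """Open addressing with linear probing and UNUSED markers (a sketch)."""

    def __init__(self, N):
        self.keys = [EMPTY] * N
        self.values = [None] * N

    def _probe(self, key, i):
        return (hash(key) + i) % len(self.keys)   # linear probing for simplicity

    def set(self, key, value):
        first_unused = None
        for i in range(len(self.keys)):
            j = self._probe(key, i)
            if self.keys[j] == key:               # old key: overwrite its value
                self.values[j] = value
                return
            if self.keys[j] is UNUSED and first_unused is None:
                first_unused = j                  # remember, but keep probing
            if self.keys[j] is EMPTY:             # key is new: insert it,
                j = first_unused if first_unused is not None else j
                self.keys[j], self.values[j] = key, value  # re-using UNUSED
                return
        if first_unused is not None:              # no EMPTY slot, but an
            self.keys[first_unused] = key         # UNUSED one can be re-used
            self.values[first_unused] = value
            return
        raise RuntimeError("hash array is full")

    def get(self, key):
        for i in range(len(self.keys)):
            j = self._probe(key, i)
            if self.keys[j] == key:
                return self.values[j]
            if self.keys[j] is EMPTY:             # unoccupied: key not present
                break
        raise KeyError(key)

    def remove(self, key):
        for i in range(len(self.keys)):
            j = self._probe(key, i)
            if self.keys[j] == key:
                self.keys[j] = UNUSED             # the slot remains occupied
                self.values[j] = None
                return
            if self.keys[j] is EMPTY:
                break
        raise KeyError(key)
```

Note that `get` stops probing only at an EMPTY slot, never at an UNUSED one; this is exactly why a slot, once occupied, must remain occupied forever.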
A perfect hash function is one that has no hash collisions. If we have a perfect hash function, we need only to store the value in the key-value pair. Not storing the key saves on storage.
In some applications, we do not have a perfect hash function, but we still want to save storage by saving only the value. Often we care only if the key is present, but we don't care about the value. So, a value of 1 indicates that the key is present, and the special value of 0 indicates that the entire key-value pair is unoccupied (not present).
In this case, we pretend that the hash function is perfect (even though it's not). When there is a hash collision, we make an error. We accept those errors, if they are not too frequent. So, if the hash array says that a key-value pair is unoccupied (not present), then we know that it really is not present. If the hash array says that a key-value pair is occupied (present), then probably it is occupied, but it might be unoccupied, in which case, we have a hash-collision with a different occupied hash key.
For an interesting application of this idea, which is often used in formal verification, see Bloom filters on Wikipedia.
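The presence-only scheme can be sketched in a few lines (a simplified, single-hash-function cousin of a Bloom filter, which would use several hash functions; the class name is invented):

```python
class PresenceTable:
    """Store only one bit per slot: 1 = probably present, 0 = definitely absent."""

    def __init__(self, N):
        self.bits = [0] * N

    def add(self, key):
        self.bits[hash(key) % len(self.bits)] = 1

    def maybe_contains(self, key):
        # A 0 bit means the key is definitely not present.  A 1 bit means
        # it is probably present, but a hash collision with a different
        # occupied key can cause a false positive.
        return self.bits[hash(key) % len(self.bits)] == 1
```

Nothing but the bit is stored, so neither the key nor the value can be recovered; the structure answers only (approximate) membership queries.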