Hash table dictionary -- extra credit

Idea of a Hash Table

In this problem you will implement a dictionary using a hash table. The idea of a hash table is very simple, and decidedly hackish.

We are going to write a hash table to store strings, but the same idea can be adapted to any other datatype. We write a function (the hash function) that takes a string and "hashes" it, i.e. returns an integer (possibly large) that is obtained by manipulating the bytes of the input string. For a good hash function all bytes of the input string must affect the output, and the dependance of the output on the strings should be non-obvious. Here is an example of a good hash function (actual library code), which uses the fact that a character is interpreted as a small integer:

unsigned long hash_string(char* s)
  unsigned long h = 0; 
  for ( ; *s; ++s)  // s is a zero terminated string, hence the loops exits
    h = 5*h + *s;   //  when the terminating zero byte is reached, *s == NULL
  return h;
For the C++ class strings the same function will look like this:
unsigned long hash_string( const string& s )
  unsigned long h = 0; 
  for ( int i=0 ; i < s.length(); i++)  
    h = 5*h + s[i];   
  return h;

With some such hash function, we try the following idea. Declare an array of length N and, for any string str, store the string in the

hash_string(str) % N
slot in the array. Since the results of hashing a string appear random (but, of course, the result for any given string is always the same) the above number for the slot to store the string in will hopefully be distributed uniformly through the array. If we want to check if a string is already stored in our array, we hash the string and look at the slot determined by the above formula. The best distribution of indices for the above hash function has been observed to be with the following values of N:
  53,         97,           193,         389,       769,
  1543,       3079,         6151,        12289,     24593,
  49157,      98317,        196613,      393241,    786433,
  1572869,    3145739,      6291469,     12582917,  25165843,
  50331653,   100663319,    201326611,   402653189, 805306457, ...

It may happen that two different strings hash to the same slot number, collide. If N is large enough, there won't be many such collisions. This difficulty can be resolved as follows: our array will be an array of lists of strings instead of indivudual strings. These lists are called buckets in some books. Thus the algorithm for adding a string to the hash table is:
For a string str:

  1. Apply the hash function to str, take mod by N. This is the index in the hash array.
  2. Check every element of the linked list (bucket) at that index. If str is not already present, add it to the linked list.
Analogously, checking is a string is already in the table is done as follows:
For a string str:
  1. Apply the hash function to str, take mod by N. This is the index in the hash array.
  2. Check every element of the linked list (bucket) at that index. If str is found, return true, else return false.

Why use Hash Tables?

Imagine a dictionary implemented as one long linked list (of strings). Finding a word in it may require going all the way to the end of the list, checking every element for equality with the search string. This is very inefficient and slow.

Now look at the same process for the hash table. Hashing a string does not take long (see hash_string), and then we know the right bucket right away. Of course, the bucket is a linked list and has to be searched by comparing every element in it to the search string, but if the index obtained by hashing is roughly uniformly distributed across the array of buckets, the buckets should not be long (on average, a bucket will contain TOTAL_WORDS_STORED / N elements).

Thus a hash table is much better than an array or a likned list, because we zero in on the right bucket by hashing, and then have only a few words that collided at that bucket's index to check. All kinds of serious applications use hash tables, including C++ compilers that keep the names of defined variables, functions and classes in one.

Exercise (Problem D)

On UNIX systems there is a file that contains all frequently used words of the English language, usually /usr/dict/words. Here is one, zipped: words.zip. Write your hash table for strings, and fill it in by reading that file. Something like this will do:

  ifstream f( "words" );
  if( ! f ) {
     cerr << "Error opening words file\n";
  string s;
  HashTable dict;
  while( f >> s )   // f >> s fills s with a new word and returns true,
                    // or fails (when at eof) and returns false
     dict.Add( s );

There are 45402 words in my file, so reasonable choices for hash table size are 12289, 24593, and 49157. You can make you class keep statistics of the largest and the average number of collisions (i.e. sizes of buckets).

Then, in the same manner as above, read in a text file and print out all words that are possible spelling errors. Write a member function bool HashTable::Contains(const string & ). There is a Unix program ispell that does something like that.

Pesky technical problems

Notice that from the point of view of the program above, any sequence of non-whitespace characters is a word. Whitespace characters (space, newline and tab) are seen as word separators and skipped. Thus you need to remove any punctuation marks (except apostrophes) from your file being spell checked, before you can run the program on your text, so that ' "end." , "end!", "end?" and "end" are not different words. This can be done by another small program, or by search-and-replace in any good text editor. There is also the issue of capital vs. lowercase letters: "And" and "and" should be the same word. This issue, unfortunately, cannot be resolved with any text editor (other than Emacs) that I know of. I can pre-process for you any text file, as follows:

Both of these preparatory tasks can be carried out with just two commands of the UNIX operating system:

  tr -c '[A-Za-z\n]'  ' '   < text-fragment.txt > text-proc.0
  tr     'A-Z'  'a-z'       < text-proc.0       > text-proc.txt
The first command changes any character which is not a letter in the ranges A-Z or a-z or a newline \n into a space. The second command changes each capital letter to the corresponding lowercase one. See man tr on your UNIX system for more information. Nothing like that on MS Windows or MacOS.
Sergey Bratus
Last modified: Fri May 7 10:06:14 EDT 1999