Hash Tables

Are a form of maps common in compilers for symbol tables (tables storing variable's names and address) and operating systems for registry of environment variables (variables defining the shell options).

Components of a Hash Table

Bucket Arrays

  1. An array A with size N

  2. If keys are integers then insert will put entry e in array rank k, A[k] = e, where entry = (key, value). We could then recover the value given k, with get(k) by accessing A[k].

  3. Any bucket with array cell not assigned an element should contain a special NO_SUCH_KEY object. Naturally when the bucket array is first created with no keys assigned all cells contain NO_SUCH_KEY


The map methods get(k) and insert(k,e)  should take O(1).

There are problems:

If there are many keys or if N is too small then a collision can occur.  We have to take care of this.  If the keys are not integers then we need a function form keys to integers: a hash function.

Hash Functions

A hash function maps general keys to a finite range of integers.

Hash function: general keys --> finite set of integers

The hash function has two parts:

  1. Hash code,  maps general keys to integers.

    Hash code: general keys --> Whole range of integers (Not really.)

  2. Compression map,  maps the infinite set of integers to a finite range of integers.

    Compression map: Whole range of  integers --> finite range of integers

Hash Codes

We like for the hash code to avoid collisions, but this is not always possible.  Collisions are when distinctly different keys map to the same hash values.  Why is this possible?  Collisions are unavoidable because:

  1. finite set of integers even if we use int

  2. algorithm may get too complicated otherwise

We require that if keys are equal then the hash code should return the same integer value.  If hash code function is h, then:

if k1 = k2 then h(k1) = h(k2)

For efficiency of the hash table we would like for the hash code to spread out the mapping uniformly across the integers.

 Methods for Hashing

  1. Casting to Integer:

    This works great for int and char, but not very well for doubles or Strings.  Why? Chopping would give the same value for the hashcode.

  2. Summing components:

    static int hashCode(long i) { return (int)((i >> 32) + (int) i):}  // good for long

      In general Sum(xk) for all k where xk are the components.

      Good for longs mixes significant an insignificant parts, collisions still occur.  Not good for Strings. Why?  Hint programmers like to use similar variable names.

  3. Polynomial hash codes:

    Assume the key is of the form of a Object with multiple components, (x0, x1, x2, ..., xk-1), for example a String, the individual objects characters.  The polynomial:
     

    x0ak-1 + x1ak-2 + ... + xk-2a + xk-1

      where a is a constant of the hash code, mixes all the components.  Horner's form is more efficient to use:
       

    xk-1 + a(xk-2 + a(xk-3 + ... + a(x2 + a(x1 + ax0)) ... ))

      This is a very common hash code.  The author has checked for the number of collisions for 50,000 English words as a function of the  constants a.  They have found that a = 33, 37, and 41 have less then 7 collisions.  Java uses a = 31, which I suspect also gives low collision rates.

  4. Cyclic shift hash codes:

    A variant of the  polynomial hash code but replaces multiplication of a with cyclic shifts.  Again mixes the components.

Compression Maps

We need to compress the hash value so we can find the proper bucket in a finite array.

Methods for Compressing

  1. The Division Method:

    h(k) = |k| mod N

      N, the size of the array,  should be a prime number so that it spreads out the generated codes for typical sequence of integers.  Consider N=10 and sequence 10, 20 , 30, ... or 5, 10, 15, 20, ...

      We want to different integers two have probability 1/N for generating a bucket value.
       

  2. The Mad Method:

    Multiply-Add-Divide

      h(k) = |ak + b| mod N