Saturday, February 11, 2017

Polynomial hash function - Horner's method for calculating Hash code

There are various ways of computing hash code for strings. A naive way is to add the ASCII values of all the characters of a String. Doing so would lead to many collisions. Another way is to multiply the ASCII values of all characters in a string.
String in java uses polynomial hashing technique to compute hash code. The primary advantage of a polynomial hash function is that the computed hash code retains information about all the characters of String and thus avoids collisions.
The polynomial hash function used in Java String implementation is
For a String of size n,

h(x) = s[0]*R^(n-1) + s[1]*R^(n-2) + ... + s[n - 1]*R + s[n]  

For a String "abcd" and R = 31,
h(x) = 97*31^3 + 98*31^2 + 99*31 + 100          ---------------------------- 1

Here R can be any prime number. The reason for using 31 is two fold.
Firstly, when we mod the resultant hash code with any other number, there are no common factors. Moding the hash code with array size is pretty common operation in a hash table.
Secondly, 31 is a Mersenne prime, a prime that's one less than a two power. Multiplying such a prime is very efficient on any processor as it involves two ops only a shift and subtract. For eg. 31 * n = n << 5 - n.

To evaluate a polynomial, Horner's method could be employed.

Horners polynomial evaluation:
hash = 0;
String str = "abcd";

for(char ch : str.toCharArray()) {
    hash = hash*31 + ch;
}

This results in the same value as the formula #1 above does.
iteration#1
hash = 0*31 + 97

iteration#2
hash = 31(97) + 98

iteration#3
hash = 31(97*31 + 98) + 99

iteration#4
hash = 31(97*31^2 + 98*31 + 99) + 100

         => 97*31^3 + 98*31^2 + 99*31 + 100   (same as formula#1 above)

Since java supports 32-bit integers, for larger strings, the above formula might lead to integer overflow and thus result in negative hash codes. So if the hash code needs to be used to put values into a hash table backed by an array, the sign bit needs to be discarded. This is achieved by (hash & 0x7FFFFFFF)
0x7FFFFFFF(2^31-1) is the Integer.MAX_VALUE in java.
&ing with MAX_VALUE retains all the bits except for the highest order bit(32nd bit), which is the sign bit. So the negative integer is changed to a positive one. This is much more efficient and correct compared Math.abs(). Math.abs() might return a negative value for Integer.MIN_VALUE.





No comments:

Post a Comment