Tech Blog: Why are PrimeNumbers used for computing modular hash codes ?

One of the objectives of a hash function is to uniformly distribute keys among the available buckets and avoid collisions to the possible extent. While every hash function starts out with this objective, very few hash functions are able to achieve near uniform distribution. Collisions cannot be avoided but definitely can be minimized.
Modular hashing is a popular hashing technique for generating hash codes of positive integers. The hash code of a positive integer is calculated as h(x) = x%m where m is the number of buckets/slots available.

For m = 10 and an universe of first hundred positive integers(keys), h(x) would uniformly distribute the keys to the available 10 buckets.

The data is uniformly distributed across 10 buckets with 10 keys each.
Real world data is seldom uniformly distributed. Data is usually biased or skewed. In such cases, this hash function doesn't uniformly distribute the data. A good hash function uniformly distributes any type of data.

Now assume that the data is skewed and the data primarily consists of multiples of 5. The modular hash function h(x) does a poor job of distributing the data.

Buckets 0 and 5 have bulk of the data while other buckets are largely empty. Why so ? The multiples of 5 and m have a common factor in 5 and so many of them hashed to bucket#5. The other multiples of 5 which are also multiples of 10 hashed to bucket#0. This is unavoidable. The multiples of m will hash to bucket#0 for any value of m.
(it would be interesting to perform the same exercise for m = 12 and keys that are multiples of 3)
In order to make the distribution more uniform, we need to reduce the number common factors between m and the keys. How do we do so ? by choosing a m, which is prime.

Now lets use the same hash function with a m value that's prime. h(x) = x%m m = 7. I chose 7 as its the nearest prime that's less than 10. 11 is also a good choice, but the number of buckets would be more.

For m = 7, the data is almost uniformly distributed.

Now for the same h(x), m = 7 and universe of first 100 positive integers, assume that the data is skewed towards multiples of 3.

Again h(x) does a good job of distributing the multiples of three.

In conclusion, the distribution of data depends both on h(x), m and the data. Since the input data can be skewed or non-uniformly distributed, we need to work on our hash function and m in order to achieve a uniform distribution among available buckets. In the above examples, I have played around with value of m to achieve a near uniform distribution of data. While 7 is not the best prime number to distribute the keys uniformly, it suffices this particular use case.
Finally, why prime numbers ? Every key that has a common factor with m, will hash to that common factor, thus skewing our distribution of keys. The number of common factors a key can have with m can be reduced by picking a m that's prime.

References:
1. http://cs.stackexchange.com/questions/11029/why-is-it-best-to-use-a-prime-number-as-a-mod-in-a-hashing-function
2. http://algs4.cs.princeton.edu/34hash/

Tech Blog

Friday, February 10, 2017

Why are PrimeNumbers used for computing modular hash codes ?

No comments:

Post a Comment