Introduction
Hashing is a very common technique for storing data in such a way that the data can be inserted and retrieved very quickly. Hashing uses a data structure called a hash table. Although hash tables provide fast insertion, deletion and retrieval, operations that involve searching, such as finding the minimum or maximum value, are not done very quickly. For these types of operations, other data structures are preferred (like Binary Search and so on).
The .NET Framework library provides a very useful class for working with hash tables, the Hashtable class. We will examine this class in the article, but we will also discuss how to implement a custom hash table. Building hash tables is not very difficult and the programming techniques used are well worth knowing.
An Overview Of Hashing
A hash table data structure is designed around an array. The array consists of elements 0 through some predetermined size, though we can increase the size later if necessary. Each data item is stored in the array based on some piece of the data, called the key. To store an element in the hash table, the key is mapped into a number in the range of 0 to the hash table size using a function called a hash function. Even with a good hash function, as you have probably guessed by now, it is possible for two keys to hash to the same value. This is called a collision and we need to have a strategy for dealing with collisions when they occur.
Choosing A Hash Function
Choosing a hash function depends on the data type of the key you are using. If your key is an integer, the simplest function is to return the key modulo of the size of the array. There are circumstances when this method is not recommended, such as when the keys all end in zero and the array size is 10. This is one reason why the array size should always be prime. Also, if the keys are random integers then the hash function should more evenly distribute the keys.
In many applications, however, the keys are strings. Choosing a hash function to work with keys is more difficult and should be chosen carefully. A simple function that at first glance seems to work well is to add the ASCII values of the letters in the key. The hash value is that value modulo the array size.
The following program shows how this function works:
- using System;
- namespace ConsoleApplication1
- {
- class Program
- {
- static void Main(string[] args)
- {
- string[] names = new string[99];
- string name;
- string[] someNames = new string[]
- {
- "C#", "Asp", "Dotnet", "MVC", "Pearl", "Ruby", "Paython", "Java", "Javascript", "Json"
- };
- int hashVal;
- for (int i = 0; i < 10; i++)
- {
- name = someNames[i];
- hashVal = SimpleHash(name, names);
- names[hashVal] = name;
- }
- ShowDistrib(names);
- Console.ReadLine();
- }
- static int SimpleHash(string s, string[] arr)
- {
- int tot = 0;
- char[] cname;
- cname = s.ToCharArray();
- for (int i = 0; i <= cname.GetUpperBound(0); i++)
- tot += (int) cname[i];
- return tot % arr.GetUpperBound(0);
- }
- static void ShowDistrib(string[] arr)
- {
- for (int i = 0; i <= arr.GetUpperBound(0); i++)
- if (arr[i] != null) Console.WriteLine(i + " " + arr[i]);
- }
- }
- }
The ShowDistrib subroutine shows us where the names are actually placed into the array by the hash function. As you can see, the distribution is not particularly even. The names are clustered at the beginning of the array and at the end.
There is an even bigger problem lurking here, though. Not all of the names are displayed. Interestingly, if we change the size of the array to a prime number, even a prime lower than 99, all the names are stored properly. Hence, one important rule when choosing the size of your array for a hash table (and when using a hash function such as the one we're using here) is to choose a prime number.
The size you ultimately choose will depend on your determination of the number of records stored in the hash table, but a safe number seems to be 10,007 (given that you're not actually trying to store that many items in your table). The number 10,007 is prime and it is not so large that enough memory is used to degrade the performance of your program.
Sticking with the basic idea of using the computed total ASCII value of the key in the creation of the hash value, this next algorithm provides for a better distribution in the array.
First, let's look at the code, followed by an explanation:
- static int BetterHash(string s, string[] arr)
- {
- long tot = 0;
- char[] cname;
- cname = s.ToCharArray();
- for (int i = 0; i <= cname.GetUpperBound(0); i++)
- tot += 37 * tot + (int) cname[i];
- tot = tot % arr.GetUpperBound(0);
- if (tot < 0) tot += arr.GetUpperBound(0);
- return (int) tot;
- }
This function uses Horner's rule to compute the polynomial function (of 37). See Weiss 1999 for more information on this hash function.