Hash TablesBy Eric Suh
Hash tables are an efficient implementation of a keyed array data structure, a structure sometimes known as an associative array or map. If you're working in C++, you can take advantage of the STL map container for keyed arrays implemented using binary trees, but this article will give you some of the theory behind how a hash table works.
Keyed Arrays vs. Indexed Arrays
One of the biggest drawbacks to a language like C is that there are no keyed arrays. In a normal C array (also called an indexed array), the only way to access an element would be through its index number. To find element 50 of an array named "employees" you have to access it like this:
In a keyed array, however, you would be able to associate each element with a "key," which can be anything from a name to a product model number. So, if you have a keyed array of employee records, you could access the record of employee "John Brown" like this:
One basic form of a keyed array is called the hash table. In a hash table, a key is used to find an element instead of an index number. Since the hash table has to be coded using an indexed array, there has to be some way of transforming a key to an index number. That way is called the hashing function.
A hashing function can be just about anything. How the hashing function is actually coded depends on the situation, but generally the hashing function should return a value based on a key and the size of the array the hashing table is built on. Also, one important thing that is sometimes overlooked is that a hashing function has to return the same value every time it is given the same key.
Let's say you wanted to organize a list of about 200 addresses by people's last names. A hash table would be ideal for this sort of thing, so that you can access the records with the people's last names as the keys.
First, we have to determine the size of the array we're using. Let's use a 260 element array so that there can be an average of about 10 element spaces per letter of the alphabet.
Now, we have to make a hashing function. First, let's create a relationship between letters and numbers:
A --> 0 B --> 1 C --> 2 D --> 3 ... and so on until Z --> 25.
The easiest way to organize the hash table would be based on the first letter of the last name.
Since we have 260 elements, we can multiply the first letter of the last name by 10. So, when a key like "Smith" is given, the key would be transformed to the index 180 (S is the 19 letter of the alphabet, so S --> 18, and 18 * 10 = 180).
Since we use a simple function to generate an index number quickly, and we use the fact that the index number can be used to access an element directly, a hash table's access time is quite small. A linked list of keys and elements wouldn't be nearly as fast, since you would have to search through every single key-element pair.
Collisions and Collision Handling
Problems, of course, arise when we have last names with the same first letter. So "Webster" and "Whitney" would correspond to the same index number, 22. A situation like this when two keys get sent to the same location in the array is called a collision. If you're trying to insert an element, you might find that the space is already filled by a different one.
Of course, you might try to just make a huge array and thus make it almost impossible for collisions to happen, but then that defeats the purpose of using a hash table. One of the advantages of the hash table is that it is both fast and small.
Collision handling with open addressing
The simplest collision handling algorithm is known as the open address method or the closed hashing method. When you are adding an element, say "Whitney," and you find that another element is already there ("Webster," for instance) then you would just proceed to the next element space (the one after "Webster"). If that is filled, you go on to the next one, and so on, until you find an empty space to insert the new element (all those extra elements came in handy after all!)
... 220 "White" | <-- ### COLLISION ### : Gotta move on to the next. 221 "Webster" | <-- ### COLLISION ### : Next one. 222 | Ahhh, perfect. Insert Here. 223 | ...Since we modified the insertion algorithm, we also have to change the function that finds the element. You have to have some way of verifying that you've found the element you want, and not some other element. The simplest way is to just compare keys. (Does this record have the last name "Whitney"? Does this one?) If the element you find is not one of them, just move on to the next element until you reach the one you want or you find an empty space (which means the element is not in the table).
Sounds simple, right? Well, it gets more complicated. What if you have so many collisions that you run off the end of the array?
If you're trying to insert "Zorba" and all the elements are filled because of the collision handling, then what? Look at the example:
... 258 "Whitney" | <-- Nope, not Empty 259 "Zeno" | Nope, not Empty ---------------- <-- Ummm, what now?
The easiest thing to do is to just wrap around to the beginning again. If there are still no empty spaces, then we have to resize the array, since there isn't enough space in the hash table for all of the elements. If we resize the array, of course, we'll have to come up with a tweak to our hash function (or at least how we handle it) so that it covers the right range of values again, but at least we'll have room. (Note that resizing the array means that occasionally inserting a value into the list will cause an O(n) copy operation to take place, but that on average this should happen only once for every n items inserted, so insertion should be on average constant time, O(1). (If you aren't sure what terms like "O(n") and "constant time" mean, take a look at the Cprogramming.com article series on algorithmic efficiency.) As you can see, resizing isn't all that bad--still, if you know the amount of space you will need to start with, you can save your program some work.
Handling collisions with separate chaining
A second collision handling strategy is to store a linked list at each element in the hash data structure. This way, when a collision occurs, you can just add the element into the linked list that is stored at the hash index. If you have only a single element with a particular hash value, then you have a single element list--no performance penalty. If you have a lot of elements hashing to the same value, you'll see a slowdown of course, but no more than you otherwise would see with hash collisions.
One nice thing about separate chaining is that having a bunch of values that hash "near" each other is less important. With open addressing, if you have a cluster of values that hash to nearly the same value, you'll run out of open space in that part of the hash. With separate chaining, each element that has a different hash value will not impact the other elements.
Resizing dynamically based on a load factor
Generally speaking, you wouldn't want your hash table to grow completely full because this will make lookups take much longer. If a value isn't in the array, with open addressing, you have to keep looking until you hit an empty location or you get back to the starting point--in other words, with a completely full table, lookups could be O(n), which is horrible. A real hash table implementation will keep track of its load factor, the ratio of elements to array size. If you have a 10 element array, with 7 elements, the load factor is 0.7. In fact, 0.7 is generally about the right time to resize the underlying array.
Choosing a Good Hash Algorithm
The more collisions you have, the worse the performance of your hash table will be. With enough elements in your hash table, you can get an average performance that's quite good--essentially constant time O(1). (The trick is to make the array grow over time as you start to fill up the array.) But if you have a lot of elements that hash to the same value, then you will have to start doing a lookup through a list of elements that all have the same hash value. This can make your hash lookups go from constant time to being, well, linear time in the number of elements. Imagine if your hash function hashed all values to 0, putting them in the first element of the array. Then it would be just a really complicated way of implementing a linear search.
Choosing a good hash algorithm can require some care and experimentation, and it will depend on your problem domain. If you're working with names, you probably don't want a hash algorithm that just looks at the first letter, because the letters of the alphabet are not used evenly--you'll find a lot more names that start with S than with Z. You also want to have your hash functions be fast--you don't want to lose all the time savings you're getting from the hash table because you're computing the hash function really slowly. It's a delicate balance. For one good hash function, check out this hash algorithm.
Now you're ready to implement your first hash table! Give it a try. It isn't too hard, and the end result is quite useful.Related articles
STL map, an associate array