Hash tables have always fascinated me in several ways. But maybe that's just my preference, since I'm always obsessed with the "performance-wise" side of programming, LoL
But recently I learned that hash table performance can gradually degrade when hash collisions occur (to put it simply, when two different keys end up mapped to the same bucket). Although there are collision resolution methods to handle this, it's quite challenging to avoid collisions when we're inserting key-value pairs from very large datasets, something like when we're dealing with database indexes or cached datasets of millions of records. In particular, how badly the performance degrades depends on the load factor.
The load factor can be interpreted as the number of elements we insert into a hash table divided by the number of buckets¹. Mathematically we can work it out like this: say you define the maximum size of the hash table as 25 buckets, then insert 10 elements. The load factor is 10 / 25 = 0.4, which is not that bad, since the commonly cited acceptable upper bound for the load factor is somewhere around 0.6 - 0.75.
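Written out as a formula (this is just the standard definition, with $\alpha$ for the load factor, $n$ for the number of inserted elements, and $m$ for the number of buckets):

$$\alpha = \frac{n}{m} = \frac{10}{25} = 0.4$$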
Now, if the load factor gets high, it's likely that hash collisions will occur, and this is one of several reasons the performance of a hash table decreases. Just like I said before, yes, we can handle hash collisions, but practically it's difficult to implement on your own. Let's take for example how to handle hash collisions with separate chaining (forgive me, I had to choose this example because I think it's the easiest way to understand the concept, LoL).
I won't go into detail about separate chaining, but I can simply explain it as: separate chaining uses a linked list to store the different entries that land in the same bucket. The way it works is also quite simple to implement.
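For context, here's a minimal sketch of the supporting types and helpers the insertion code assumes. The struct layout follows the footnotes, but the exact hash function and the missing error checks are my own guesses:

```c
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 25

typedef struct KeyValuePair {
    int key;
    char *value;
    struct KeyValuePair *next; /* next node in this bucket's chain */
} KeyValuePair;

typedef struct {
    KeyValuePair *buckets[NUM_BUCKETS]; /* each bucket heads a linked list */
} HashMap;

/* Division-method hash, as described in footnote 2. */
static unsigned int hash(int key) {
    return (unsigned int)key % NUM_BUCKETS;
}

/* Allocates a detached node; malloc error checks omitted for brevity. */
static KeyValuePair *create_kv_pair(int key, char *value) {
    KeyValuePair *pair = malloc(sizeof(KeyValuePair));
    pair->key = key;
    pair->value = strdup(value);
    pair->next = NULL;
    return pair;
}
```

With those pieces in place, the insertion routine itself: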
```c
void insert_hash_map(HashMap *map, int key, char *value) {
    unsigned int index = hash(key);
    KeyValuePair *pair = create_kv_pair(key, value);
    if (map->buckets[index] == NULL) {
        /* Empty bucket: the new pair becomes the head of the chain. */
        map->buckets[index] = pair;
    } else {
        /* Collision: walk the chain to its tail, then append there. */
        KeyValuePair *current_pair = map->buckets[index];
        while (current_pair->next != NULL) {
            current_pair = current_pair->next;
        }
        current_pair->next = pair;
    }
}
```

If you look at the code above, it's a sufficient linked list implementation. As long as the node (tracked by the current_pair variable) still has a successor, it moves on to the next node, until it reaches the tail (the very end of the linked list) and appends the new pair there. (The loop checks current_pair->next rather than current_pair itself, so that it stops at the tail instead of walking off the end and dereferencing NULL.) It's a very simple and straightforward implementation, but here's the thing (which I also don't really like about it):
- The performance is awful, seriously. If the chains keep growing over time, then the search process gets longer as well; in the worst case a single chain can grow longer than the bucket array itself because of how many elements we insert, and nothing stops every key from landing in the same chain (hell, it could be worse). Lookup then degrades into a linear scan, as the sketch right after this list shows.
- There's space that is wasted during the chaining process. Assuming there are 5 nodes in a chain and we look up the last element, the other 4 nodes will still be there no matter what, unless we are fully aware of them and manually remove unused nodes.
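To make the first point concrete, here's what a lookup looks like with these chains (a sketch; the lookup_hash_map function isn't in the original code, I'm adding it purely for illustration):

```c
/* Returns the value stored under `key`, or NULL if it's absent.
 * The cost is proportional to the chain length: if every key
 * hashed into the same bucket, this becomes a full linear scan. */
char *lookup_hash_map(HashMap *map, int key) {
    KeyValuePair *current_pair = map->buckets[hash(key)];
    while (current_pair != NULL) {
        if (current_pair->key == key) {
            return current_pair->value;
        }
        current_pair = current_pair->next;
    }
    return NULL;
}
```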
Alternative approaches?
But of course, there are several ways to prevent this from happening:

- First, keep the load factor as low as possible. If possible, track it dynamically and redistribute the keys once it gets too high, so the hash function's division method² keeps spreading keys evenly across the buckets; a sketch of this follows the list.
- Another alternative is not to use separate chaining at all, but open addressing, which resolves collisions with linear probing, for example; there's a sketch of that below as well.
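For the first point, here's a minimal sketch of what keeping the load factor down dynamically could look like. Everything in it is my own illustration, not from the original code: the DynHashMap variant (a heap-allocated bucket array plus a size counter), the maybe_resize name, and the 0.75 threshold are all assumptions:

```c
#define MAX_LOAD_FACTOR 0.75

/* A variant of HashMap for dynamic resizing: the bucket array lives
 * on the heap and the map tracks how many pairs it currently holds. */
typedef struct {
    KeyValuePair **buckets;
    unsigned int num_buckets;
    unsigned int size; /* number of stored pairs */
} DynHashMap;

/* Once the load factor crosses the threshold, double the bucket
 * count and re-hash every pair into the new, larger array. */
void maybe_resize(DynHashMap *map) {
    if ((double)map->size / map->num_buckets < MAX_LOAD_FACTOR) {
        return; /* load factor is still acceptable, nothing to do */
    }
    unsigned int new_count = map->num_buckets * 2;
    KeyValuePair **new_buckets = calloc(new_count, sizeof(KeyValuePair *));
    for (unsigned int i = 0; i < map->num_buckets; i++) {
        KeyValuePair *pair = map->buckets[i];
        while (pair != NULL) {
            KeyValuePair *next = pair->next;
            unsigned int index = (unsigned int)pair->key % new_count;
            pair->next = new_buckets[index]; /* push onto the new chain */
            new_buckets[index] = pair;
            pair = next;
        }
    }
    free(map->buckets);
    map->buckets = new_buckets;
    map->num_buckets = new_count;
}
```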
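And for the second point, a minimal sketch of insertion with open addressing and linear probing, reusing hash and create_kv_pair from the chaining sketch above (the OpenAddressMap type and the insert_open_address name are hypothetical, just for illustration):

```c
/* Open addressing: every pair lives directly in one flat array.
 * On a collision, linear probing simply tries the next slot
 * (wrapping around) until it finds a free one. */
typedef struct {
    KeyValuePair *slots[NUM_BUCKETS]; /* NULL means the slot is empty */
} OpenAddressMap;

int insert_open_address(OpenAddressMap *map, int key, char *value) {
    unsigned int start = hash(key);
    for (unsigned int i = 0; i < NUM_BUCKETS; i++) {
        unsigned int index = (start + i) % NUM_BUCKETS; /* probe sequence */
        if (map->slots[index] == NULL) {
            map->slots[index] = create_kv_pair(key, value);
            return 1; /* inserted */
        }
    }
    return 0; /* every slot taken: the table is full */
}
```

Note that with open addressing the table really can fill up, which is exactly why the first point, keeping the load factor low, matters even more there.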
Maybe that's all from me, thank you for reading!
Footnotes

1. The bucket referred to here is one of the pointers in the array that defines the linked lists from earlier. To be honest, from what I've read, you could say this is still an open question that depends on how you implement it and what you use: some people say a bucket is an array list, while on the other hand some people say it's an element of an array, so at this point I still can't give a clear definition of it. But in this particular context, since I'm using C, the HashMap struct has buckets whose pointers point to KeyValuePair structs, and the KeyValuePair struct has the attributes key, value, and next, which is what links it up as a linked list.
2. Here, I implemented the hash function as a division between the key we insert and the maximum number of buckets, taking the remainder to pick the bucket. To put it simply, if many keys leave the same remainder (for example, exactly 0 because they divide evenly by the bucket count), the keys end up messy or not properly distributed across the buckets, and this is also what causes hash collisions to occur. It's worth noting that there are several other ways to build a hash function besides this division method.