Java HashMap is an implementation of the Java Map interface. It allows for storing key-value pairs. There are several implementations of the Map interface, for instance, LinkedHashMap and TreeMap. HashMap is the most common one, and it is internally implemented as a hash table data structure. A hash table is what we can also call an associative array or dictionary.
Java HashMap Internals
An array is made of a contiguous set of values in memory. All the elements have the same type and can be accessed randomly by an index, in constant time. Arrays have a fixed size that must be specified when they are declared, though, and this is a limitation.
Linked lists are more complex. They can be made of non-contiguous elements in memory, and their size can grow dynamically. Each element of a linked list has at least two parts: the first stores a value in the form of a generic object, and the second holds a pointer to the next element of the list.
A pointer with a null value means that there are no further elements. A linked list element may also contain two pointers, and in that case, the list can be traversed in both directions.
Unlike arrays, linked lists cannot be accessed randomly. To find an element we must traverse the list step by step, starting from the first element.
A hash table is a mix of the above two data structures and allows us to overcome the limitations of each. Hash tables are based on an array and make use of a so-called hash function. A hash function is passed an object meant to be a key and returns an integer. The returned integer, also called the hash value, is then used to calculate the index of the hash table array in which to put the key and an associated object, i.e. the value part of the key-value entry. Each hash table position in which a key-value entry is stored is also called a “bucket”.
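The mapping from key to bucket can be sketched as follows. This is an illustrative example, not the JDK source; the class and method names are made up for the purpose, and the power-of-two capacity mirrors what HashMap uses internally:

```java
// Illustrative sketch: how a hash value becomes a bucket index.
public class BucketIndexDemo {

    // Assumes the capacity (number of buckets) is a power of two,
    // so "h & (capacity - 1)" is equivalent to a modulo operation.
    public static int bucketIndex(Object key, int capacity) {
        int h = key.hashCode();     // the hash function output
        return h & (capacity - 1);  // normalized to a valid array index
    }

    public static void main(String[] args) {
        // Equal keys always land in the same bucket.
        System.out.println(bucketIndex("apple", 16));
        System.out.println(bucketIndex("apple", 16));
    }
}
```

Note that the index only depends on the key's hash value and the table size, which is why equal keys always map to the same bucket.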
Anyway, when does the linked list part come into play? We have said that a bucket is supposed to store a key-value entry, but in reality, what a bucket contains is a pointer to a linked list, and that linked list can contain one or more entries.
The hash function will always return the same value for equal keys. This is essential to allow retrieving an object from the hash table. Different keys on the other hand may return an identical hash value, and this situation is called a “collision”. As a matter of fact, the linked list part of the hash table is necessary because of possible collisions.
We can see how this works in the picture below. Two keys, key 1 and key 2, are passed to the hash function, and the hash function returns the same index (bucket number 1). The two entries related to key 1 and key 2 are stored one after the other, as entry 1 and entry 2, in the linked list associated with bucket 1.
To get entry number 2, the hash function calculates the hash value for key 2; from that value the bucket index is derived, the hash table's bucket is reached, and the corresponding linked list is traversed until the element with key 2 is found.
The linked list approach, called “separate chaining”, is not the only one we can think of. For example, there is another strategy called “linear probing”: in this case, the entry is put in the next available position of the hash table. This approach has a downside called “clustering”, i.e., growing groups of contiguously filled positions, which has a detrimental effect on performance.
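To make the contrast with separate chaining concrete, here is a toy sketch of linear probing. This is not how java.util.HashMap works (HashMap uses chaining); the class name and the fixed table size are assumptions made for illustration, and edge cases such as a full table are deliberately ignored:

```java
// Toy linear probing (open addressing): on a collision, step forward
// to the next free slot instead of appending to a linked list.
public class LinearProbingDemo {
    public static Object[] keys = new Object[8];
    public static Object[] values = new Object[8];

    public static void put(Object key, Object value) {
        int i = Math.abs(key.hashCode() % keys.length);
        // Probe forward (wrapping around) until we find an empty
        // slot or the slot already holding this key.
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        values[i] = value;
    }

    public static Object get(Object key) {
        int i = Math.abs(key.hashCode() % keys.length);
        // Follow the same probe sequence used by put().
        while (keys[i] != null) {
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % keys.length;
        }
        return null;  // an empty slot means the key is absent
    }
}
```

The clustering problem is visible in the probe loop: every collision fills the slot next to an already-filled one, so occupied runs tend to grow and lookups get longer.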
There is a quantity that represents a measure of hash table occupancy, called the load factor (LF). It is calculated as the number of entries divided by the number of buckets. If the load factor grows beyond a certain value, performance progressively degrades. With a low LF, and with the hash values calculated by the hash function distributed uniformly across the table, the possibility of collisions is minimized and performance is high.
Java HashMap Methods
The Java HashMap main methods are:
- put(K key, V value)
- get(Object key)
Its constructors allow specifying an initial capacity and even a customized load factor threshold (other than 0.75, which is the default).
The method put() is passed a key object and a value object, creates an Entry object from them, and adds it to the linked list of the bucket whose index is calculated by the hash function. If we use the put method a number of times and there are no collisions, each bucket will point to a linked list with only one entry (the ideal situation).
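A minimal usage example of the two methods above (the map contents and names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class PutGetDemo {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("Alice", 30);  // stored in the bucket derived from "Alice".hashCode()
        ages.put("Bob", 25);
        ages.put("Alice", 31);  // same key: the value is replaced, no new entry

        System.out.println(ages.get("Alice"));  // 31
        System.out.println(ages.get("Carol"));  // null: key not present
    }
}
```

Note that putting a value under an existing key replaces the old value rather than adding a second entry, because the key matches an entry already in the bucket's list.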
When we create a HashMap instance we have the choice to set an appropriate initial capacity to avoid its resizing when the LF reaches its threshold, as we will see in the next section.
In order to play the role of a map key, a class must override the Object hashCode() method. The hashCode method essentially implements the hash table's hash function. A hashCode() implementation is well designed when it distributes the hash values uniformly.
Actually, the index of the bucket in which to put the entry is not calculated directly from the value returned by hashCode. Java HashMap first computes an improved value from it, through an internal function that is supposed to spread the bits better, and that value is eventually normalized to the target bucket index.
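The spreading step can be shown concretely. The hash() body below is modeled on the one in OpenJDK's HashMap (Java 8 and later); the class name and the indexFor() helper are illustrative additions:

```java
// Bit-spreading step, modeled on OpenJDK's HashMap.hash() (Java 8+).
public class SpreadDemo {

    public static int hash(Object key) {
        int h;
        // XOR the high 16 bits into the low 16 bits, so that the bucket
        // index (taken from the low bits) also depends on the high bits.
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Normalization to a bucket index; capacity is a power of two.
    public static int indexFor(int hash, int capacity) {
        return hash & (capacity - 1);
    }
}
```

Without this step, hashCode() implementations that vary only in their high bits would always collide in small tables, since the masking step ignores everything above the low bits.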
The method get() uses hashCode() to calculate the hash value from the key object, and the array index from that. It then traverses the linked list corresponding to that index from the beginning, using the equals() method to match the wanted key.
Here is the signature of the equals method:
- public boolean equals(Object obj)
The right implementation of it returns true if the compared objects have the same internal state.
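Here is a sketch of a class suitable as a HashMap key, with hashCode() and equals() both derived from the same internal state (the Point class and its fields are invented for the example):

```java
import java.util.Objects;

// A well-behaved HashMap key: equal objects always produce
// equal hash values, because both methods use the same fields.
public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Point)) return false;
        Point other = (Point) obj;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

Overriding only one of the two methods breaks lookups: two equal keys with different hash values land in different buckets, so get() never finds the entry.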
Re-sizing and Re-hashing
The hash table data structure is based on an array, and the array has a fixed size, which represents its so-called capacity. When the LF reaches a certain value, new elements are no longer spread uniformly and performance progressively decreases. So, once the HashMap instance reaches its LF threshold, it is re-sized: the old array is replaced with a new one of twice the size, and the hash values are used again to place the entries in their new bucket locations. This procedure is called re-hashing.
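This is where choosing an initial capacity pays off. A rough sketch, with made-up sizes: if we expect about 1000 entries, a capacity of 2048 keeps the load factor below the default 0.75 threshold for the whole fill, so no re-hashing should occur:

```java
import java.util.HashMap;
import java.util.Map;

public class CapacityDemo {
    public static void main(String[] args) {
        // 1000 entries / 2048 buckets ≈ 0.49, below the 0.75 threshold,
        // so the internal array is never resized while filling the map.
        Map<Integer, String> map = new HashMap<>(2048);
        for (int i = 0; i < 1000; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size());  // 1000
    }
}
```

With the default capacity of 16, the same loop would instead trigger several rounds of resizing and re-hashing on the way to 1000 entries.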
Each linked list pointed to by an original bucket is then assigned a bucket in the new, re-sized table. To avoid the extra work of traversing each list during the re-hashing operation, its elements are taken one by one from the head, so they end up in reverse order.
We can see a visual representation of the above in the following picture:
The Java HashMap class is not ideal for concurrent executions, i.e. it is not thread-safe. To avoid this problem we are forced to synchronize the access from outside or use some other, thread-safe, implementation. We have basically these choices:
- wrapping the map with Collections.synchronizedMap()
- the legacy Hashtable class
- ConcurrentHashMap
The Hashtable option, a legacy of the earlier versions of Java, is slower than ConcurrentHashMap because of its internal model of synchronization.
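A short sketch of the ConcurrentHashMap option (thread count and key names are arbitrary): two threads update the same key through merge(), which ConcurrentHashMap performs atomically, so no external locking is needed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        Runnable task = () -> {
            for (int i = 0; i < 1000; i++) {
                // merge() is atomic on ConcurrentHashMap: read-modify-write
                // of the counter cannot be interleaved with another thread's.
                counts.merge("hits", 1, Integer::sum);
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(counts.get("hits"));  // 2000
    }
}
```

With a plain HashMap, the same code could lose updates (or worse), because concurrent modification of its internal structures is unsafe.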
We have seen in this article that HashMap makes use of array and linked list structures internally, and that gives us a rough idea of how it could perform in various scenarios. We learned that an intelligent choice of the initial capacity, for instance, could be important to avoid resizing. We have also seen that HashMap is not thread-safe and in concurrent scenarios could be replaced by ConcurrentHashMap.