Hash table is probably the most commonly used data structure in software industry. Most implementations focus on its speed instead of memory usage, yet small memory footprint has significant impact on large inmemory tables and database hash indexes.
In this post, I”ll provide a step by step guide for writing a modern hash table that optimize for both speed and memory efficiency. I’ll also give some mathematical bounds on how well the hash table could achieve, and shows how close we are to the optimal.
Let me start with a disclaimer. I now work at google, and this project (OPIC including the hash table implementation) is approved by google Invention Assignment Review Committee as my personal project. The work is done only in my spare time with my own machine and does not use and/or reference any of the google internal resources.
Common hash table memory usages
As mentioned earlier, most hash hash table focus on its speed, not memory usage. Consequently there’s not much benchmark compares the memory these hash table implementation consumes. Here is a very basic table for some high performance hash table I found. The input is 8 M keyvalue pairs; size of each key is 6 bytes and size of each value is 8 bytes. The lower bound memory usage is $(6+8)\cdot 2^{23} =$ 117MB . Memory overhead is computed as memory usage divided by the theoretical lower bound. Currently I only collect 5 hash table implementations. More to be added in future.
Memory Usage  Memory Overhead  Insertion Time  Query Time  

std::unordered_map  588M  5.03x  2.626 sec  2.134 sec 
sparse_hash_map  494M  4.22x  7.393 sec  2.112 sec 
dense_hash_map  1280M  10.94x  1.455 sec  1.436 sec 
libcuckoo  708M  6.05x  2.026 sec  2.120 sec 
klib khash  642M  5.48x  4.232 sec  1.647 sec 
The metrics above actually surprises me. For example, [sparse hash map][shm] is advertised to use 410 bits per entry, but the overhead is actually 4 times the lower bound. If the hash table were implemented as large keyvalue store index, and you have 1 TB of data, you’ll need at least 45TB of space to hold the data. That’s not very space efficient. Can we do better?
Overview of hash table types
There’s two major types of hash table, one is chaining and the other is open addressing. Chaining is quite common in most standard libraries, where the collision is handled by appending items into a linked list headed by the bucket the key is mapped to. Open addressing uses a different mechanism to handle collision: the key (and value) is inserted to another bucket if the bucket it attempt to insert is already occupied.
Open addressing has some clear advantages over chaining. First, it does not require extra memory allocation. This reduces memory allocation overhead and can possibly improve cpu caching. Moreover, in open addressing the developer has more control on memory layout – placing elements in buckets with certain order to make probing (search on alternative location for key) fast. Best of all, open addressing gives us better memory lower bound over chaining.
The hash collision rate affects the chaining memory usage. Given a hash table with $N$ buckets, we insert $M$ elements into the table. The expected collision number in the table is $M(1  (1  1/N)^{M1})$. For a table with 1000 buckets the expected collisions under high loads ($M/N > 80%$) are:
 80% > 440
 90% > 534
 100% > 632
Accounting the extra payload that chaining requires, we can now compute the lower bound for the overhead under different loads.
load  Chaining  Open Addressing 

100%  1.31x  1.00x 
90%  1.37x  1.11x 
80%  1.47x  1.25x 
70%  1.60x  1.42x 
50%  2.09x  2.00x 
25%  4.03x  4.00x 
Here I assume if the collision rate were 60%, half of it is chained and half of it fits the buckets. The actual number may have some digits off, but it doesn’t change my conclusion on choosing open addressing for hash table implementation.
Probing methods
In open addressing, hash collisions are resolved by probing, a search through alternative buckets until the target record is found, or some failure criteria is met. The following all belongs to some kinds of probing strategies:
 Linear Probing
 Quadratic Probing
 Double Hashing
 Hopscotch Hashing
 Robin Hood Hashing
 Cuckoo Hashing
For each of the probing method, we’re interested in their worst case and average case probing numbers, and is their space bound.
Linear Probing and Quadratic Probing
Linear probing can be represented as a hash function of a key and a probe number $h(k, i) = (h(k) + i) \mod N$. Similarly, quadratic probing is usually written as $h(k, i) = (h(k) + i^2) \mod N$. Both methods has worst case probing count $O(N)$, and are bounded on space usage. In other words, there no condition where we need to increase the bucket count and rehash.
Double hashing
Double hashing can be written as $h(k, i) = (h1(k) + i \cdot h2(k)) \mod N$. Same as linear probing and quadratic probing, it has worst case probing count $O(N)$, and is bounded on space usage.
Hopscotch Hashing
Here is the algorithm copied from wikipedia. This is how the collision is handled
If the empty entry’s index j is within H1 of entry i, place x there and return. Otherwise, find an item y whose hash value lies between i and j, but within H1 of j. Displacing y to j creates a new empty slot closer to i. If no such item y exists, or if the bucket i already contains H items, resize and rehash the table.
This mechanism has a good worst case probing number $O(H)$. However, since it could resize the hash table, the hash table size is unbounded.
Robin Hood Hashing
The concept for robin hood hashing is simple and clever. When a collision occur, compare the two items’ probing count, the one with larger probing number stays and the other continue to probe. Repeat until the probing item finds an empty spot. For more detailed analysis checkout the original paper. It’s worth to read.
The expected probing length is
Even under a high load, we still get very good probing numbers. The best thing about robin hood hashing is it does not expand the hash table, which is important because we want to build a hash table with bounded size. This is the probing strategy I chose.
Cuckoo hashing
The following description is also copied from wikipedia.
It uses two or more hash functions, which means any key/value pair could be in two or more locations. For lookup, the first hash function is used; if the key/value is not found, then the second hash function is used, and so on. If a collision happens during insertion, then the key is rehashed with the second hash function to map it to another bucket.
The expected probing number is below 2. However, the load factor has to be below 50% to achieve good performance. For using 3 hash functions, the load can increase to 91%. Combining linear/quadratic probing with cuckoo, the load factor can go beyond 80%. (All numbers comes from wikipedia).
Optimizing Division for Hash Table Size
I implemented a robin hood hashing prototype a month ago. The prototype
satisfy the low memory footprint, but hard to get it fast. The major
reason is the modulo operation is very slow on most platforms. For example,
on Intel Haswell the div
instruction on 64bit integer can take 3296
cycles. Almost all major hash implementation use power of 2 table size,
so that the modulo is just one bitwise and operation. The problem with
power of 2 table size is it scales too fast! If our data size is 1 bit
above 2GB, the table must be at least 4GB, giving us 50% load. Finding
a fast alternative modulo operation is critical for creating a table
with high load without loosing much performance.
Professor Lemire is probably the first person that addresses this issue. He wrote a blog post that provides a fast alternative to modulo.
1 2 3 

He named this method as fast range. Another intuitive way to think about it is scaling. Number $x$ ranges $\lbrack 0, 2^{32}1\rbrack$, multiplying it by $N$ then divide by $2^{32}$, the range becomes $\lbrack 0, N1\rbrack$.
There’s one big problem to apply fast range on probing. Probing usually add the probe bias to lower bits of the hashed key. Modulo and bitwise and preserves the lower bits information, but fast range only use the higher bits and the probe would have no effect on the output. The first bits where it can bias the output in fast range is $\frac{2^{32}}{N}$. Hence, writing a linear probing using fast range would be:
1 2 3 4 

To make the output correct we used division again, which makes it slow. Is there a better way?
Fast mod and scale
I created an alternative method with a more relaxed condition. Instead of finding a fast modulo replacement for all N, I want to find some N that satisfy fast modulo and can preserve the biases of probing.
The actual algorithm is pretty simple: First, mask the hashed key to the next power of 2 boundary, then multiply it by $\frac{N}{16}, N=8..15$. This is a combination of traditional power of 2 modulo and professor Lemire’s scaling method. The difference is now the scale can only get up to 2 times. In other words, only the least significant bit will get omitted when scaling. The probing implementation can be written as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 

This is the straight copy of my robin hood hash
implementation. When the probe is scaled by 2 it is guaranteed to
have biases on the output. The mask can be derived from leading zeros
of the capacity capacity_clz
, the scale is defined by the most
significant 4 bits of the capacity capacity_ms4b
. The capacity_ms4b
is precomputed on hash table creation or resizing time. It’s a round
up of desired capacity with finer granularity compare to power of 2
tables.
I used Intel Architecture Code Analyzer to analyze the instruction throughput of my methods, and the result is very satisfying:
 Power of 2 table with quadratic probing
 Block Throughput: 4.10 Cycles
 Total Num Of Uops: 9
 Fast mod and scale with quadratic probing
 Block Throughput: 4.15 Cycles
 Total Num Of Uops: 12
Benchmarks
I hope all these analysis didn’t bored you all! Turns out these analysis are all useful. We now have a hash table with very optimal memory usage but still having great performance.
The most impressive part is the memory usage. Under load 89% we achieve overhead 1.20x ~ 1.50x. The ideal overhead should be 1.12 but we have an extra byte used per bucket to determine whether the bucket is emptied or tumbstoned.
The insertion time is not as good as dense_hash_map
under high load.
The reason is robin hood hashing moves the buckets around during the
insert, but dense_hash_map
simply probe and insert it to an empty
bucket if found.
Luckily, robin hood hashing gets a faster lookup time compare to
dense_hash_map
. I think the major reason is robin hood hashing
results a great expected probing number, and the overall throughput
benefits from it.
The benchmark code is available at hash_bench. My robin hood hashing implementation is available at opic robin hood hashing.
Summary
Hash table implementations has been focused on its speed over memory usages. Turns out we can sacrifice some insertion time to gain way better memory utilization, and also improve the look up time. I believe this can be the new state of the art implementation for hash tables. Let me know what you think in the comments. :)
Many details were omitted in this post, but will be discussed on my next post. Some outlines for the things I’d like to cover would be
 Probe distributions under different probing strategies (linear probing, quadratic probing, double hashing, and some probing methods I created).
 Optimize probing by using gcc/clang vector extensions
 Deletion mechanisms, its performance, and how it affects probe distributions.
 Serialization and deserialization performance
 Performance with different popular hash functions
 Benchmark with other robin hood implementations
 Benchmark with other embedded keyvalue store.
I may not be able to cover all the above in my next post, so please put down your comment and let me know what do you want to read the most.
One more thing…
This robin hood hashing is implemented using my project Object Persistence In C (OPIC). OPIC is a new general serialization framework I just released. Any inmemory object created with OPIC can be serialized without knowing how it was structured. Deserializing objects from OPIC only requires one mmap syscall. That’s say, this robin hood implementation can work not only in a living process, the data it stored can be used as a keyvalue store after the process exits.
Right now, the throughput of OPIC robin hood hash map on small keys (6bytes) is 9M (1048576/0.115454). This is way better than most NoSQL keyvalue stores. The difference might come from write ahead logs or some other IO? I’m not sure why the performance gain is so huge. My next stop is to benchmark against other embedded keyvalue store like rocksdb, leveldb and so forth.
References
If you’d like to know more about robin hood hashing, here are some posts worth to read:
Edits
5/7/17
As people pointed out in hacker news and comment below, C++
std::string
has 24 bytes overhead on small strings, so the memory
comparison is not fair. I’ll conduct another set of benchmarks using
integers tonight.
Also, one of the author of libcuckoo (@dga) pointed out that libcuckoo would perform better if I use threadunsafe version. I’ll also update the benchmark with this new setup.
The short string problem brings up a question: what is the best practice to use C++ hash map with short strings? Isn’t this a common use case in daily programming? I tried to do some quick search but didn’t find any useful information, and I’m suck at C++… Any good idea on how to do this better?