One of my favorite data structures is the hash table because it’s simple & powerful.

You probably have used it before since it’s an efficient way to store key-value pairs.

There are some interesting computer science concepts behind the implementation of a hash table that are worth studying, and that’s exactly what we are going to do in this article!

The basic idea of a hash table is to allow us to efficiently (in `O(1)`

) retrieve data that is indexed by a key.

As a quick refresher, this is what using a hash table looks like in Ruby:

prices = { apple: 0.50, ice_cream: 3, steak: 10 }

To implement a hash table we need two components:

- A place to store the table entries
- A way to assign key/value pairs to a specific position (index) inside this data store

In other words, we need an array & a hash function.

The **hash function** is an important component of a hash table.

This function transforms the key into an index that can be used to lookup or update its associated value.

This is the big difference between plain arrays & hash tables.

In an array, you can only access values via their index and the index can only be a number. In a hash table, you **access values via their key & the key can be anything** (string, symbol, integer…), as long as you can write a hash function for it.

You can write a simple hash function for strings by converting all the letters into their ASCII values then adding them up.

**Here is an example**:

BUCKETS = 32 def hash(input) input.to_s.chars.inject(0) { |sum, ch| sum + ch.ord } % BUCKETS end

In this method we use `to_s`

to make sure we are working with a string.

This will help us avoid ‘undefined method’ errors. Then a combination of `chars`

(to convert the string into an `Array`

of its characters) & `inject`

for adding up the values.

Inside the block I used the `ord`

method to convert the characters into their ordinal values.

Finally, I used the modulo operator `%`

to make sure the resulting value fits into the array. We call each entry in this array a ‘bucket’.

Ideally we want all our buckets to be filled evenly, this will give us the best performance when we need to retrieve a value.

Let’s look at what happens when we test our hash function using this code:

# Create an array of size BUCKETS with all elements set to 0 table = Array.new(BUCKETS) { 0 } letters = Array('a'..'z') 10_000.times do # Create a random string input = Array.new(5) { letters.sample }.join # Count hash distribution table[hash(input)] += 1 end

This produces the following results:

[302, 290, 299, 309, 321, 293, 316, 301, 296, 306, 340, 321, 313, 304, 318, 296, 331, 306, 348, 330, 310, 313, 298, 292, 304, 315, 337, 325, 325, 331, 319, 291]

It looks like our keys are evenly distributed…

…but what happens if we ramp up the number of buckets?

In this example I used a bucket size of 128 (it was 32 before):

[22, 24, 33, 36, 41, 58, 61, 66, 97, 77, 88, 110, 89, 82, 123, 121, 119, 111, 147, 178, 136, 176, 144, 180, 190, 193, 185, 192, 223, 209, 208, 196, 215, 251, 233, 226, 231, 236, 219, 218, 227, 221, 206, 220, 208, 213, 201, 191, 182, 165, 188, 141, 160, 135, 130, 117, 139, 106, 121, 85, 70, 93, 74, 61, 57, 54, 40, 46, 32, 36, 30, 21, 25, 17, 14, 16, 16, 14, 8, 11, 5, 5, 1, 1, 2, 1, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 4, 3, 6, 0, 2, 9, 13, 11, 12, 14, 12, 23, 12, 22]

That doesn’t look like a great distribution anymore!

What’s going on?

The problem is that our hash function is not good enough because all the strings of the same size stay in a certain range. That’s why we have a lot of empty buckets in the middle.

We need a better way to convert our string into an index. Let’s take a look at one possible implementation.

BUCKETS = 256 def hash(input) input.to_s.each_char.inject(0) do |sum, ch| (sum << 8) ^ (ch.ord) ^ (sum >> 4) end % BUCKETS end

What’s happening here is bit shifting (with the `>>`

& `<<`

operators). The values are combined using the “exclusive OR operator” (`^`

).

This bit shifting will mix things up, which will give us a **better key distribution**. Not perfect, but it’s better than our simple ASCII-based function ðŸ™‚

If you want a proper hash function you would be looking at something like MurmurHash, which I believe is what Ruby uses.

We don’t have a useful hash table yet.

Why?

Well, you may have noticed that when two keys hash to the same index they will overwrite the old value, and that’s not good!

This is called a hash collision & there are a few strategies to deal with these.

**For example**:

- Double Hashing
- Linear probing
- Separate chaining

Let’s take a look at separate chaining, where we use a linked-list to store the entries for a particular “bucket”.

So if we assume that `:abc`

& `:ccc`

hash to the same index, our hash table would look something like this:

3: [:abc, 100] -> [:ccc, 200] 4: nil 5: [:yx, 50]

Then we will need a linear search to find the correct key.

This will have an impact on our performance because our lookup time can slowly degrade towards `O(n)`

, instead of the expected `O(1)`

.

If you are not familiar with this

`O(something)`

notation that’s called “Big-O Notation“.

To avoid the linked list from growing too deep & therefore degrade the performance of the hash table, it’s necessary to recreate the hash table using a bigger number of buckets.

Ruby does this for you automatically, but still good to know.

The purpose of this article is not to have you writing a hash table implementation, but to help you understand how they actually work, so I hope that you found that interesting!

Don’t forget to share this post to keep the blog going ðŸ™‚