In a hash table, new entries are placed using keys: the element associated with a key is stored at an index computed from that key. This process is called hashing.
Let k be a key and h(x) a hash function.
Then h(k) yields the index at which the element associated with k is stored.
When the hash function produces the same index for several keys, a conflict arises: it is unclear which value should be stored at that index. This is called a hash table collision.
There are several ways to deal with collisions:
- chaining;
- open addressing: linear probing, quadratic probing, and double hashing.
1. Chaining
The idea is simple: if the hash function assigns the same index to two elements, both are stored at that index, in a doubly linked list.
If slot j holds several elements, it contains a pointer to the first element of the list. If j is empty, it contains NIL.
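A minimal Python sketch of chaining follows. It is an illustration only: the class name `ChainedHashTable` and its methods are made up for this example, Python's built-in `hash()` stands in for h, and a plain Python list stands in for the doubly linked list.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, size=8):
        # None plays the role of NIL for an empty slot.
        self.slots = [None] * size

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        j = self._index(key)
        if self.slots[j] is None:
            self.slots[j] = []
        chain = self.slots[j]
        for pos, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[pos] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))

    def get(self, key):
        chain = self.slots[self._index(key)] or []
        for existing_key, value in chain:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable(size=4)
table.put(1, "a")
table.put(5, "b")  # 1 and 5 both land in slot 1 when the size is 4
print(table.get(1), table.get(5))
```

Both keys share slot 1, and each lookup walks the short chain stored there.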
2. Open addressing
Unlike chaining, open addressing never stores several elements in one slot: each slot contains either a single key or NIL.
There are several kinds of open addressing:
a) Linear probing
Linear probing resolves a collision by checking the next slot.
h(k, i) = (h′(k) + i) mod m,
where
- i is the probe number, starting at 0;
- h′(k) is an auxiliary hash function;
- m is the size of the table.
If a collision occurs at h(k, 0), then h(k, 1) is checked, and so on: the value of i increases linearly.
The problem with linear probing is that clusters of adjacent occupied slots build up. Inserting a new element then requires walking the whole cluster, so operations on the hash table take longer.
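The clustering effect is easy to see in a small sketch. The function name `linear_probe_insert` and the choice h′(k) = k mod m are assumptions made for this illustration, and `None` stands in for NIL.

```python
def linear_probe_insert(table, key):
    """Insert key using h(k, i) = (h'(k) + i) mod m and return its slot."""
    m = len(table)
    for i in range(m):
        index = (key % m + i) % m  # h'(k) = k mod m is an assumption
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("hash table is full")


table = [None] * 7
# 10, 17, 24 and 3 all have h'(k) = 3, so they pile up in one cluster.
positions = [linear_probe_insert(table, k) for k in (10, 17, 24, 3)]
print(positions)
print(table)
```

Each new key with the same home slot has to walk past every key inserted before it.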
b) Quadratic probing
It works like linear probing, except that the distance between successive probed slots is greater than one. This follows from the relation
h(k, i) = (h′(k) + c1·i + c2·i²) mod m,
where
- c1 and c2 are positive auxiliary constants;
- i is the probe number, starting at 0.
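As a rough illustration, the probe sequence can be computed directly from the quadratic relation. The helper name `quadratic_probes`, the choice h′(k) = k mod m, and the constants c1 = c2 = 1 are all assumptions picked for the example.

```python
def quadratic_probes(key, m, count, c1=1, c2=1):
    """Return the first `count` slots probed for key in a table of size m."""
    home = key % m  # h'(k) = k mod m is an assumption
    return [(home + c1 * i + c2 * i * i) % m for i in range(count)]


print(quadratic_probes(10, 7, 4))
```

The gaps between consecutive probes grow with i, so keys do not pile up in adjacent slots the way they do with linear probing.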
c) Double hashing
If a collision occurs after applying the hash function, a second hash function is used to find the next slot:
h(k, i) = (h1(k) + i·h2(k)) mod m
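A small sketch of the idea, assuming h1(k) = k mod m and h2(k) = 1 + (k mod (m − 1)); both functions and the name `double_hash_probes` are chosen only for this illustration (h2 must never return 0, otherwise the probe would not move).

```python
def double_hash_probes(key, m, count):
    """Return the first `count` slots probed for key in a table of size m."""
    h1 = key % m
    h2 = 1 + key % (m - 1)  # step size, never zero
    return [(h1 + i * h2) % m for i in range(count)]


# 10 and 17 share the same home slot (3) but follow different paths.
print(double_hash_probes(10, 7, 4))
print(double_hash_probes(17, 7, 4))
```

Because the step depends on the key, two keys that collide on the first probe usually diverge on the second.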
"Good" hash functions will not save you from collisions, but they will at least reduce their number.
Below we look at several methods for building such hash functions.
1. Division method
If k is a key and m is the size of the hash table, the hash function h() is computed as
h(k) = k mod m
For example, if m = 10 and k = 112, then h(k) = 112 mod 10 = 2. The value of m should not be a power of 2: powers of two in binary are 10, 100, 1000, …, so for m = 2^p, k mod m always returns just the p low-order bits of k.
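The example and the power-of-two caveat can be checked with a few lines of Python (the function name `division_hash` is made up for this sketch):

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


print(division_hash(112, 10))

# With m = 8 = 2**3 the hash is just the three low-order bits of the key.
k = 0b10110101
print(bin(division_hash(k, 8)), bin(k & 0b111))
```

The second line prints the same bits twice, which is why a power of two discards all information in the higher-order bits of the key.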
2. Multiplication method
h(k) = ⌊m (kA mod 1)⌋
where
- kA mod 1 extracts the fractional part of kA;
- ⌊ ⌋ rounds the value down to an integer;
- A is a constant strictly between 0 and 1. Donald Knuth suggested A ≈ (√5 − 1) / 2 as a good choice.
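A floating-point sketch of the multiplication method, using Knuth's constant. The function name `multiplication_hash` and the sample key and table size are assumptions for the example; production implementations typically use fixed-point integer arithmetic instead of floats.

```python
import math

A = (math.sqrt(5) - 1) / 2  # Knuth's suggested constant, about 0.618


def multiplication_hash(k, m):
    """Multiplication method: floor(m * frac(k * A))."""
    fractional_part = (k * A) % 1
    return math.floor(m * fractional_part)


print(multiplication_hash(123456, 16384))
```

Unlike the division method, the quality of the result does not depend much on m, so a power of two is a reasonable table size here.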
3. Universal hashing
In universal hashing, the hash function is chosen at random, independently of the keys.
Python Hash Tables Under the Hood
Are you a Python developer eager to learn more about the internals of the language, and to better understand how Python hash tables and data structures work? Or maybe you are experienced in other programming languages and want to understand how and where hash tables are implemented in Python? You’ve come to the right place!
By the end of this article, you will have an understanding of:
- What a hash table is
- How and where hash tables are implemented in Python
- How the hash function works
- What’s happening under the hood in dictionaries
- The pros and cons of hash tables in Python
- How Python dictionaries are so fast
- How to hash your custom classes
This tutorial is aimed at intermediate and proficient Python developers. It assumes basic understanding of Python dictionaries and sets.
Let’s dive into looking at Python hash tables!
Learning the Basics of Hash Tables
Before diving into the Python implementation details, you first need to understand what hash tables are and how to use them.
What Is a Hash Table
Have you ever thought about how a Python dictionary is stored in memory? The memory in our computers can be thought of as a simple array with numeric indexes:
A hash table is a structure that is designed to store a list of key-value pairs, without compromising on speed and efficiency of manipulating and searching the structure.
A hash table uses a hash function to compute an index, into an array of slots, from which the desired value can be found.
How Does a Hash Function Work
A hash function, in its simplest form, is a function that computes the index of the key-value pair — so you can quickly insert, search and remove elements in your memory array.
You should start off with a simple example: A dictionary-like object, containing 3 products and their inventory count.
As for the hash function, you would need a method to turn the string keys into numeric values so you can quickly look them up in the memory.
Try it yourself
Try to think of a simple operation to perform on string values to turn them into numeric values.
So you’ve decided to calculate the length of the string values, awesome! Don’t forget that the numeric values must be within 0 to 2 for them to fit inside the 3-element array. You can use the modulo operator on the lengths for that:
For example, avocados has 8 letters, and 8 % 3 = 2, so it is placed at index 2. This is the final array:
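Here is a sketch of that length-based hash in Python. Only `avocados` comes from the example above; the other two product names and all inventory counts are placeholders invented for this illustration, as is the function name `length_hash`.

```python
def length_hash(key, size):
    """Toy hash function: the length of the string modulo the table size."""
    return len(key) % size


# Placeholder inventory; only "avocados" appears in the text above.
inventory = {"apples": 430, "bananas": 312, "avocados": 27}

slots = [None] * 3
for product, count in inventory.items():
    slots[length_hash(product, len(slots))] = (product, count)

print(slots)
```

The three names have lengths 6, 7 and 8, so they happen to land in three different slots.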
Hurrah! You’ve just built your first hash table! Hold on for a second there. What happens when two keys have the same length? You won’t be able to insert both at the same index. This is called a hash collision.
How to Handle Collisions
Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. There are multiple strategies for collision handling, all of which require that the keys be stored in the table together with the associated values. Two of the most common ones are Open Addressing and Separate Chaining.
- Open Addressing: When a hash collision occurs, this algorithm proceeds in some probe sequence until an unoccupied slot is found. This strategy consists of storing all records in a single array. For example, a simple implementation of this strategy would be to proceed one element further every time the calculated index is already occupied, until an unoccupied spot is found. This implementation is called linear probing.
Consider the following example of a simple hash table mapping names to phone numbers using your hash function from earlier:
John and Lisa collide since they both hash to the index 0 . Lisa was inserted after John, so she was placed one index further, at 1 . Note that Sandra Dee has a unique hash, but nevertheless collided with Lisa Smith, who had previously collided with John Smith.
- Separate Chaining: As opposed to open addressing, this strategy consists of storing multiple arrays. Each record contains a separate array which holds all of the elements with the same calculated index. The following is the sample table from earlier, using the separate chaining strategy:
This time, Sandra Dee did not collide with Lisa Smith, since each element holds a pointer to an array of collided records.
Understanding Python Hash Tables
When you come to think about it, Python dictionary keys allow way more types than just integers. They can be strings, functions, and more. How does Python store these keys inside the memory, and know where to find their values?
You guessed it right: hash tables! Python implements hash tables under the hood, an implementation that is completely invisible to you as a developer. Nonetheless, it could be of great use for you to understand how Python hash tables are implemented, their optimizations, and how to use them wisely.
Let’s dive into some of the internals to better understand how all of these new concepts you’ve just learned are put to practice.
Exploring Python Hash Function
Python’s hash function takes a hashable object and hashes it into 32 or 64 bits (depending on the system architecture). The bits are well distributed, as can be seen in the following example, showing two very similar strings (“consecutive strings”, commonly used in dictionaries) with very different hash values:
This distribution lowers the odds of a hash collision, which in turn makes the dictionary much faster. Furthermore, you should know that a hash value is only constant for the current instance of the process. You might have stumbled upon different hashes of the same object and wondered why it happened. The main reason for this phenomenon is security related: hash tables are vulnerable to hash collision DoS attacks when using constant hash values. Python 3.4 introduced an implementation of SipHash to prevent these attacks. You can read more about it in PEP 456. The following demonstrates different hashes on two different runs:
Handling Collisions in Python
Python uses open addressing to resolve hash collisions. The Python source code suggests that open addressing is preferred over chaining since the link overhead for chaining would be substantial. Earlier, you’ve seen an example of linear probing. Python uses random probing, as can be seen in the source code, or in this very simplified Python code:
Much like linear probing, the first part of this algorithm proceeds in a fixed manner ( current_index*5+1 ). Python core developers found it to be good in common cases where hash keys are consecutive. The other part of this algorithm is what makes it random: the perturb variable depends on the actual bits of the hash, and as you’ve seen before, they are well distributed and not often the same even for similar keys.
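The probing scheme described here can be sketched as a generator. This is a simplified illustration rather than CPython's actual code: the name `generate_probes` follows the text, the shift value of 5 mirrors the constant used in CPython's dictionary implementation, and masking the hash to 64 unsigned bits is an assumption made so that negative Python hashes behave like C unsigned integers.

```python
from itertools import islice

PERTURB_SHIFT = 5


def generate_probes(hash_value, table_size):
    """Yield slot indices to try, in order; table_size must be a power of 2."""
    mask = table_size - 1
    perturb = hash_value & (2**64 - 1)  # treat the hash as unsigned
    index = perturb & mask
    while True:
        yield index
        perturb >>= PERTURB_SHIFT
        index = (index * 5 + perturb + 1) & mask


print(list(islice(generate_probes(1, 8), 8)))
```

Once `perturb` has been shifted down to zero, the `index * 5 + 1` recurrence alone visits every slot of a power-of-two table, so a free slot is always found eventually.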
Deleting Elements Using Dummy Values
In a perfect world with no collisions, deleting elements would be simple: just remove the element at the desired index so it can be refilled with other values. Unfortunately, we do not live in a perfect world, and collisions happen frequently. Remember the open addressing example from before?
Now assume you wanted to delete John Smith from the table. Seems trivial, right? Hash John Smith and delete the element located at the calculated index. Now, you might have already noticed the problem in this approach. After deleting John, Lisa is unreachable! Hashing Lisa leads to John’s now-empty slot, so the search stops before it ever reaches her. For this exact reason, Python implements dummy values: instead of completely erasing John’s element, it places a fixed dummy value there. When the algorithm encounters a dummy value, it knows that there was a value there but it got deleted. It then keeps probing forward.
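The following sketch shows dummy values at work with plain linear probing. It is a toy model, not Python's implementation: the home slots are hard-coded to match the phone book example (John and Lisa at 0, Sandra at 1), the phone numbers are invented, and the helper names are made up.

```python
FREE = object()   # slot that was never used
DUMMY = object()  # slot whose entry was deleted

# Home slots taken from the example; a real table would compute them.
HOME_SLOT = {"John Smith": 0, "Lisa Smith": 0, "Sandra Dee": 1}


def probe(key, size):
    start = HOME_SLOT[key]
    return ((start + step) % size for step in range(size))


def insert(table, key, value):
    # Assumes the key is not in the table yet.
    for index in probe(key, len(table)):
        if table[index] is FREE or table[index] is DUMMY:
            table[index] = (key, value)
            return index
    raise OverflowError("table is full")


def find(table, key):
    for index in probe(key, len(table)):
        slot = table[index]
        if slot is FREE:
            break  # a never-used slot ends the search
        if slot is not DUMMY and slot[0] == key:
            return index
    raise KeyError(key)


def delete(table, key):
    table[find(table, key)] = DUMMY  # leave a marker, not a hole


table = [FREE] * 4
for name, phone in [("John Smith", "555-0100"),
                    ("Lisa Smith", "555-0101"),
                    ("Sandra Dee", "555-0102")]:
    insert(table, name, phone)

delete(table, "John Smith")
print(find(table, "Lisa Smith"), find(table, "Sandra Dee"))
```

Replacing `DUMMY` with `FREE` inside `delete` makes the lookup for Lisa Smith fail, which is exactly the problem dummy values solve.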
Implementing Everything With Python
Now that you know what hash tables are, how the Python hash function works and how Python handles collisions, it’s time to see these things in action by exploring the implementation of a dictionary and the lookup method. The lookup method is used in all operations: search, insertion and deletion.
First thing you need to know is that Python initializes a dict with an 8 element array. Experiments showed that this size suffices for most common dicts, usually created to pass keyword arguments. Each element in the array holds a structure that contains the key, value and the hash. The hash is stored in order to not recompute it with each increase in the size of the dictionary (further explained in Exploring Python Hash Tables Optimizations: Dictionary Resize).
Up until now, memory indices were displayed as decimal values. From this point onward, you will notice that memory addresses will be displayed as binary integers. Don’t be scared! You can continue reading even if you’re not familiar with them — I only use it for you to better understand how Python hash tables work with bits, but it’s not mandatory to understand that.
On to the lookup method. Here’s a simplified Python version, followed by an explanation:
Line 5 calculates the index with the generate_probes method shown in the next code block below.
Line 8 checks if the found index is empty, and returns a tuple (index, None) for the caller to handle (search operation would raise a KeyError, insertion would insert), unless a dummy was found earlier, in which case return (index, DUMMY) for the caller to handle (search operation would raise KeyError but it had to continue searching after the dummy).
Line 9 checks if the found index contains a dummy value, if so it keeps searching and saving that index for line 8.
Line 12 compares the identities of the keys. If they are the same object, the index is returned.
If not, it compares the key value and the hash. As it is known, equal objects should have equal hashes, which means that objects with different hashes are not equal. If both are equal, the index is returned.
If the desired element hasn’t been found yet (remember that the slot wasn’t empty/dummy), this is a hash collision situation — In which the script goes back to the generate_probes method to compute the random new hash and then go back to the first step.
- Line 3 masks the hash bits with the size of the table minus one — For example, in a table with the size of 8, the last 3 bits would be taken (111 in binary equals 7 in decimal, so 3 bits can represent 0-7). The following demonstration shows a hash example with its last 3 bits taken for indexing:
- Line 7: Remember that the algorithm goes back to the generate_probes method if the desired element wasn’t found? This is the line that it goes back to. It computes the new index using a random probe and yields control back to the lookup method.
If you were to write the search operation, it would have looked along the lines of the following:
As I’ve mentioned before, the DUMMY and None handling is done within the caller — while this specific method raises a KeyError , other operations could have still used that index as you will soon see.
Would the insertion operation use the index received from the lookup method even if the element is None or DUMMY ?
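To make the walkthrough concrete, here is a compact sketch of `lookup`, `search` and an `insert` helper operating on a plain list. It is a reconstruction under assumptions, not the article's original listing, so the line numbers quoted above do not map onto it. Each slot is assumed to hold `None`, `DUMMY`, or a `(hash, key, value)` tuple, and the table is assumed to always keep at least one empty slot.

```python
DUMMY = object()  # marks a deleted slot
PERTURB_SHIFT = 5


def generate_probes(hash_value, table_size):
    mask = table_size - 1
    perturb = hash_value & (2**64 - 1)  # treat the hash as unsigned
    index = perturb & mask
    while True:
        yield index
        perturb >>= PERTURB_SHIFT
        index = (index * 5 + perturb + 1) & mask


def lookup(table, key):
    """Return (index, slot); slot is None or DUMMY when the key is absent."""
    hash_value = hash(key)
    first_dummy = None
    for index in generate_probes(hash_value, len(table)):
        slot = table[index]
        if slot is None:
            if first_dummy is not None:
                return first_dummy, DUMMY  # reusable deleted slot
            return index, None
        if slot is DUMMY:
            if first_dummy is None:
                first_dummy = index
            continue
        slot_hash, slot_key, _value = slot
        # Identity first, then hash plus equality.
        if slot_key is key or (slot_hash == hash_value and slot_key == key):
            return index, slot


def search(table, key):
    _index, slot = lookup(table, key)
    if slot is None or slot is DUMMY:
        raise KeyError(key)
    return slot[2]


def insert(table, key, value):
    index, _slot = lookup(table, key)
    table[index] = (hash(key), key, value)


table = [None] * 8
insert(table, "timmy", "red")
insert(table, "barry", "green")
print(search(table, "timmy"), search(table, "barry"))
```

In this sketch the answer to the question above is yes: `insert` writes to whatever index `lookup` returns, whether the slot was empty, a dummy, or already held the same key.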
Understanding Python Sets
Along with dictionaries, Python hash tables also serve as the underlying structure for sets. Both implementations are quite similar as can be seen in the source code, with the exception that sets do not store a value for each key, meaning the optimizations of Python, which are shown later in this article, are not applicable to them. The usage of hash tables for sets makes the lookup operation, which is used frequently in sets in order to keep them without duplicates, quite fast (as you should know by now, it always depends on the collisions).
Exploring Python Hash Tables Optimizations
The methods above are not entirely identical to Python’s. I didn’t want to overcomplicate things, so I’ve left out some details that make the Python dictionary blazing fast. The following section will guide you through building a custom dictionary, implementing the optimizations of Python hash tables.
Your first step would be to create a Dictionary class:
You will soon fill up these methods. Notice that generate_probes is now a static method — No reason for it to be an instance method since it does not use any instance attributes. lookup and search from before are not used since they both will change quite a bit.
Let’s dive into Python hash tables optimizations!
Compact dictionaries optimize the space that hash tables occupy. Before they were implemented, Python had sparse hash tables: each unoccupied slot took as much space as an occupied slot because it had to reserve space for the key, hash and value. Compact dictionaries introduced a much smaller table just for indices, and a separate table for the keys, values and hashes. This way, the indices table can be the sparse one while the bigger table is dense.
For example, before compact dictionaries, the following is what a dictionary and its corresponding memory array looked like (taken from Raymond Hettinger’s text):
Instead, with compact dictionaries, two different tables are built:
Take Timmy for example. It gets hashed into the 5th index, which contains the number 0 in the indices table, which in turn contains the actual timmy element in the entries table.
Raymond Hettinger, the creator of compact dictionaries, said:
The memory savings are significant (from 30% to 95% compression depending on the how full the table is).
In addition to space savings, the new memory layout makes iteration faster. keys/values/items can loop directly over the dense table, using fewer memory accesses.
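The two-table layout can be modelled in a few lines. The hash values below are small made-up numbers chosen so that timmy lands at index 5 as in the example; the other two entries and the `get` helper are illustrative, and collision handling is left out entirely.

```python
FREE = -1

# Dense table of (hash, key, value) entries, in insertion order.
entries = [
    (13, "timmy", "red"),    # made-up hash: 13 & 7 == 5
    (9, "barry", "green"),   # made-up hash: 9 & 7 == 1
    (23, "guido", "blue"),   # made-up hash: 23 & 7 == 7
]

# Sparse table that only stores small integers pointing into `entries`.
indices = [FREE] * 8
for position, (hash_value, _key, _value) in enumerate(entries):
    indices[hash_value & (len(indices) - 1)] = position


def get(key, hash_value):
    position = indices[hash_value & (len(indices) - 1)]
    if position == FREE:
        raise KeyError(key)
    _stored_hash, stored_key, value = entries[position]
    if stored_key != key:
        raise KeyError(key)  # a real table would keep probing here
    return value


print(indices)
print(get("timmy", 13))
```

Only the `indices` list contains unused slots, and each of those costs a single small integer rather than a full hash/key/value triple.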
This piece of code inserts new elements to a dictionary, and checks the size after each insertion until it reaches 1000 elements. Here’s how Python 3.8 compact dictionary sizes compare to Python 2.7 non-compact dictionaries:
In order to implement this feature in your custom dictionary, the dictionary needs to implement two separate arrays: one for the indices and one for the actual entries. Before you do that, there are other optimizations to consider, therefore the final implementation will take place soon, in Putting it All Together.
Meanwhile, you can create the method that builds the indices array, and initialize the class with indices , filled and used variables:
The indices array is of a single type, therefore DUMMY and FREE should be of the same type.
The indices array will hold the sparse table of indices (pointing to the actual entries)
used and filled variables are used to hold the size of the dictionary with and without dummy values (respectively)
_make_index_array uses the array module in order to compactly represent an array of basic values. It follows the logic of Python source code in order to determine the size of the indices.
Compact dictionaries were implemented in Python 3.6.
Key Sharing Dictionaries
Python 3.3 introduced key-sharing dictionaries. When dictionaries are created to fill the __dict__ slot of an object, they are created in split form. The keys table is cached in the type, potentially allowing all attribute dictionaries of instances of one class to share keys. This behaviour happens in the __init__ method of classes, and aims to save memory space and to improve speed of object creation. Your takeaway from this should be to always strive to assign attributes in the __init__ method so your custom classes can use key-sharing dictionaries.
The following visual demonstrates three instances of the same class:
You can see that each instance only holds its values, while there is a single, shared place in memory for the keys.
Let’s implement this awesome feature in your dictionary:
Line 1 creates a DictKey class which holds a key and its hash value
Lines 15-16 set two separate lists for entries: one for the keys, which may be shared with other instances, and one for the values.
Line 19: The dictionary is planned to be initialized with values. The update method takes care of that: no matter what type of argument is received, as long as it’s an iterable, the dictionary will use its contents.
Line 22 sets _sharing_keys to False in order to mark whether this instance shares its keys with other instances.
Line 23 checks if a Dictionary instance was given as an argument, then goes on to check if the current dictionary is already filled with keys and values. If so, it copies the keys and values from the other dictionary.
If the current dictionary is not filled, there’s the key sharing awesomeness! Line 28 copies the other dictionary’s indices in order to preserve functionality. keys and values must remain in sync, so values needs to be initialized with None values since you don’t want the values of the other dictionary. The other instance’s keys are then only referenced by self.keys .
If the other instance is not of type Dictionary , line 34 checks if it has a keys attribute, like a real dictionary. If so, it copies its keys and values.
If all else fails, line 38 treats the other instance as if it had keys and values in it and copies them.
Finally, line 40 copies the keys and values given in **kwargs as well.
_check_keys_sharing converts the dictionary to a non-shared dictionary. It is planned to be called when a shared dictionary tries to change its values.
Python checks the table size every time a key is added, and if the table is two-thirds full, it resizes the hash table. If a dictionary has 50000 keys or fewer, the new size is used_size * 4 , otherwise, it is used_size * 2 . Remember that Python stores the hash value along with the key and the value? This is where it comes in handy! Instead of rehashing the keys when inserting them into the new bigger table, the stored hashes are used. You might wonder: what if the key object was changed? In that case, the hash would need to be recalculated and the stored value would be incorrect. This cannot happen with built-in types, since mutable built-ins such as lists are unhashable and therefore cannot be dictionary keys. Here’s how you implement the resize operation:
Line 3: Python hash table sizes are powers of 2, so we will also use powers of 2. The primary reason Python uses “round” powers of 2 is efficiency: computing % 2**n can be implemented using bit operations, as you’ve seen before.
Line 4 builds a new indices array with the new bigger size.
Lines 5-9 loop through the keys. For every key, the code generates an index. If that index is free in the new indices table, it inserts that key’s index into the indices table. Essentially, what this piece of code does is allocate a new indices table and fill it up with the current keys. Notice no hashing is needed because the hashes are saved.
Line 10: filled should be equal to used since nothing was deleted yet — there should not be any dummy values.
Private Dictionary Versions
Python 3.6 added a new private version to dictionaries, incremented at each dictionary creation and at each dictionary change. The rationale is to skip dictionary lookups if the version does not change, and to use cached values instead. Your implementation can implement a version for each instance, although it won’t implement the actual caching of values.
The __version variable keeps a counter of the number of dictionary instances, so each instance can have a unique version.
_increase_version is planned to be called by operations such as __setitem__ and __delitem__ . It sets the current instance’s version to the latest of the class variable, and increases the class variable by one.
Line 9 initializes the dictionary with the latest version.
Putting It All Together
The full code can be found here.
It’s time to use all these cool new methods and make your dictionary usable!
It’s quite similar to the lookup method from earlier, only this time it uses self.indices as the hash table, FREE to check for empty slots instead of None , and self.keys to check for equality and identity of keys.
Same here! Very similar to the search method you’ve already seen, __getitem__ now uses a simple < 0 check to check if the slot is empty or dummy. If so, it returns a KeyError . If not, it returns the value from the values table.
Line 2 converts the dictionary to a non shared dictionary.
Line 6 increases the version if the slot is unoccupied.
Line 7 fills the hashed index in self.indices with the index of the key and value which is essentially self.used because it’s the last index of keys and values .
Line 13 increases self.filled if that slot was free. That’s done because if it weren’t free, then self.filled would have already included it.
Line 14 resizes the table by 3 if it’s more than 2/3 filled. It follows the logic of the CPython resize.
Line 17 checks if the new value differs from the value that is about to be replaced. The dictionary does not increase the version if nothing changes.
Line 2 converts the dictionary to a non shared dictionary.
Line 6 raises KeyError if the slot is unoccupied.
Line 10 inserts a DUMMY value instead of the actual value.
Line 12: You might have thought that a del operation might suffice, but it would have left a hole inside the keys and values table. These tables must not contain any holes. The solution is to swap with the last item and then delete the last item.
Line 18 changes the swapped element’s indices value to the current spot that’s being swapped.
This method checks if the dictionary contains a certain key. It does so by checking the result of _lookup , which is an index below zero only in the case of DUMMY or FREE .
The dictionary’s keys list holds DictKey instances. This method creates a new list of the actual keys (without the wrapper class) and wraps it inside an iterable.
Using your custom dictionary
Here’s how you can use your brand new dictionary:
The show method will display all of the attributes, including the version number and the indices table.
You can use the operations you’re used to from normal dictionaries:
In order to play with the keys sharing feature, initialize your dictionary with an already initialized dictionary:
MutableMapping could have been inherited in order to implement common methods like update and pop . I decided not to use it in order to showcase the update method.
As a side effect of using compact dictionaries, when iterating over the dictionary, the array of indices is not needed as the elements are sequentially returned from the entries table. Since elements are added to the end of the entries each time, the dictionary automatically preserves the order of entries.
It was an implementation detail in Python 3.6, but was declared a feature in Python 3.7.
Hashing Your Custom Classes
You want to use a custom class as a dictionary key, or as an element of a set. Now that you know what the hash method does, and how these structures are implemented, you know that you can take advantage of the __hash__ method and that you must also implement __eq__ because the keys are checked for equality. Note that it is required that objects which compare equal have the same hash value. It is advised to mix together the hash values of the attributes of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple. The following is an example of a custom class CustomClass hashing:
You can see that hash is being used for hashing a tuple of the custom class attributes, which should be fast enough, and that same tuple is also used for equality check.
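A minimal sketch of such a class is shown below. The attribute names `name` and `version` and the private `_key` helper are invented for the illustration; the pattern of hashing and comparing the same tuple is the part that matters.

```python
class CustomClass:
    def __init__(self, name, version):
        # Hypothetical attributes; treat them as read-only once hashed.
        self.name = name
        self.version = version

    def _key(self):
        # The same tuple drives both hashing and equality.
        return (self.name, self.version)

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        if not isinstance(other, CustomClass):
            return NotImplemented
        return self._key() == other._key()


first = CustomClass("parser", 2)
second = CustomClass("parser", 2)
registry = {first: "registered"}

print(first == second, hash(first) == hash(second))
print(registry[second], len({first, second}))
```

Because the two instances compare equal and hash equally, either one can be used to look up the entry stored under the other.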
Why is it important for hashable objects to contain __eq__ ?
You’ve seen the implementation of the lookup method, which contained the following line (line 63):
Notice that dict_key.key and key are both hashable objects that are being compared.
Understanding When to Use Python Hash Tables
Python hash tables (and hash tables in general) trade space for time. The need to store the key and the hash along with the value of the entries plus the empty slots, make the hash tables take up more memory — but also to be drastically faster (in most scenarios).
set vs list
What’s faster, set or a list, when checking if a certain value exists?
- Speed: You already know that lookups are faster in sets than in lists. That’s the whole point of having a hash table. Iterating over a list is slightly faster than iterating over a set. Don’t take my word for it:
- Space: If space is your main concern rather than speed, you are probably better off with lists — The difference is quite large.
Order: Sets are not ordered, while lists are. As we’ve talked about, sets do not implement the separate dense table that dictionaries do, which results in a single unordered table.
Hashing: Lists do not require their elements to be hashable as opposed to sets.
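You can run a comparison along these lines yourself. The collection size of 1000 integers and the repetition count are arbitrary choices for this sketch, and the timings will vary between machines, so no specific numbers are claimed here.

```python
import sys
import timeit

numbers_list = list(range(1000))
numbers_set = set(numbers_list)

# Membership test for the list's worst case: the last element.
list_time = timeit.timeit(lambda: 999 in numbers_list, number=10_000)
set_time = timeit.timeit(lambda: 999 in numbers_set, number=10_000)

print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s")
print(f"list size: {sys.getsizeof(numbers_list)} bytes, "
      f"set size: {sys.getsizeof(numbers_set)} bytes")
```

Note that `sys.getsizeof` reports only the container itself, not the integer objects it refers to, which is the fair comparison here because both containers reference the same objects.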
dict vs namedtuple
Unlike sets and lists, dictionaries and namedtuple are quite different in their uses. The main uses for namedtuple are when you want an immutable object that is much more readable than a dictionary and simpler than a class. As you will soon find out, namedtuple is also much smaller than a dictionary, so add that to your considerations!
- Speed: Usually, you won’t have to deal with very large namedtuple instances nor will you iterate over them. I will however show the speed comparisons of the lookup function shown above, just in case you will.
- Space: namedtuple takes drastically less space than a dictionary:
Order: Both are ordered (dictionaries preserve insertion order since Python 3.6, and this is guaranteed since Python 3.7).
Hashing: namedtuple instances only allow strings to be their attribute names, as opposed to dictionaries that allow everything as long as it is hashable.
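The space difference is easy to check for a small record. The `Point` type and its three fields are made up for this sketch, and exact byte counts depend on the Python version.

```python
import sys
from collections import namedtuple

Point = namedtuple("Point", ["x", "y", "z"])  # hypothetical record type

as_tuple = Point(1, 2, 3)
as_dict = {"x": 1, "y": 2, "z": 3}

print(sys.getsizeof(as_tuple), sys.getsizeof(as_dict))
print(as_tuple.x, as_dict["x"])
```

The namedtuple instance stores only three references, while the dictionary carries its own hash table.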
Python 3.7 added data classes that you may find easier to use than namedtuple , though this article won’t cover them.
Congratulations! You’ve now had a comprehensive overview on Python hash tables. You’ve learned an important concept in computer science — hash tables, how they are implemented in Python, the awesome optimizations of Python dictionaries, and when to use Python hash tables. You now know that hash tables trade space for time, and you can even practice comparing size and speed of different data structures yourself. I hope that you feel more confident choosing data structures for your next project!
Hash Tables and Hashmaps in Python
Data can be stored and accessed in a number of ways. One of the most important implementations is the hash table. In Python, hash tables are implemented through a built-in data type, the dictionary. In this article, you will learn what hash tables and hashmaps are in Python and how you can implement them using dictionaries.
Before moving ahead, let us take a look at all the topics of discussion:
- What is a Hash table or a Hashmap in Python?
- Hash table vs Hashmap
- Creating Dictionaries
- Creating Nested Dictionaries
- Performing Operations on Hash Tables using dictionaries
- Accessing Values
- Updating Values
- Deleting Items
- Converting a Dictionary into a Dataframe
What is a Hash table or a Hashmap in Python?
In computer science, a hash table or a hashmap is a type of data structure that maps keys to values (it implements the associative array abstract data type). It makes use of a function that computes, from the key, an index into an array of slots where the element is stored, searched for, or removed. This makes it easy and fast to access data. In general, hash tables store key-value pairs, and the position of each pair is derived from its key using a hash function.
Hash tables or hash maps in Python are implemented through the built-in dictionary data type. A dictionary’s keys are passed through a hashing function to locate their values. Dictionaries are mutable and, since Python 3.7, preserve insertion order.
An example of a dictionary can be a mapping of employee names and their employee IDs or the names of students along with their student IDs.
Moving ahead, let’s see the difference between the hash table and hashmap in Python.
Hash Table vs hashmap: Difference between Hash Table and Hashmap in Python
Dictionaries can be created in two ways:
- Using curly braces ({})
- Using the dict() function
Using curly braces:
Dictionaries in Python can be created using curly braces as follows:
Using dict() function:
Python has a built-in function, dict() that can be used to create dictionaries in Python. For example:
In the above example, an empty dictionary is created since no key-value pairs are supplied as a parameter to the dict() function. In case you want to add values, you can do as follows:
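A short sketch of the two creation styles, using the employee names and IDs that appear in this tutorial's sample output; the variable names are arbitrary.

```python
# Curly braces with key-value pairs.
employees = {"Dave": "001", "Ava": "002", "Joe": "003"}

# dict() with no arguments gives an empty dictionary.
empty = dict()

# dict() with keyword arguments or with an iterable of pairs.
from_keywords = dict(Dave="001", Ava="002", Joe="003")
from_pairs = dict([("Dave", "001"), ("Ava", "002"), ("Joe", "003")])

print(employees, empty)
print(from_keywords == employees, from_pairs == employees)
```

The keyword form only works when the keys are valid Python identifiers; the list-of-pairs form accepts any hashable key.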
Creating Nested Dictionaries:
Nested dictionaries are basically dictionaries that lie within other dictionaries. For example:
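One possible shape for such a nested dictionary is sketched here. The inner field names and the department values are invented for the illustration.

```python
# Each employee name maps to another dictionary holding their details.
employees = {
    "Dave": {"ID": "001", "Department": "Sales"},      # made-up details
    "Ava": {"ID": "002", "Department": "Engineering"},
}

# Chain the keys to reach a value in the inner dictionary.
print(employees["Ava"]["ID"])
print(employees["Dave"]["Department"])
```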
Performing Operations on Hash tables using Dictionaries:
There are a number of operations that can be performed on hash tables in Python through dictionaries, such as:
- Accessing Values
- Updating Values
- Deleting Element
The values of a dictionary can be accessed in many ways such as:
- Using key values
- Using functions
- Implementing the for loop
Using key values:
Dictionary values can be accessed using the key values as follows:
OUTPUT: '001'
There are a number of built-in functions that can be used such as get(), keys(), values(), etc.
dict_keys([‘Dave’, ‘Ava’, ‘Joe’])
dict_values([‘001’, ‘002’, ‘003’])
Implementing the for loop:
The for loop allows you to access the key-value pairs of a dictionary easily by iterating over them. For example:
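A sketch of such a loop, assuming the same `my_dict` as before:

```python
my_dict = {'Dave': '001', 'Ava': '002', 'Joe': '003'}

print("All keys and values")
# items() yields one (key, value) tuple per entry, in insertion order.
for name, employee_id in my_dict.items():
    print(name, ":", employee_id)
```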
All keys and values
Dave : 001
Ava : 002
Joe : 003
Dictionaries are mutable data types and therefore, you can update them as and when required. For example, if I want to change the ID of the employee named Dave from ‘001’ to ‘004’ and if I want to add another key-value pair to my dictionary, I can do as follows:
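This could look as follows; the new employee `'Chris'` with ID `'005'` is a hypothetical entry used only for the example:

```python
my_dict = {'Dave': '001', 'Ava': '002', 'Joe': '003'}

my_dict['Dave'] = '004'   # existing key: the value is overwritten
my_dict['Chris'] = '005'  # new key: a new key-value pair is added

print(my_dict)
# {'Dave': '004', 'Ava': '002', 'Joe': '003', 'Chris': '005'}
```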
Deleting items from a dictionary:
There are a number of ways to delete items from a dictionary, such as the del statement and the pop(), popitem(), and clear() methods. For example:
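A sketch of the deletion tools in action. The starting dictionary, including the `'Chris'` entry, is an assumption chosen so that only Joe's entry remains at the end:

```python
my_dict = {'Dave': '004', 'Ava': '002', 'Joe': '003', 'Chris': '005'}

del my_dict['Dave']                # remove the entry stored under 'Dave'
ava_id = my_dict.pop('Ava')        # remove 'Ava' and return her value
last_item = my_dict.popitem()      # remove and return the last inserted pair

print(my_dict)                     # {'Joe': '003'}

# clear() empties a dictionary in one call (shown on a copy here).
leftovers = dict(my_dict)
leftovers.clear()
print(leftovers)                   # {}
```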
The above output shows that all the elements except ‘Joe: 003’ have been removed from the dictionary using the various functions.
Converting Dictionary into a dataframe:
As you have seen previously, I have created a nested dictionary containing employee names and their details mapped to it. Now to make a clear table out of that, I will make use of the pandas library in order to put everything as a dataframe.
I hope everything shared in this tutorial is clear. This brings us to the end of our article on Hash Tables and Hashmaps in Python. Make sure you practice as much as possible and share your experience.
Python Hash Tables: Understanding Dictionaries
Hi guys, have you ever wondered how Python dictionaries can be so fast and reliable? The answer is that they are built on top of another technology: hash tables.
Knowing how Python hash tables work will give you a deeper understanding of how dictionaries work and this could be a great advantage for your Python understanding because dictionaries are almost everywhere in Python.
Before introducing hash tables and their Python implementation, you have to know what a hash function is and how it works.
A hash function is a function that can map a piece of data of any length to a fixed-length value, called a hash.
Hash functions have three major characteristics:
- They are fast to compute: calculating the hash of a piece of data has to be a fast operation.
- They are deterministic: the same input will always produce the same hash.
- They produce fixed-length values: it doesn't matter if your input is one, ten, or ten thousand bytes, the resulting hash will always be of a fixed, predetermined length.
Another characteristic that is quite common in hash functions is that they are often one-way functions: because the function deliberately discards information, you can get a hash from a string but you can't get the original string back from a hash. This is not a mandatory feature for every hash function, but it becomes important when the function has to be cryptographically secure.
Some popular hash algorithms are MD5, SHA-1, SHA-2, NTLM.
If you want to try one of these algorithms by yourself, just point your browser to https://www.md5online.org, insert a text of any length in the textbox, click the crypt button and get your 128bit MD5 hash back.
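The same experiment can be run locally. The following is a minimal sketch using the standard library's `hashlib` module; the input strings are arbitrary examples:

```python
import hashlib

# MD5 always produces a 128-bit digest, shown here as 32 hexadecimal characters.
digest = hashlib.md5(b"hello").hexdigest()
print(digest)            # 5d41402abc4b2a76b9719d911017c592
print(len(digest) * 4)   # 128

# A much longer input still yields a digest of the same length.
long_digest = hashlib.md5(b"hello" * 10_000).hexdigest()
print(len(long_digest) == len(digest))  # True
```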
Common Usages of Hashes
There are a lot of things that rely on hashes, and hash tables are just one of them. Other common usages of hashes are for cryptographic and security reasons.
A concrete example of this is when you download open-source software from the internet. Usually, you also find a companion file that is the signature (often called a checksum) of the file. This signature is just the hash of the original file, and it's very useful: if you calculate the hash of the file you downloaded yourself and it matches the signature that the site provides, you can be confident that the file hasn't been tampered with.
Another common use of hashes is to store user passwords. Have you ever asked yourself why, when you forget the password of a website and try to recover it, the site usually lets you choose another password instead of giving you back the original one? The answer is that the website doesn't store the password you chose, but just its hash.
This is done for security reasons: if an attacker gains access to the site's database, they will see only the hash of your password, not the password itself, and since hash functions are often one-way functions, they cannot simply compute the password back from the hash.
The Python hash() Function
Python has a built-in function to generate the hash of an object, the hash() function. This function takes an object as input and returns the hash as an integer.
Internally, this function invokes the .__hash__() method of the input object, so if you want to make your custom class hashable, all you have to do is to implement the .__hash__() method to return an integer based on the internal state of your object.
Now, try to start the Python interpreter and play with the hash() function a little bit. For the first experiment, try to hash some numeric values:
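A sketch of such an experiment. The values are arbitrary; the hashes of the two decimal numbers are printed without an expected value because they depend on the interpreter build:

```python
print(hash(1))     # 1
print(hash(1.0))   # 1

# Decimal (float) values get hashes that differ from the value itself,
# and a negative value gets a negative hash.
print(hash(0.1))
print(hash(-0.1))
```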
If you are wondering why these hashes seem to have different lengths, remember that the Python hash() function returns integer objects, which are printed with as many digits as their value needs; on a standard 64-bit Python 3 interpreter, every hash fits in a signed 64-bit value.
As you can see, by default the hash of a small integer is the value itself. Note that this works regardless of the type of the value you are hashing, so the integer 1 and the float 1.0 have the same hash: 1 .
What's so special about this? Well, this shows what you learned earlier, namely that hash functions are often one-way functions: if two different objects can have the same hash, it's impossible to reverse the process and go from a hash back to the original object. In this case, the information about the type of the original hashed object has been lost.
A couple of other interesting things you can notice by hashing numbers are that decimal numbers have hashes that are different from their value and that negative values have negative hashes. But what happens if you try to hash the same number you got as the hash of a decimal value? The answer is that you get the same hash, as shown in the following example:
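A minimal sketch of this collision. It relies on the CPython behaviour that a small non-negative integer hashes to itself, so hashing the float's hash again returns the same number:

```python
float_hash = hash(0.1)

# Hash the integer we just obtained: it hashes to itself,
# so the integer and the float 0.1 share one hash value.
int_hash = hash(float_hash)

print(float_hash == int_hash)  # True: two different objects, one hash
print(0.1 == float_hash)       # False: the objects themselves are not equal
```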
As you can see, the hash of the integer number 230584300921369408 is the same as the hash of the number 0.1 . This is perfectly normal if you think of what you learned earlier about hash functions: any number or string can be hashed to a fixed-length value, and a fixed-length value can represent only a finite number of different hashes, so some inputs must share the same hash. These duplicates do exist, and they are called collisions. When two objects have the same hash, it is said that they collide.
Hashing a string is not much different from hashing a numeric value. Start your Python interpreter and try hashing a string:
As you can see, a string is hashable and produces a numeric value as well, but if you have tried to run this command you may have noticed that your Python interpreter didn't return the same result as the example above. That's because, starting from Python 3.3, the values of string and bytes objects are salted with a random value before the hashing process. This means that the value of the string is modified with a random value, which changes every time your interpreter starts, before getting hashed. If you want to override this behaviour, you can set the PYTHONHASHSEED environment variable to a fixed integer before starting the interpreter (the value 0 disables the randomization altogether).
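The effect of PYTHONHASHSEED can be checked with a short sketch that starts fresh interpreters as subprocesses. The helper name `string_hash_in_new_interpreter`, the sample string, and the seed 46 are assumptions made for illustration:

```python
import os
import subprocess
import sys


def string_hash_in_new_interpreter(text, seed):
    """Return hash(text) as computed by a fresh interpreter with a fixed seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


# With the same seed, two separate interpreter runs agree on the hash.
first = string_hash_in_new_interpreter("Bad Behaviour", 46)
second = string_hash_in_new_interpreter("Bad Behaviour", 46)
print(first == second)  # True
```

Without a fixed seed, each new interpreter picks its own random salt, so the printed hash would normally change from run to run.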
As you may expect, this is a security feature. Earlier you learned that websites usually store the hash of your password instead of the password itself, so that an attacker who breaks into the site's database cannot steal all the site's passwords. However, if a website stores just the plain hash, it can be easy for attackers to find out what the original password was. They just need to get a big list of commonly used passwords (the web is full of these lists) and calculate the corresponding hashes to get what is usually called a rainbow table.
By using a rainbow table, the attacker may not be able to get every password in the database, but may still be able to steal a vast majority of them. To prevent this kind of attack, a good idea is to salt the passwords before hashing them, that is, to modify each password with a random value before calculating the hash.
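A minimal sketch of salted password hashing. It uses PBKDF2 from the standard library's `hashlib` as one possible salted scheme, which the article does not name; the function names, the iteration count, and the sample passwords are assumptions:

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, digest); a fresh random salt is drawn when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, expected_digest, iterations=100_000):
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)


salt_a, digest_a = hash_password("correct horse")
salt_b, digest_b = hash_password("correct horse")

print(digest_a != digest_b)                                # True: different salts
print(verify_password("correct horse", salt_a, digest_a))  # True
print(verify_password("wrong guess", salt_a, digest_a))    # False
```

Because every stored hash depends on its own salt, a table of precomputed hashes for common passwords no longer matches anything in the database.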
Starting from Python 3.3, the interpreter by default salts every string and bytes object before hashing it, preventing possible DoS attacks such as the one demonstrated by Scott Crosby and Dan Wallach in a 2003 paper.
A DoS attack (where DoS stands for Denial of Service) is an attack where the resources of a computer system are deliberately exhausted by the attacker so that the system is no longer able to provide service to its clients. In the specific attack demonstrated by Scott Crosby, the target system was flooded with a lot of data whose hashes collide, which made it use a lot more computing power to resolve the collisions.
Python Hashable Types
So at this point, you may wonder whether every Python type is hashable. The answer is no: by default, only immutable built-in types are hashable in Python. If you are using an immutable container (like a tuple), its contents must also be hashable for the container to be hashable.
If you try to get the hash of an unhashable type in Python, you will get a TypeError from the interpreter, as shown in the following example:
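A sketch of this behaviour with a few arbitrary sample objects:

```python
# A tuple that contains only immutable values is hashable.
print(isinstance(hash((1, 2, 3)), int))  # True

# Mutable containers, and tuples that contain them, are not.
unhashable = []
for candidate in ([1, 2, 3], {"a": 1}, (1, [2, 3])):
    try:
        hash(candidate)
    except TypeError as error:
        unhashable.append(candidate)
        print(f"{candidate!r}: {error}")
```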
However, every custom-defined object is hashable in Python and by default its hash is derived from its id. That means that two different instances of the same class have different hashes by default, as shown in the following example:
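A minimal sketch; the class `Point` is a hypothetical example that defines neither `__eq__` nor `__hash__`:

```python
class Point:
    """Hypothetical class that keeps the default, identity-based hash."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


first = Point(1, 2)
second = Point(1, 2)

# Same attribute values, but two distinct objects with distinct hashes.
print(hash(first) == hash(second))  # False
print(first == second)              # False: default equality is identity
```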
As you can see, two different instances of the same custom object by default have different hash values. However, this behavior can be modified by implementing a .__hash__() method inside the custom class.
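One way to do this, sketched on a hypothetical `Point` class, is to derive the hash from a tuple of the attributes and to define `__eq__` consistently, so that objects that compare equal also hash equal:

```python
class Point:
    """Hypothetical class whose hash depends on its coordinates."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Delegate to the hash of a tuple built from the internal state.
        return hash((self.x, self.y))


first = Point(1, 2)
second = Point(1, 2)

print(hash(first) == hash(second))  # True
print(len({first, second}))         # 1: the set treats them as the same key
```

The attributes used for the hash should not change while the object is stored in a set or used as a dictionary key.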
Now that you know what a hash function is, you can start examining hash tables. A hash table is a data structure that allows you to store a collection of key-value pairs.
In a hash table, the key of every key-value pair must be hashable, because the pairs stored are indexed by using the hash of their keys. Hash tables are very useful because the average number of instructions that are necessary to look up an element of the table is independent of the number of elements stored in the table itself. That means that even if your table grows ten or ten thousand times, the average time to look up a specific element is not affected.
A hash table is typically implemented by creating a number of buckets that will contain your data and indexing the data by hashing its keys. The hash value of the key determines the correct bucket to be used for that particular piece of data.
In the example below, you can find an implementation of a basic hash table in Python. This is just an implementation to give you the idea of how a hash table could work because as you will know later, in Python there’s no need to create your custom implementation of hash tables since they are implemented as dictionaries. Let’s see how this implementation works:
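The following is a reconstruction sketch of such a basic hash table, not necessarily the article's original listing. The names `Hashtable`, `bucket_size`, `buckets`, `_assign_buckets()` and `get_value()` come from the text; the list of capitals is an assumption, so any output quoted in the text may differ from what this sketch prints:

```python
import pprint


class Hashtable:
    def __init__(self, elements):
        # One bucket per input element; every bucket is a list of pairs.
        self.bucket_size = len(elements)
        self.buckets = [[] for _ in range(self.bucket_size)]
        self._assign_buckets(elements)

    def _assign_buckets(self, elements):
        for key, value in elements:
            hashed_value = hash(key)
            index = hashed_value % self.bucket_size
            self.buckets[index].append((key, value))

    def get_value(self, input_key):
        index = hash(input_key) % self.bucket_size
        # Scan only the bucket that the key hashes to.
        for key, value in self.buckets[index]:
            if key == input_key:
                return value
        return None

    def __str__(self):
        return pprint.pformat(self.buckets)


capitals = [
    ('France', 'Paris'),
    ('United States', 'Washington D.C.'),
    ('Italy', 'Rome'),
    ('Canada', 'Ottawa'),
]
hashtable = Hashtable(capitals)
print(hashtable)
print(f"The capital of Italy is {hashtable.get_value('Italy')}")
```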
Look at the for loop in the _assign_buckets() method. For each element of the hash table, this code calculates the hash of the key, derives the position of the bucket from the hash with a modulo operation, and appends a (key, value) tuple to that bucket.
Try to run the example above after setting the environment variable PYTHONHASHSEED to the value 46 and you will get the following output, where two buckets are empty and two other buckets contain two key-value pairs each:
Note that if you run the program without setting the PYTHONHASHSEED variable, you will probably get a different result because, as you already know, starting from Python 3.3 the hash function salts every string with a random seed before the hashing process.
In the example above you have implemented a Python hash table that takes a list of tuples as input and organizes them in a number of buckets equal to the length of the input list with a modulo operator to distribute the hashes in the table.
However, as you can see in the output, you got two empty buckets while the other two have two different values each. When this happens, it’s said that there’s a collision in the Python hash table.
Using the built-in hash() function, collisions in a hash table are unavoidable. You could decide to use a higher number of buckets to lower the risk of a collision, but you will never reduce the risk to zero.
Moreover, the more buckets you handle, the more space you waste. To test this, you can simply change the bucket size of your previous example, using a number of buckets that is two times the length of the input list:
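```python
class Hashtable:
    def __init__(self, elements):
        # Twice as many buckets as input elements.
        self.bucket_size = len(elements) * 2
        self.buckets = [[] for i in range(self.bucket_size)]
        self._assign_buckets(elements)

    # _assign_buckets() and get_value() stay as in the previous example.
```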
As you can see, two hashes collided and have been inserted into the same bucket.
Since collisions are often unavoidable, implementing a hash table also requires implementing a collision resolution method. The common strategies to resolve collisions in a hash table are:
- open addressing
- separate chaining
Separate chaining is the strategy you already implemented in the example above, and it consists of creating a chain of values in the same bucket by using another data structure. In that example, you used a nested list that has to be scanned entirely when looking for a specific value in an overcrowded bucket.
In the open addressing strategy, if the bucket you should use is busy, you just keep searching for a new bucket to use. To implement this solution, you need to make a couple of changes both to how you assign buckets to new elements and to how you retrieve values for a key. Starting from the _assign_buckets() method, you have to initialize your buckets with a default value and keep looking for an empty bucket if the one you should use has already been taken:
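A minimal sketch of this assignment step with linear probing, written as a standalone function rather than as the article's method. The function name `assign_buckets` and the integer keys are assumptions; integers are used because small integers hash to themselves, which makes the collisions predictable:

```python
def assign_buckets(elements, bucket_size):
    """Place (key, value) pairs into buckets, probing linearly on collisions."""
    if len(elements) > bucket_size:
        raise ValueError("more elements than buckets")

    buckets = [None] * bucket_size  # None marks an empty bucket
    for key, value in elements:
        index = hash(key) % bucket_size
        # Keep moving to the next bucket until a free one is found.
        while buckets[index] is not None:
            index = (index + 1) % bucket_size
        buckets[index] = (key, value)
    return buckets


# Keys 1 and 5 both map to bucket 1 of 4, so 5 is moved to bucket 2,
# which in turn pushes key 2 to bucket 3.
buckets = assign_buckets([(1, 'one'), (5, 'five'), (2, 'two')], 4)
print(buckets)  # [None, (1, 'one'), (5, 'five'), (2, 'two')]
```

This sketch assumes that all keys are distinct; it does not overwrite the value of a key that is inserted twice.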
As you can see, all the buckets are set to a default None value before the assignment, and the while loop keeps looking for an empty bucket to store the data.
Since the assignment of the buckets has changed, the retrieval process has to change as well, because in the get_value() method you now need to check the key stored in each bucket to be sure that the data you found is the data you were looking for:
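A self-contained sketch of a table that uses open addressing for both insertion and lookup. The class name `OpenAddressingHashtable`, the doubled bucket count, and the bounded probe loop are assumptions of this sketch; the bound prevents an endless loop when a missing key is searched in a full table:

```python
class OpenAddressingHashtable:
    def __init__(self, elements):
        # Twice as many buckets as elements, so free buckets always exist.
        self.bucket_size = max(2 * len(elements), 1)
        self.buckets = [None] * self.bucket_size
        for key, value in elements:
            index = hash(key) % self.bucket_size
            while self.buckets[index] is not None:
                index = (index + 1) % self.bucket_size
            self.buckets[index] = (key, value)

    def get_value(self, input_key):
        index = hash(input_key) % self.bucket_size
        # Probe at most bucket_size buckets, following the insertion path.
        for _ in range(self.bucket_size):
            entry = self.buckets[index]
            if entry is None:
                return None          # empty bucket: the key is not stored
            key, value = entry
            if key == input_key:     # compare keys, not just positions
                return value
            index = (index + 1) % self.bucket_size
        return None


capitals = [('France', 'Paris'), ('Italy', 'Rome'), ('Canada', 'Ottawa')]
table = OpenAddressingHashtable(capitals)
print(f"The capital of Italy is {table.get_value('Italy')}")
```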
During the lookup process, in the get_value() method you use the None value to check when you need to stop looking for a key and then you check the key of the data to be sure that you are returning the correct value.
Running the example above, the key for Italy collided with a previously inserted element ( France ) and for this reason has been relocated to the first free bucket available. However, the search for Italy worked as expected:
The main problem of the open addressing strategy is that, if you also have to handle deletions of elements from your table, you need to perform logical deletions instead of physical ones: if you physically delete a value that was occupying a bucket during a collision, the other collided elements will never be found.
In our previous example, Italy collided with a previously inserted element ( France ) and so it was relocated to the very next bucket. Removing the France element would make Italy unreachable, because Italy does not occupy its natural destination bucket, and that bucket would now appear to be empty during the lookup.
So, when using the open addressing strategy, to delete an element you have to replace its bucket with a dummy value, which indicates to the interpreter that it has to be considered deleted for new insertion but occupied for retrieval purposes.
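A minimal sketch of logical deletion with a dummy value. The class `ProbingTable`, the `DELETED` sentinel, and the integer keys are assumptions; integer keys are used because small integers hash to themselves, so keys 1 and 5 are guaranteed to collide in a table of 4 buckets:

```python
DELETED = object()  # dummy value: free for insertion, occupied for lookup


class ProbingTable:
    def __init__(self, bucket_size=8):
        self.bucket_size = bucket_size
        self.buckets = [None] * bucket_size

    def _probe(self, key):
        """Yield every bucket index once, starting from the key's own bucket."""
        index = hash(key) % self.bucket_size
        for _ in range(self.bucket_size):
            yield index
            index = (index + 1) % self.bucket_size

    def insert(self, key, value):
        first_free = None
        for index in self._probe(key):
            entry = self.buckets[index]
            if entry is None:
                if first_free is None:
                    first_free = index
                break
            if entry is DELETED:
                if first_free is None:
                    first_free = index   # remember it, but keep looking
            elif entry[0] == key:
                self.buckets[index] = (key, value)
                return
        if first_free is None:
            raise RuntimeError("table is full")
        self.buckets[first_free] = (key, value)

    def get_value(self, key):
        for index in self._probe(key):
            entry = self.buckets[index]
            if entry is None:
                return None              # a truly empty bucket ends the search
            if entry is not DELETED and entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for index in self._probe(key):
            entry = self.buckets[index]
            if entry is None:
                return False
            if entry is not DELETED and entry[0] == key:
                self.buckets[index] = DELETED
                return True
        return False


table = ProbingTable(bucket_size=4)
table.insert(1, 'France')   # lands in bucket 1
table.insert(5, 'Italy')    # collides with key 1, moves to bucket 2
table.delete(1)             # bucket 1 now holds the dummy value
print(table.get_value(5))   # Italy
table.insert(9, 'Spain')    # reuses bucket 1
```

If `delete()` stored `None` instead of the dummy value, the lookup for key 5 would stop at bucket 1 and report the key as missing.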
Dictionaries: Implementing Python Hash Tables
Now that you know what hash tables are, let’s have a look at their most important Python implementation: dictionaries. Dictionaries in Python are built using hash tables and the open addressing collision resolution method.
As you already know a dictionary is a collection of key-value pairs, so to define a dictionary you need to provide a comma-separated list of key-value pairs enclosed in curly braces, as in the following example:
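A sketch of such a definition. The player names and ratings below are placeholder values chosen for illustration, not data taken from the original article:

```python
# Keys are player names, values are ratings (illustrative numbers only).
chess_players = {
    "Carlsen": 2863,
    "Caruana": 2835,
    "Ding": 2791,
    "Nepomniachtchi": 2784,
    "Vachier-Lagrave": 2778,
}

print(len(chess_players))  # 5
```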
Here you have created a dictionary named chess_players that contains the top five chess players in the world and their current ratings.
To retrieve a specific value you just need to specify the key using square brackets:
If you try to access a non-existing element, the Python interpreter raises a KeyError exception:
To iterate over the entire dictionary you can use the .items() method, which returns an iterable object of all the key-value pairs as tuples:
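Both cases, a successful lookup and a missing key, can be sketched as follows. The ratings are illustrative placeholders and `"Kasparov"` is simply a key that is not in the dictionary:

```python
chess_players = {"Carlsen": 2863, "Caruana": 2835, "Ding": 2791}

# Square brackets return the value stored under an existing key.
print(chess_players["Caruana"])  # 2835

# A missing key raises KeyError, which can be caught like any exception.
try:
    chess_players["Kasparov"]
except KeyError as error:
    missing_key = error.args[0]
    print(f"KeyError: {missing_key!r}")
```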
To iterate over the keys or over the values of the Python dictionary, you can use the .keys() or the .values() methods as well:
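The three ways of iterating can be sketched like this (the ratings are illustrative placeholders):

```python
chess_players = {"Carlsen": 2863, "Caruana": 2835, "Ding": 2791}

# items() yields (key, value) tuples.
for player, rating in chess_players.items():
    print(player, rating)

# keys() and values() iterate over one side of the pairs only.
names = list(chess_players.keys())
ratings = list(chess_players.values())
print(names)    # ['Carlsen', 'Caruana', 'Ding']
print(ratings)  # [2863, 2835, 2791]
```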
To insert another element into the dictionary you just need to assign a value to a new key:
To update the value of an existing key, just assign a different value to the previously inserted key.
Please note that since dictionaries are built on top of hash tables, you can only insert an element if its key is hashable. If the key of your element is not hashable (like a list, for example), the interpreter raises a TypeError exception:
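Insertion, update, and the unhashable-key error can be sketched together. All names and ratings are illustrative placeholders:

```python
chess_players = {"Carlsen": 2863, "Caruana": 2835}

chess_players["Ding"] = 2791     # new key: the pair is inserted
chess_players["Carlsen"] = 2870  # existing key: the value is replaced

# A list is mutable and therefore unhashable, so it cannot be a key.
try:
    chess_players[["Grischuk", "Alexander"]] = 2777
except TypeError as error:
    print(f"TypeError: {error}")

# A tuple of strings is hashable and works as a key.
chess_players[("Grischuk", "Alexander")] = 2777
print(len(chess_players))  # 4
```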
To delete an element, you need to use the del statement, specifying the key you want to delete:
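For instance, with illustrative placeholder data:

```python
chess_players = {"Carlsen": 2863, "Caruana": 2835, "Ding": 2791}

# del removes the entry stored under the given key.
del chess_players["Caruana"]

print(chess_players)               # {'Carlsen': 2863, 'Ding': 2791}
print("Caruana" in chess_players)  # False
```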
Deleting an entry doesn't physically remove it from the underlying hash table; it just replaces the key with a dummy value so that the open addressing collision resolution method continues to work. The interpreter handles all this complexity for you, ignoring the deleted element.
The Pythonic Implementation of Python Hash Tables
Now you know that dictionaries are Python hash tables but you may wonder how the implementation works under the hood, so in this chapter, I will try to give you some information about the actual implementation of Python Hash Tables.
Bear in mind that the information I will provide here is based on recent versions of Python, because with Python 3.6 dictionaries have changed a lot and are now smaller, faster and even more powerful, as they are now insertion-ordered (the insertion-order guarantee was implemented in Python 3.6 and was officially recognized by Guido as a language feature in Python 3.7).
Try to create an empty Python dictionary and check its size: on the 64-bit interpreter used for this article, an empty Python dictionary takes 240 bytes of memory (the exact figures in this section vary between Python versions and platforms):
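A sketch of this measurement using `sys.getsizeof()`. No expected numbers are shown because they depend on the interpreter version and platform:

```python
import sys

empty = {}
print(sys.getsizeof(empty))     # size of an empty dict, in bytes

one_item = {"a": 1}
print(sys.getsizeof(one_item))  # size after storing a single pair
```

Note that `sys.getsizeof()` reports only the memory of the dictionary object itself, not of the keys and values it refers to.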
By running this example you can see that the base memory footprint of a Python dictionary is 240 bytes. But what happens if you decide to add a value? Well, it may seem odd, but the size doesn't change:
So, why hasn't the size of the dictionary changed? Because starting from Python 3.6 values are stored in a different data structure and the dictionary contains just a pointer to where the actual value is stored. Moreover, when you create an empty dictionary, the interpreter starts by creating a hash table with 8 buckets that takes just 240 bytes, so adding the first element to our dictionary hasn't changed the size at all.
Now try to add some more elements and see how your dictionary behaves, you will see that the dictionary grows:
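A sketch that records the points at which the dictionary grows. The exact sizes and resize points are not shown because they depend on the interpreter version; the variable names are assumptions:

```python
import sys

numbers = {}
last_size = sys.getsizeof(numbers)
print(f"empty: {last_size} bytes")

resize_points = []  # (number of elements, new size in bytes)
for i in range(1, 21):
    numbers[i] = i
    size = sys.getsizeof(numbers)
    if size != last_size:
        # The reported size changed, so the table was reallocated.
        resize_points.append((i, size))
        print(f"{i} elements: {size} bytes")
        last_size = size
```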
As you can see, the dict has grown after you have inserted the sixth and the eleventh element, but why? Because, to keep our Python hash table fast and reduce collisions, the interpreter resizes the dictionary whenever it becomes two-thirds full.
Now, try to delete all the elements in your dictionary, one at a time, and when you have finished, check the size again. You will find that even if the dictionary is empty, the space hasn't been freed:
This happens because dictionaries have a really small memory footprint and deletions are not frequent when working with dictionaries, so the interpreter prefers to waste a little bit of space rather than dynamically resize the dictionary after every deletion. However, if you empty your dictionary by calling the .clear() method, since it is a bulk deletion, the space is freed and the dictionary goes down to its minimum size of 72 bytes:
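The difference between the two ways of emptying a dictionary can be sketched as follows. Sizes are compared rather than printed as fixed numbers, since the byte counts depend on the interpreter version:

```python
import sys

# Empty one dictionary by deleting its keys one at a time.
numbers = {i: i for i in range(20)}
for key in list(numbers):
    del numbers[key]
size_after_deletes = sys.getsizeof(numbers)

# Empty an identical dictionary with a single clear() call.
numbers_cleared = {i: i for i in range(20)}
numbers_cleared.clear()
size_after_clear = sys.getsizeof(numbers_cleared)

print(len(numbers), len(numbers_cleared))     # 0 0
print(size_after_clear < size_after_deletes)  # True on CPython
```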
As you may imagine, the first insertion on this dictionary will make the interpreter reserve the space for 8 buckets, going back to the initial situation.
In this article you have learned what hash tables are and how they are implemented in Python.
A huge part of this article is based on Raymond Hettinger's talk at PyCon 2017.
Raymond Hettinger is a Python core developer and his contribution to Python development has been invaluable so far.
Updated: August 21, 2020