Hash table

From Citizendium

Latest revision as of 06:00, 26 August 2024


In computer science, a hash table is an unordered, dictionary-like data structure that provides very efficient insertion, deletion or lookup of elements. A hash table operates on pairs of keys and values; when a programmer presents a key, the hash table returns the corresponding value.

Hash tables are often used to implement dictionaries or sets. Internet routers usually use a hash table to correlate ranges of IP addresses to route paths.

How a hash table works

A hash table is typically implemented as an array in which value elements are stored, with a value's address within that array computed by feeding the key to the hash function. Any type of data can be used as key or value elements, and the key and value elements may be the same. The only requirement is the existence of a hash function which maps keys to hash values (almost always word-sized integers). For efficiency, the hash function should look in some sense like it maps each key to a random hash value; see universal hash function for an example formalization of this intuition.

For instance, if we wanted to use a hash table to store telephone numbers, indexed against the owner's name, we would have strings of names as key elements, and strings of phone numbers as value elements. Our hash function would, given a string, produce an integer. We would then store or retrieve a reference to that person's phone number at that index in the array of phone numbers. For the cost of executing the hash function once, we go directly to the phone number, avoiding costly search algorithms that must check multiple elements to find the one in question. As a result, insertion, deletion, or lookup of an element in a hash table takes only O(1), which means that the time necessary to perform these actions does not grow with the number of elements in the table.
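The phone-book example above can be sketched in Python. The names, numbers, table size, and the string hash itself are illustrative assumptions, not part of the original text, and collisions are ignored here for simplicity:

```python
# A toy version of the phone-book example: a fixed-size array indexed
# by a simple string hash. Table size, hash function, and data are
# illustrative; collision handling is deliberately omitted.

TABLE_SIZE = 16

def string_hash(key: str) -> int:
    """Map a string to an array index using a toy multiply-and-add hash."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h % TABLE_SIZE

slots = [None] * TABLE_SIZE  # the array of value elements

def insert(name: str, number: str) -> None:
    slots[string_hash(name)] = (name, number)

def lookup(name: str):
    entry = slots[string_hash(name)]
    if entry is not None and entry[0] == name:
        return entry[1]
    return None

insert("Alice", "555-0100")
print(lookup("Alice"))  # 555-0100
```

Note that lookup costs one hash computation plus one array access, regardless of how many entries the table holds; this is the O(1) behavior described above.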

Hash collisions

Hash tables execute on real computers, so the array of value elements must be finite. Hash algorithms tend to compute the hash of each key element modulo the size of the array. As a result, if N elements are added to a hash table that has an array of M elements, and if N > M, then by the pigeonhole principle, some elements must be assigned to the same array indices. This situation is called a hash collision, since two keys collide on a particular array element. A hash function that produces no collisions is called a perfect hash function.
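A short sketch makes the pigeonhole argument concrete. The table size M = 8 and the identity hash on integers are assumptions chosen for illustration:

```python
# Demonstrating a hash collision with a modulo-reduced hash.
# M and the identity hash are illustrative choices.

M = 8

def bucket(key: int) -> int:
    # identity hash on integers, reduced modulo the array size
    return key % M

# Two distinct keys that land on the same index: a hash collision.
print(bucket(3), bucket(11))  # 3 3

# With more keys than slots (N = 9 > M = 8), the pigeonhole
# principle guarantees at least one repeated index.
indices = [bucket(k) for k in range(9)]
assert len(set(indices)) < len(indices)
```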

A programmer can minimize the occurrence of hash collisions by carefully selecting a hash function. Such a selection must take into account the type of the key elements, the size of the hash table, and the typical use of the application. In all but a few special situations, it is impossible to totally avoid hash collisions. A good hash function should distribute the computed addresses fairly evenly over the range of addresses in the array for the type of keys being used, but a hash table implementation must still handle the situation where two different keys yield the same address.
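The effect of hash function choice on distribution can be sketched by comparing a deliberately poor hash (first character only) with a slightly better polynomial hash over a set of sample keys; the keys, the table size, and both hash functions are assumptions made for illustration:

```python
# Bucket occupancy under a poor hash versus a better one, for an
# illustrative table of M = 8 slots and hand-picked sample keys.
from collections import Counter

M = 8
keys = ["apple", "apricot", "avocado", "almond", "anise", "acorn", "fig", "plum"]

def first_char_hash(s: str) -> int:
    return ord(s[0]) % M            # ignores everything after the first letter

def poly_hash(s: str) -> int:
    h = 0
    for ch in s:
        h = h * 31 + ord(ch)        # classic multiply-and-add string hash
    return h % M

poor = Counter(first_char_hash(k) for k in keys)
better = Counter(poly_hash(k) for k in keys)

print(max(poor.values()), max(better.values()))  # 6 2
```

Because six of the sample keys begin with the letter a, the first-character hash piles them into a single bucket, while the polynomial hash spreads the same keys over more buckets with at most two per bucket.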

There are several known strategies for handling hash table collisions.

  1. separate chaining (also known as bucket hashing), by far the simplest, stores an array of linked lists of value elements. If two keys have the same hash value, then both are inserted into the same list. In the expected case, when each list contains no more than one element, insertions, deletions and lookups are still O(1). If, on the other hand, the hash function has poor distribution, some lists may grow very long. In that case, insertions, deletions and lookups slow to O(n), where n is the length of the longest list.
  2. linear probing attempts to store the element in the next available array index, tested in sequential order (for instance, the nth, the n+1th, the n+2th, and so on). In that sense, linear probing is an extension of the hash function. Linear probing has two major problems: first, it tends to worsen the distribution of the hash function by creating clusters of occupied array indices, and second, it gives the hash table a finite size since the number of array indices does not increase. In the worst case, insertions, deletions and lookups slow to O(n), where n is the length of the longest cluster.
  3. quadratic probing is very similar to linear probing, except that a different sequence is used to determine the next available array index. Instead of testing indices sequentially, quadratic probing tests the nth, the n+1th, the n+4th, the n+9th, and so on, with the offsets given by successive perfect squares. Unlike linear probing, quadratic probing does not tend to produce clusters.
  4. double hashing resolves collisions by applying a second hash function to the key to compute the step between successive probes, so two keys that collide at the same initial index follow different probe sequences. This method is most effective when two orthogonal hash functions can be found, and is thus limited by problem domain.
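The first strategy, separate chaining, is simple enough to sketch in full. The class name, table size, and use of Python lists in place of linked lists are illustrative assumptions:

```python
# Minimal sketch of separate chaining: each array slot holds a list
# of (key, value) pairs, so keys that collide can coexist in one slot.
# Python lists stand in for the linked lists described in the text.

class ChainedHashTable:
    def __init__(self, size: int = 8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def insert(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: replace value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # empty slot or collision: append

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key) -> None:
        i = self._index(key)
        self.buckets[i] = [(k, v) for k, v in self.buckets[i] if k != key]

table = ChainedHashTable()
table.insert("Alice", "555-0100")
table.insert("Bob", "555-0199")
print(table.lookup("Alice"))  # 555-0100
```

As long as the lists stay short, every operation touches only one bucket, preserving the expected O(1) cost; a badly distributed hash function degrades this to a linear scan of the longest list, as described above.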