Double hashing

Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is a classical data structure on a table $T$.

It uses one hash value as an index into the table and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is set by a second, independent hash function. Unlike the alternative collision-resolution methods of linear probing and quadratic probing, the interval depends on the data, so that values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.

Given two random, uniform, and independent hash functions $h_1$ and $h_2$, the $i$th location in the bucket sequence for value $k$ in a hash table of $|T|$ buckets is:

$$h(i,k) = (h_1(k) + i \cdot h_2(k)) \bmod |T|.$$

Generally, $h_1$ and $h_2$ are selected from a set of universal hash functions; $h_1$ is selected to have a range of $\{0, \ldots, |T|-1\}$ and $h_2$ to have a range of $\{1, \ldots, |T|-1\}$. Double hashing approximates a random distribution; more precisely, pair-wise independent hash functions yield a probability of $(n/|T|)^2$ that any pair of keys will follow the same bucket sequence, where $n$ is the number of stored elements.
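The probe sequence can be sketched in C as follows. This is a minimal illustration rather than a reference implementation: the table size of 11, the `EMPTY` sentinel, the restriction to non-negative `int` keys, and the helper names `dh_h1`, `dh_h2`, `dh_probe`, `dh_init`, `dh_insert` and `dh_find` are all assumptions made for the example, and the two hash functions are simple division hashes rather than members of a universal family.

```c
#define TABLE_SIZE 11u   /* |T|; prime, so every step in 1..10 is coprime to it */
#define EMPTY      (-1)  /* sentinel for an unused slot (keys assumed non-negative) */

/* Hypothetical hash functions for non-negative int keys. */
static unsigned dh_h1(int k) { return (unsigned)k % TABLE_SIZE; }             /* 0..|T|-1 */
static unsigned dh_h2(int k) { return 1u + (unsigned)k % (TABLE_SIZE - 1u); } /* 1..|T|-1 */

/* i-th location in the bucket sequence: (h1(k) + i*h2(k)) mod |T|. */
static unsigned dh_probe(int k, unsigned i)
{
    return (dh_h1(k) + i * dh_h2(k)) % TABLE_SIZE;
}

/* Mark every slot as unused. */
static void dh_init(int table[TABLE_SIZE])
{
    for (unsigned i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Insert k; returns the slot used, or -1 if all |T| probes found other keys. */
static int dh_insert(int table[TABLE_SIZE], int k)
{
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        unsigned slot = dh_probe(k, i);
        if (table[slot] == EMPTY || table[slot] == k) {
            table[slot] = k;
            return (int)slot;
        }
    }
    return -1;
}

/* Search for k; stops at the key, at an empty slot, or after |T| probes. */
static int dh_find(const int table[TABLE_SIZE], int k)
{
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        unsigned slot = dh_probe(k, i);
        if (table[slot] == EMPTY)
            return -1;
        if (table[slot] == k)
            return (int)slot;
    }
    return -1;
}
```

With these assumed functions, the keys 5, 16 and 27 all have a first hash of 5 but step sizes of 6, 7 and 8, so inserting them in that order places them in slots 5, 1 and 2 respectively.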

Selection of h2(k)

The secondary hash function should have several characteristics:

  • it should never yield an index of zero
  • it should cycle through the whole table
  • it should be very fast to compute
  • it should be pair-wise independent of $h_1(k)$
  • The distribution characteristics of $h_2$ are irrelevant. It is analogous to a random-number generator: it is only necessary that $h_2(k)$ be relatively prime to $|T|$.

In practice, if division hashing is used for both functions, the divisors are chosen as primes.
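The requirement that the step be relatively prime to the table size can be checked with a small C sketch. The helper `slots_visited` and the pair `div_h1`/`div_h2` are hypothetical names introduced here; the particular choice of divisors (`size` and `size - 2`, both assumed prime, such as 13 and 11) is one possible textbook-style choice and is not prescribed by the text above.

```c
#include <stdlib.h>

/* Count the distinct slots reached by `size` probes that start at `start`
   and advance by a fixed `step` each time. Returns 0 if allocation fails. */
static unsigned slots_visited(unsigned size, unsigned start, unsigned step)
{
    unsigned char *seen = calloc(size, 1);
    unsigned count = 0;

    if (seen == NULL)
        return 0;
    for (unsigned i = 0; i < size; i++) {
        unsigned slot = (start + i * step) % size;
        if (!seen[slot]) {
            seen[slot] = 1;
            count++;
        }
    }
    free(seen);
    return count;
}

/* Division hashing with two divisors, both assumed prime (e.g. 13 and 11). */
static unsigned div_h1(unsigned k, unsigned size) { return k % size; }
static unsigned div_h2(unsigned k, unsigned size) { return 1u + k % (size - 2u); }
```

For a table of 12 slots, a step of 4 reaches only 3 slots, while a step of 5 reaches all 12. With a prime size such as 13, every value that `div_h2` can return lies in 1..11 and is therefore coprime to the size, so the probe sequence cycles through the whole table.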

Analysis

Let $n$ be the number of elements stored in $T$; then $T$'s load factor is $\alpha = n/|T|$. The setting is as follows: two universal hash functions $h_1$ and $h_2$ are selected randomly, uniformly and independently to build a double hashing table $T$. All elements are put in $T$ by double hashing using $h_1$ and $h_2$. Given a key $k$, the $(i+1)$-st hash location is computed by:

$$h(i,k) = (h_1(k) + i \cdot h_2(k)) \bmod |T|.$$

Let $T$ have fixed load factor $\alpha$, with $0 < \alpha < 1$.

Bradford and Katehakis[1] showed the expected number of probes for an unsuccessful search in $T$, still using these initially chosen hash functions, is $\frac{1}{1-\alpha}$ regardless of the distribution of the inputs. Pair-wise independence of the hash functions suffices.

Like all other forms of open addressing, double hashing degrades toward linear-time search as the hash table approaches maximum capacity. The usual heuristic is to limit the table loading to 75% of capacity. Eventually, rehashing to a larger size will be necessary, as with all other open addressing schemes.
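As a worked illustration, taking the expected number of probes for an unsuccessful search to be $1/(1-\alpha)$, a few sample load factors (chosen here only as examples) show how quickly the cost grows as the table fills:

```latex
\[
\frac{1}{1-\alpha}:\qquad
\alpha = 0.5 \;\Rightarrow\; 2, \qquad
\alpha = 0.75 \;\Rightarrow\; 4, \qquad
\alpha = 0.9 \;\Rightarrow\; 10, \qquad
\alpha = 0.99 \;\Rightarrow\; 100 \text{ probes.}
\]
```

At the 75% loading heuristic the expected cost is therefore about 4 probes, and it grows without bound as $\alpha$ approaches 1.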

Enhanced double hashing

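Peter Dillinger's PhD thesis[2] points out that double hashing produces unwanted equivalent hash functions when the hash functions are treated as a set, as in Bloom filters: If $h_2(y) = -h_2(x)$ and $h_1(y) = h_1(x) + k \cdot h_2(x)$ (where $k$ here denotes the largest hash index used, not a key), then $h(i,y) = h(k-i,x)$ and the sets of hashes $\{h(0,x), \ldots, h(k,x)\} = \{h(0,y), \ldots, h(k,y)\}$ are identical. This makes a collision twice as likely as the hoped-for $1/|T|^2$.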

There are additionally a significant number of mostly-overlapping hash sets; if $h_2(y) = h_2(x)$ and $h_1(y) = h_1(x) \pm h_2(x)$, then $h(i,y) = h(i \pm 1, x)$, and comparing additional hash values (expanding the range of $i$) is of no help.
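The identical-set case can be reproduced with a short C sketch. It assumes the condition that the second key's step is the negation of the first key's step modulo the table size, and that its starting hash equals the first key's last hash. The helper names `plain_hashes` and `same_set`, and the concrete numbers in the usage note, are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>

/* out[i] = (h1 + i*h2) mod size for i = 0..n-1 (plain double hashing). */
static void plain_hashes(unsigned h1, unsigned h2, unsigned size,
                         unsigned out[], unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        out[i] = (h1 + i * h2) % size;
}

/* True if every value of a[] also occurs in b[]. When both arrays hold n
   distinct values (assumed here), this means the two sets are equal. */
static bool same_set(const unsigned a[], const unsigned b[], unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        bool found = false;
        for (unsigned j = 0; j < n; j++)
            if (a[i] == b[j])
                found = true;
        if (!found)
            return false;
    }
    return true;
}
```

For example, with a table size of 101 and four hashes per key, a key with first hash 3 and step 5 yields 3, 8, 13, 18. A second key with first hash 18 (that is, 3 + 3·5) and step 96 (that is, −5 mod 101) yields 18, 13, 8, 3: the same set in reverse order, so a set-based structure such as a Bloom filter cannot tell the two keys apart.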

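Adding a quadratic term $i^2$,[3] $i(i+1)/2$ (a triangular number) or even $i^2 \cdot h_3(x)$ (triple hashing) to the hash function improves the hash function somewhat[3] but does not fix this problem; if:

$$h_1(y) = h_1(x) + k \cdot h_2(x) + k^2 \cdot h_3(x),$$
$$h_2(y) = -h_2(x) - 2k \cdot h_3(x),$$

and

$$h_3(y) = h_3(x)$$

then

$$h(k-i, y) = h_1(y) + (k-i) \cdot h_2(y) + (k-i)^2 \cdot h_3(y) = h_1(x) + i \cdot h_2(x) + i^2 \cdot h_3(x) = h(i, x).$$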

Adding a cubic term $i^3$[3] or $(i^3 - i)/6$ (a tetrahedral number),[4] does solve the problem, a technique known as enhanced double hashing. This can be computed efficiently by forward differencing:

struct key;	// Opaque
extern unsigned int h1(struct key const *), h2(struct key const *);

// Calculate n hash values from two underlying hash functions
// h1() and h2() using enhanced double hashing.  On return,
// hashes[i] = h1(x) + i*h2(x) + (i*i*i - i)/6
// Takes advantage of automatic wrapping (modular reduction)
// of unsigned types in C.
void hash(struct key const *x, unsigned int hashes[], unsigned int n)
{
	unsigned int a = h1(x), b = h2(x), i;

	for (i = 0; i < n; i++) {
		hashes[i] = a;
		a += b;		// Add quadratic difference to get cubic
		b += i + 1;	// Add linear difference to get quadratic
				// i++ adds constant difference to get linear
	}
}
// Produces the same result, less legibly.
void hash_alt(struct key const *x, unsigned int hashes[], unsigned int n)
{
	unsigned int a = h1(x), b = h2(x), i;

	if (n == 0)
		return;
	hashes[0] = a;
	for (i = 1; i < n; b += i++)
		hashes[i] = a += b;
}
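The forward-differencing idea rests on the fact that the first difference of the tetrahedral term $(i^3 - i)/6$ is the triangular number $i(i+1)/2$, whose own first difference is $i + 1$. The following sketch compares a closed-form evaluation against forward differencing on raw integer inputs. The function names `edh_closed_form` and `edh_forward` are hypothetical, and the closed form assumes `i` is small enough that `i*i*i` does not wrap before the division by 6.

```c
/* Closed form: a + i*b + (i^3 - i)/6 in wrapping unsigned arithmetic.
   (i^3 - i) = (i-1)*i*(i+1) is always divisible by 6; i is assumed small
   enough that i*i*i itself does not wrap. */
static unsigned edh_closed_form(unsigned a, unsigned b, unsigned i)
{
    return a + i * b + (i * i * i - i) / 6u;
}

/* Same sequence by forward differencing: two additions per value. */
static void edh_forward(unsigned a, unsigned b, unsigned out[], unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        out[i] = a;
        a += b;      /* b holds the second hash plus the triangular number i(i+1)/2 */
        b += i + 1;  /* advance to the next triangular number */
    }
}
```

Starting from a first hash of 7 and a second hash of 3, both approaches give 7, 10, 14, 20, 29 for the first five values.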

References

  1. Bradford, Phillip G.; Katehakis, Michael N. (April 2007), "A Probabilistic Study on Combinatorial Expanders and Hashing" (PDF), SIAM Journal on Computing, 37 (1): 83–111, doi:10.1137/S009753970444630X, MR 2306284, archived from the original (PDF) on 2016-01-25.
  2. Dillinger, Peter C. (December 2010). Adaptive Approximate State Storage (PDF) (PhD thesis). Northeastern University. pp. 93–112.
  3. Kirsch, Adam; Mitzenmacher, Michael (September 2008). "Less Hashing, Same Performance: Building a Better Bloom Filter" (PDF). Random Structures and Algorithms. 33 (2): 187–218. CiteSeerX 10.1.1.152.579. doi:10.1002/rsa.20208.
  4. Dillinger, Peter C.; Manolios, Panagiotis (November 15–17, 2004). Bloom Filters in Probabilistic Verification (PDF). 5th International Conference on Formal Methods in Computer Aided Design (FMCAD 2004). Austin, Texas. CiteSeerX 10.1.1.119.628. doi:10.1007/978-3-540-30494-4_26.