Skip to content

11.3 Hash functions

11.3-1

Suppose we wish to search a linked list of length $n$, where each element contains a key $k$ along with a hash value $h(k)$. Each key is a long character string. How might we take advantage of the hash values when searching the list for an element with a given key?

If every element also contained a hash of the long character string, when we are searching for the desired element, we'll first check if the hashvalue of the node in the linked list, and move on if it disagrees. This can increase the runtime by a factor proportional to the length of the long character strings.

11.3-2

Suppose that we hash a string of $r$ characters into $m$ slots by treating it as a radix-128 number and then using the division method. We can easily represent the number $m$ as a 32-bit computer word, but the string of $r$ characters, treated as a radix-128 number, takes many words. How can we apply the division method to compute the hash value of the character string without using more than a constant number of words of storage outside the string itself?

1
2
3
    sum = 0
    for i = 1 to r
        sum = (sum * 128 + s[i]) % m

Use sum as the key.

11.3-3

Consider a version of the division method in which $h(k) = k \mod m$, where $m = 2^p - 1$ and $k$ is a character string interpreted in radix $2^p$. Show that if we can derive string $x$ from string $y$ by permuting its characters, then $x$ and $y$ hash to the same value. Give an example of an application in which this property would be undesirable in a hash function.

First, we observe that we can generate any permutation by a sequence of interchanges of pairs of characters. One can prove this property formally, but informally, consider that both heapsort and quicksort work by interchanging pairs of elements and that they have to be able to produce any permutation of their input array. Thus, it suffices to show that if string $x$ can be derived from string $y$ by interchanging a single pair of characters, then $x$ and $y$ hash to the same value.

Let us denote the $i$th character in $x$ by $x_i$, and similarly for $y$. The interpretation of $x$ in radix $2^p$ is $\sum_{i = 0}^{n - 1} x_i 2^{ip}$, and so $h(x) = (\sum_{i = 0}^{n - 1} x_i 2^{ip}) \mod (2^p - 1)$. Similarly, $h(y) = (\sum_{i = 0}^{n - 1} y_i 2^{ip}) \mod (2^p - 1)$.

Suppose that $x$ and $y$ are identical strings of $n$ characters except that the characters in positions $a$ and $b$ are interchanged: $x_a = y_b$ and $y_a = x_b$. Without loss of generality, let $a > b$. We have

$$h(x) - h(y) = \Big(\sum_{i = 0}^{n - 1} x_i 2^{ip}\Big) \mod (2^p - 1) - \Big(\sum_{i = 0}^{n - 1} y_i 2^{ip}\Big) \mod (2^p - 1).$$

Since $0 \le h(x)$, $h(y) < 2^p - 1$, we have that $-(2^p - 1) < h(x) - h(y) < 2^p - 1$. If we show that $(h(x) - h(y)) \mod (2^p - 1) = 0$, then $h(x) = h(y)$.

Since the sums in the hash functions are the same except for indices $a$ and $b$, we have

$$ \begin{aligned} (h(x) - h(y)) \mod (2^p - 1) & = ((x_a 2^{ap} + x_b 2^{bp}) - (y_a 2^{ap} + y_b 2^{bp})) \mod (2^p - 1) \\ & = ((x_a 2^{ap} + x_b 2^{bp}) - (x_b 2^{ap} + x_a 2^{bp})) \mod (2^p - 1) \\ & = ((x_a - x_b)2^{ap} - (x_a - x_b) 2^{bp}) \mod (2^p - 1) \\ & = ((x_a - x_b)(2^{ap} - 2^{bp})) \mod (2^p - 1) \\ & = ((x_a - x_b)2^{bp}(2^{(a - b)p} - 1)) \mod (2^p - 1). \end{aligned} $$

By equation $\text{(A.5)}$,

$$\sum_{i = 0}^{a - b - 1} 2^{pi} = \frac{2^{(a - b)p} - 1}{2^p - 1},$$

and multiplying both sides by $s^p - 1$, we get $2^{(a - b)p} - 1 = \big(\sum_{i = 0}^{a - b - 1} 2^{pi}\big)(2^p - 1)$. Thus,

$$ \begin{aligned} (h(x) - h(y))\mod(2^p - 1) & = \Bigg((x_a - x_b)2^{bp}\Bigg(\sum_{i = 0}^{a - b - 1} 2^{pi}\Bigg)(2^p - 1)\Bigg) \mod (2^p - 1) \\ & = 0, \end{aligned} $$

since one of the factors is $2^p - 1$.

We have shown that $(h(x) - h(y)) \mod (2^p - 1) = 0$, and so $h(x) = h(y)$.

11.3-4

Consider a hash table of size $m = 1000$ and a corresponding hash function $h(k) = \lfloor m (kA \mod 1) \rfloor$ for $A = (\sqrt 5 - 1) / 2$. Compute the locations to which the keys $61$, $62$, $63$, $64$, and $65$ are mapped.

  • $h(61) = \lfloor 1000(61 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 700$.
  • $h(62) = \lfloor 1000(62 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 318$.
  • $h(63) = \lfloor 1000(63 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 936$.
  • $h(64) = \lfloor 1000(64 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 554$.
  • $h(65) = \lfloor 1000(65 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 172$.

11.3-5 $\star$

Define a family $\mathcal H$ of hash functions from a finite set $U$ to a finite set $B$ to be $\epsilon$-universal if for all pairs of distinct elements $k$ and $l$ in $U$,

$$\Pr\{h(k) = h(l)\} \le \epsilon,$$

where the probability is over the choice of the hash function $h$ drawn at random from the family $\mathcal H$. Show that an $\epsilon$-universal family of hash functions must have

$$\epsilon \ge \frac{1}{|B|} - \frac{1}{|U|}.$$

Let $b = |B|$ and $u = |U|$. We start by showing that the total number of collisions is minimized by a hash function that maps $u / b$ elements of $U$ to each of the $b$ values in $B$. For a given hash function, let $u_j$ be the number of elements that map to $j \in B$. We have $u = \sum_{j \in B} u_j$. We also have that the number of collisions for a given value of $j \in B$ is $\binom{u_j}{2} = u_j(u_j - 1) / 2$.

Lemma

The total number of collisions is minimized when $u_j = u / b$ for each $j \in B$.

Proof

If $u_j \le u / b$, let us call $j$ underloaded, and if $u_j \ge u / b$, let us call $j$ overloaded. Consider an unbalanced situation in which $u_j \ne u / b$ for at least one value $j \in B$. We can think of converting a balanced situation in which all $u_j$ equal $u / b$ into the unbalanced situation by repeatedly moving an element that maps to an underloaded value to map instead to an overloaded value. (If you think of the values of $B$ as representing buckets, we are repeatedly moving elements from buckets containing at most $u / b$ elements to buckets containing at least $u / b$ elements.)

We now show that each such move increases the number of collisions, so that all the moves together must increase the number of collisions. Suppose that we move an element from an underloaded value $j$ to an overloaded value $k$, and we leave all other elements alone. Because $j$ is underloaded and $k$ is overloaded, $u_j \le u / b\le u_k$. Considering just the collisions for values $j$ and $k$, we have $u_j(u_j - 1) / 2 + u_k(u_k - 1) / 2$ collisions before the move and $(u_j - 1)(u_j - 2) / 2 + (u_k + 1)u_k / 2$ collisions afterward. We wish to show that

$$u_j(u_j - 1) / 2 + u_k(u_k - 1) / 2 < (u_j - 1)(u_j - 2) / 2 + (u_k + 1)u_k / 2.$$

We have the following sequence of equivalent inequalities:

$$ \begin{aligned} u_j & < u_k + 1 \\ 2u_j & < 2u_k + 2 \\ -u_k & < u_k - 2u_j + 2 \\ u_j^2 - u_j + u_k^2 - u_k & < u_j^2 - 3u_j + 2 + u_k^2 + u_k \\ u_j(u_j - 1) + u_k(u_k - 1) & < (u_j - 1)(u_j - 2) + (u_k + 1)u_k \\ u_j(u_j - 1) / 2 + u_k(u_k - 1) / 2 & < (u_j - 1)(u_j - 2) / 2 + (u_k + 1)u_k / 2. \end{aligned} $$

Thus, each move increases the number of collisions. We conclude that the number of collisions is minimized when $u_j = u / b$ for each $j \in B$.

By the above lemma, for any hash function, the total number of collisions must be at least $b(u / b)(u / b - 1) / 2$. The number of pairs of distinct elements is $\binom{u}{2} = u(u - 1) / 2$. Thus, the number of collisions per pair of distinct elements must be at least

$$ \begin{aligned} \frac{b(u / b)(u / b - 1) / 2}{u(u - 1) / 2} & = \frac{u / b - 1}{u - 1} \\ & > \frac{u / b - 1}{u} \\ & = \frac{1}{b} - \frac{1}{u}. \end{aligned} $$

Thus, the bound on the probability of a collision for any pair of distinct elements can be no less than $1 / b - 1 / u = 1 / |B| - 1 / |U|$.

11.3-6 $\star$

Let $U$ be the set of $n$-tuples of values drawn from $\mathbb Z_p$, and let $B = \mathbb Z_p$, where $p$ is prime. Define the hash function $h_b: U \rightarrow B$ for $b \in \mathbb Z_p$ on an input $n$-tuple $\langle a_0, a_1, \ldots, a_{n - 1} \rangle$ from $U$ as

$$h_b(\langle a_0, a_1, \ldots, a_{n - 1} \rangle) = \Bigg(\sum_{j = 0}^{n - 1} a_jb^j \Bigg) \mod p,$$

and let $\mathcal{H} = \{h_b : b \in \mathbb Z_p\}$. Argue that $\mathcal H$ is $((n - 1) / p)$-universal according to the definition of $\epsilon$-universal in Exercise 11.3-5. ($\textit{Hint:}$ See Exercise 31.4-4.)

Fix $b \in \mathbb Z_p$. By exercise 31.4-4, $h_b(x)$ collides with $h_b(y)$ for at most $n - 1$ other $y \in U$. Since there are a total of $p$ possible values that $h_b$ takes on, the probability that $h_b(x) = h_b(y)$ is bounded from above by $\frac{n - 1}{p}$, since this holds for any value of $b$, $\mathcal H$ is $((n - 1 ) /p)$-universal.