11.3 Hash functions

11.3-1¶

Suppose we wish to search a linked list of length $n$, where each element contains a key $k$ along with a hash value $h(k)$. Each key is a long character string. How might we take advantage of the hash values when searching the list for an element with a given key?

If every element also contained a hash of the long character string, when we are searching for the desired element, we'll first check if the hashvalue of the node in the linked list, and move on if it disagrees. This can increase the runtime by a factor proportional to the length of the long character strings.

11.3-2¶

Suppose that we hash a string of $r$ characters into $m$ slots by treating it as a radix-128 number and then using the division method. We can easily represent the number $m$ as a 32-bit computer word, but the string of $r$ characters, treated as a radix-128 number, takes many words. How can we apply the division method to compute the hash value of the character string without using more than a constant number of words of storage outside the string itself?

    sum = 0
    for i = 1 to r
        sum = (sum * 128 + s[i]) % m

Use sum as the key.

11.3-3¶

Consider a version of the division method in which $h(k) = k \mod m$, where $m = 2^p - 1$ and $k$ is a character string interpreted in radix $2^p$. Show that if we can derive string $x$ from string $y$ by permuting its characters, then $x$ and $y$ hash to the same value. Give an example of an application in which this property would be undesirable in a hash function.

We will show that each string hashes to the sum of it's digits $\mod 2^p − 1$. We will do this by induction on the length of the string.

Base case

Suppose the string is a single character, then the value of that character is the value of $k$ which is then taken $\mod m$.
Inductive step.

Let $w = w_1w_2$ where $|w_1| \ge 1$ and $|w_2| = 1$. Suppose $h(w_1) = k_1$. Then, $h(w) = h(w_1)2^p + h(w_2) \mod 2^p − 1 = h(w_1) + h(w_2) \mod 2^p − 1$. So, since $h(w_1)$ was the sum of all but the last digit $\mod m$, and we are adding the last digit $\mod m$, we have the desired conclusion.

11.3-4¶

Consider a hash table of size $m = 1000$ and a corresponding hash function $h(k) = \lfloor m (kA \mod 1) \rfloor$ for $A = (\sqrt 5 - 1) / 2$. Compute the locations to which the keys $61$, $62$, $63$, $64$, and $65$ are mapped.

$h(61) = \lfloor 1000(61 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 700$.
$h(62) = \lfloor 1000(62 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 318$.
$h(63) = \lfloor 1000(63 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 936$.
$h(64) = \lfloor 1000(64 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 554$.
$h(65) = \lfloor 1000(65 \cdot \frac{\sqrt 5 - 1}{2} \mod 1) \rfloor = 172$.

11.3-5 $\star$¶

Define a family $\mathcal H$ of hash functions from a finite set $U$ to a finite set $B$ to be $\epsilon$-universal if for all pairs of distinct elements $k$ and $l$ in $U$,

$$\Pr\{h(k) = h(l)\} \le \epsilon,$$

where the probability is over the choice of the hash function $h$ drawn at random from the family $\mathcal H$. Show that an $\epsilon$-universal family of hash functions must have

$$\epsilon \ge \frac{1}{|B|} - \frac{1}{|U|}.$$

As a simplifying assumption, assume that $|B|$ divides $|U|$. It's just a bit messier if it doesn't divide evenly.

Suppose to a contradiction that $\epsilon > \frac{1}{|B|} - \frac{1}{|U|}$. This means that $\forall$ pairs $k, \ell \in U$, we have that the number $n_{k, \ell}$ of hash functions in $\mathcal H$ that have a collision on those two elements satisfies $n_{k, \ell} \le \frac{|\mathcal H}{|B|} - \frac{|\mathcal H}{|U|}$. So, summing over all pairs of elements in $U$, we have that the total number is $\le \frac{|\mathcal H||U|^2}{2|B|} - \frac{|\mathcal H||U|}{2}$.

Any particular hash function must have that there are at least $|B|\binom{|U| / |B|}{2} = |B|\frac{|U|^2 - |U||B|}{2|B|^2} = \frac{|U|^2}{2|B|} - \frac{|U|}{2}$ colliding pairs for that hash function, summing over all hash functions, we get that there are at least $|\mathcal H| \left(\frac{|U|^2}{2|B|} - \frac{|U|}{2}\right)$ colliding pairs total. Since we have that there are at most some number less than this many, we have a contradiction, and so must have the desired restriction on $\epsilon$.

11.3-6 $\star$¶

Let $U$ be the set of $n$-tuples of values drawn from $\mathbb Z_p$, and let $B = \mathbb Z_p$, where $p$ is prime. Define the hash function $h_b: U \rightarrow B$ for $b \in \mathbb Z_p$ on an input $n$-tuple $\langle a_0, a_1, \ldots, a_{n - 1} \rangle$ from $U$ as

$$h_b(\langle a_0, a_1, \ldots, a_{n - 1} \rangle) = \Bigg(\sum_{j = 0}^{n - 1} a_jb^j \Bigg) \mod p,$$

and let $\mathcal{H} = \{h_b : b \in \mathbb Z_p\}$. Argue that $\mathcal H$ is $((n - 1) / p)$-universal according to the definition of $\epsilon$-universal in Exercise 11.3-5. ($\textit{Hint:}$ See Exercise 31.4-4.)

Fix $b \in \mathbb Z_p$. By exercise 31.4-4, $h_b(x)$ collides with $h_b(y)$ for at most $n - 1$ other $y \in U$. Since there are a total of $p$ possible values that $h_b$ takes on, the probability that $h_b(x) = h_b(y)$ is bounded from above by $\frac{n - 1}{p}$, since this holds for any value of $b$, $\mathcal H$ is $((n - 1 ) /p)$-universal.