32.1 The naive string-matching algorithm

32.1-1

Show the comparisons the naive string matcher makes for the pattern $P = 0001$ in the text $T = 000010001010001$.

STRING-MATCHER(P, T, i)
    for j = i to i + P.length
        if P[j - i + 1] != T[j]
            return false
    return true

32.1-2

Suppose that all characters in the pattern $P$ are different. Show how to accelerate $\text{NAIVE-STRING-MATCHER}$ to run in time $O(n)$ on an $n$-character text $T$.

Suppose $T[i] \ne P[j]$, then for $k \in [1, j)$, $T[i - k] = P[j - k] \ne P[0]$, the $[i - k, i)$ are all invalid shifts which could be skipped, therefore we can compare $T[i]$ with $P[0]$ in the next iteration.

32.1-3

Suppose that pattern $P$ and text $T$ are randomly chosen strings of length $m$ and $n$, respectively, from the $d$-ary alphabet $\Sigma_d = \{ 0, 1, \ldots, d - 1 \}$, where $d \ge 2$. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is

$$(n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}} \le 2(n - m + 1)$$

over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

Suppose for each shift, the number of compared characters is $L$, then:

$$ \begin{aligned} \text E[L] & = 1 \cdot \frac{d - 1}{d} + 2 \cdot (\frac{1}{d})^1 \frac{d - 1}{d} + \cdots + m \cdot (\frac{1}{d})^{m - 1} \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m} \\ & = (1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m}) \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m}. \end{aligned} $$

$$ \begin{aligned} S & = 1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m - 1} \\ \frac{1}{d}S & = 1 \cdot (\frac{1}{d})^1 + \cdots + (m - 1) \cdot (\frac{1}{d})^{m - 1} + m \cdot (\frac{1}{d})^{m} \\ \frac{d - 1}{d}S & = 1 + (\frac{1}{d})^1 + \cdots + \cdot (\frac{1}{d})^{m - 1} - m \cdot (\frac{1}{d})^{m} \\ \frac{d - 1}{d}S & = \frac{1 - d^{-m}}{1 - d^{-1}} - m \cdot (\frac{1}{d})^{m}. \end{aligned} $$

$$ \begin{aligned} \text E[L] & = (1 + 2 \cdot (\frac{1}{d})^1 + \cdots + m \cdot (\frac{1}{d})^{m}) \frac{d - 1}{d} + m \cdot (\frac{1}{d})^{m} \\ & = \frac{1 - d^{-m}}{1 - d^{-1}} - m \cdot (\frac{1}{d})^{m} + m \cdot (\frac{1}{d})^{m} \\ & = \frac{1 - d^{-m}}{1 - d^{-1}}. \end{aligned} $$

There are $n - m + 1$ shifts, therefore the expected number of comparisons is:

$$(n - m + 1) \cdot \text E[L] = (n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}}$$

Since $d \ge 2$, $1 - d^{-1} \ge 0.5$, $1 - d^{-m} < 1$, and $ \frac{1 - d^{-m}}{1 - d^{-1}} \le 2$, therefore

$$(n - m + 1) \frac{1 - d^{-m}}{1 - d^{-1}} \le 2 (n - m + 1).$$

32.1-4

Suppose we allow the pattern $P$ to contain occurrences of a gap character $\diamond$ that can match an arbitrary string of characters (even one of zero length). For example, the pattern $ab\diamond ba\diamond c$ occurs in the text $cabccbacbacab$ as

$$c \underbrace{ab}_{ab} \underbrace{cc}_{\diamond} \underbrace{ba}_{ba} \underbrace{cba}_{\diamond} \underbrace{c}_{c} ab$$

and as

$$c \underbrace{ab}_{ab} \underbrace{ccbac}_{\diamond} \underbrace{ba}_{ba} \underbrace{\text{ }}_{\diamond} \underbrace{c}_{c} ab$$

Note that the gap character may occur an arbitrary number of times in the pattern but not at all in the text. Give a polynomial-time algorithm to determine whether such a pattern $P$ occurs in a given text $T$, and analyze the running time of your algorithm.

By using dynamic programming, the time complexity is $O(mn)$ where $m$ is the length of the text $T$ and $n$ is the length of the pattern $P$; the space complexity is $O(mn)$, too.

This problem is similar to LeetCode 44. WildCard Matching, except that it has no question mark (?) requirement. You can see my naive DP implementation here.