15-9 Breaking a string

A certain string-processing language allows a programmer to break a string into two pieces. Because this operation copies the string, it costs $n$ time units to break a string of $n$ characters into two pieces. Suppose a programmer wants to break a string into many pieces. The order in which the breaks occur can affect the total amount of time used. For example, suppose that the programmer wants to break a $20$-character string after characters $2$, $8$, and $10$ (numbering the characters in ascending order from the left-hand end, starting from $1$). If she programs the breaks to occur in left-to-right order, then the first break costs $20$ time units, the second break costs $18$ time units (breaking the string from characters $3$ to $20$ at character $8$), and the third break costs $12$ time units, totaling $50$ time units. If she programs the breaks to occur in right-to-left order, however, then the first break costs $20$ time units, the second break costs $10$ time units, and the third break costs $8$ time units, totaling $38$ time units. In yet another order, she could break first at $8$ (costing $20$), then break the left piece at $2$ (costing $8$), and finally the right piece at $10$ (costing $12$), for a total cost of $40$.

Design an algorithm that, given the numbers of characters after which to break, determines a least-cost way to sequence those breaks. More formally, given a string $S$ with $n$ characters and an array $L[1..m]$ containing the break points, com- pute the lowest cost for a sequence of breaks, along with a sequence of breaks that achieves this cost.

Our first step will be to identify the subproblems that satisfy the optimalsubstructure property. Before we frame the subproblem, we make two simplifying modifications to the input:

  • We sort $L$ so that the indices in $L$ are in ascending order.
  • We prepend the index $0$ to the beginning of $L$ and append n to the end of $L$.

Let $L[i..j]$ denote a subarray of $L$ that starts from index $i$ and ends at index $j$. Define the subproblem denoted by $(i, j)$ as "What is the cheapest sequence of breaks to break the substring $S[L[i] + 1..L[j]]$?" Note that the first and last elements of the subarray $L[i..j]$ define the ends of the substring, and we have to worry about only the indices of the subarray $L[i + 1..j - 1]$.

For example, let $L = \langle 20, 17, 14, 11, 25 \rangle$ and $n = 30$. First, we sort $L$. Then, we prepend $0$ and append $n$ as explained to get $L = \langle 0, 11, 14, 17, 20, 25, 30 \rangle$. Now, what is the subproblem $(2, 6)$? We obtain a substring by breaking $S$ after character $L[2] = 11$ and character $L[6] = 25$. We ask "What is the cheapest sequence of breaks to break the substring $S[12..25]$?" We have to worry about only indices in the subarray $L[3..5] = \langle 14, 17, 20 \rangle$, since the other indices are not present in the substring.

At this point, the problem looks similar to matrix-chain multiplication (see Section 15.2). We can make the first break at any element of $L[i + 1..j - 1]$.

Suppose that an optimal sequence of breaks $\sigma$ for subproblem $(i, j)$ makes the first break at $L[k]$, where $i < k < j$. This break gives rise to two subproblems:

  • The "prefix" subproblem $(i, k)$, covering the subarray $L[i + 1..k - 1]$,
  • The "suffix" subproblem $(k, j)$, covering the subarray $L[k + 1..j - 1]$.

The overall cost can be expressed as the sum of the length of the substring, the prefix cost, and the suffix cost.

We show optimal substructure by claiming that the sequence of breaks in for the prefix subproblem $(i, k)$ must be an optimal one. Why? If there were a less costly way to break the substring $S[L[i] + 1..L[k]]$ represented by the subproblem $(i, k)$, then substituting that sequence of breaks in $\sigma$ would produce another sequence of breaks whose cost is lower than that of $\sigma$, which would be a contradiction. A similar observation holds for the sequence of breaks for the suffix subproblem $(k, j)$: it must be an optimal sequence of breaks.

Let $cost[i, j]$ denote the cost of the cheapest solution to subproblem $(i, j)$. We write the recurrence relation for cost as

$$ cost[i, j] = \begin{cases} 0 & \text{if $j - i \le 1$}, \\ \min\limits_{i < k < j} {cost[i, k] + cost[k, j] + (L[j] - L[i])} & \text{if $j - i > 1$}. \end{cases} $$

Thus, our approach to solving the subproblem $(i, j)$ will be to try to split the respective substring at all possible values of $k$ and then choosing a break that results in the minimum cost. We need to be careful to solve smaller subproblems before we solve larger subproblems. In particular, we solve subproblems in increasing order of the length $j - 1$.

    prepend 0 to the start of L and append n to the end of L
    m = L.length
    sort L into increasing order
    let cost[1..m, 1..m] and break[1..m, 1..m] be new tables
    for i = 1 to m - 1
        cost[i, i] = cost[i, i + 1] = 0
    cost[m, m] = 0
    for len = 3 to m
        for i = 1 to m - len + 1
            j = i + len - 1
            cost[i, j] = 
            for k = i + 1 to j - 1
                if cost[i, k] + cost[k, j] < cost[i, j]
                    cost[i, j] = cost[i, k] + cost[k, j]
                    break[i, j] = k
            cost[i, j] = cost[i, j] + L[j] - L[i]
    print "The minimum cost of breaking the string is" cost[1, m]
    PRINT-BREAKS(L, break, 1, m)

After sorting $L$, we initialize the base cases, in which $i = j$ or $j = i + 1$.

The nested for loops represent the main computation. The outermost for loop runs for $len = 3$ to $m$, which means that we need to consider subarrays of $L$ with length at least $3$, since the first and the last element define the substring, and we need at least one more element to specify a break. The increasing values of $len$ also ensures that we solve subproblems with smaller length before we solve subproblems with greater length.

The inner for loop on $i$ runs from $1$ to $m - len + 1$. The upper bound of $m - len + 1$ is the largest value that the start index $i$ can take such that $i + len - 1 \le m$.

In the innermost for loop, we try each possible location k as the place to make the first break for subproblem $(i, j)$. The first such place is $L[i + 1]$, and not $L[i]$, since $L[i]$ represents the start of the substring (and thus not a valid place for a break). Similarly, the last valid place is $L[j - 1]$, because $L[j]$ represents the end of the substring.

The if condition tests whether $k$ is the best place for a break found so far, and it updates the best value in $cost[i, j]$ if so. We use $break[i, j]$ to record that the best place for the first break is $k$. Specifically, if $break[i, j] = k$, then an optimal sequence of breaks for $(i, j)$ makes the first break at $L[k]$.

Finally, we add the length of the substring $L[j] - L[i]$ to $cost[i, j]$ because, irrespective of what we choose as the first break, it costs us a price equal to the length of the substring to make a break.

The lowest cost for the original problem ends up in $cost[1, m]$. By our initialization, $L[1] = 0$ and $L[m] = n$. Thus, $cost[1, m]$ will hold the optimum price of cutting the substring from $L[1] + 1 = 1$ to $L[m] = n$, which is the entire string.

The running time is $\Theta(m^3)$, and it is dictated by the three nested for loops. They fill in the entries above the main diagonal of the two tables, except for entries in which $j = i + 1$. That is, they fill in rows $i = 1, 2, \ldots, m - 2$, entries $j = i + 2, i + 3, \ldots, m$. When filling in entry $[i, j]$, we check values of $k$ running from $i + 1$ to $j - 1$, or $j - i - 1$ entries. Thus, the total number of iterations of the innermost for loop is

$$ \begin{aligned} \sum{i = 1}^{m - 2} \sum{j = i + 2}^m (j - i - 1) & = \sum_{i = 1}^{m - 2} \sum{d = 1}^{m - i - 1} d & \text{($d = j - i - 1$)} \\ & = \sum_{i = 1}^{m - 2} \Theta((m - i)^2) & \text{(equation (A.2))} \\ & = \sum_{h = 2}^{m - 1} \Theta(h^2) & (h = m - i) \\ & = \Theta(m^3). & \text{(equation (A.3))} \end{aligned} $$

Since each iteration of the innermost for loop takes constant time, the total running time is $\Theta(m^3)$. Note in particular that the running time is independent of the length of the string $n$.

PRINT-BREAKS(L, break, i, j)
    if j - i  2
        print "Break at" L[k]
        PRINT-BREAKS(L, break, i, k)
        PRINT-BREAKS(L, break, k, j)

$\text{PRINT-BREAKS}$ uses the information stored in $break$ to print out the actual sequence of breaks.