15-5 Edit distance

In order to transform one source string of text $x[1..m]$ to a target string $y[1..n]$, we can perform various transformation operations. Our goal is, given $x$ and $y$, to produce a series of transformations that change $x$ to $y$. We use an array $z$—assumed to be large enough to hold all the characters it will need—to hold the intermediate results. Initially, $z$ is empty, and at termination, we should have $z[j] = y[j]$ for $j = 1, 2, \ldots, n$. We maintain current indices $i$ into $x$ and $j$ into $z$, and the operations are allowed to alter $z$ and these indices. Initially, $i = j = 1$. We are required to examine every character in $x$ during the transformation, which means that at the end of the sequence of transformation operations, we must have $i = m + 1$.

We may choose from among six transformation operations:

Copy a character from $x$ to $z$ by setting $z[j] = x[i]$ and then incrementing both $i$ and $j$. This operation examines $x[i]$.

Replace a character from $x$ by another character $c$, by setting $z[j] = c$, and then incrementing both $i$ and $j$. This operation examines $x[i]$.

Delete a character from $x$ by incrementing $i$ but leaving $j$ alone. This operation examines $x[i]$.

Insert the character $c$ into $z$ by setting $z[j] = c$ and then incrementing $j$, but leaving $i$ alone. This operation examines no characters of $x$.

Twiddle (i.e., exchange) the next two characters by copying them from $x$ to $z$ but in the opposite order; we do so by setting $z[j] = x[i + 1]$ and $z[j + 1] = x[i]$ and then setting $i = i + 2$ and $j = j + 2$. This operation examines $x[i]$ and $x[i + 1]$.

Kill the remainder of $x$ by setting $i = m + 1$. This operation examines all characters in $x$ that have not yet been examined. This operation, if performed, must be the final operation.

As an example, one way to transform the source string $\text{algorithm}$ to the target string $\text{altruistic}$ is to use the following sequence of operations, where the underlined characters are $x[i]$ and $z[j]$ after the operation:

$$ \begin{array}{lll} \text{Operation} & x & z \\ \hline \textit{initial strings} & \underline algorithm & \text{\textunderscore} \\ \text{copy} & a\underline lgorithm & a\text{\textunderscore} \\ \text{copy} & al\underline gorithm & al\text{\textunderscore} \\ \text{replace by $t$} & alg\underline orithm & alt\text{\textunderscore} \\ \text{delete} & algo\underline rithm & alt\text{\textunderscore} \\ \text{copy} & algor\underline ithm & altr\text{\textunderscore} \\ \text{insert $u$} & algor\underline ithm & altru\text{\textunderscore} \\ \text{insert $i$} & algor\underline ithm & altrui\text{\textunderscore} \\ \text{insert $s$} & algor\underline ithm & altruis\text{\textunderscore} \\ \text{twiddle} & algorit\underline hm & altruisti\text{\textunderscore} \\ \text{insert $c$} & algorit\underline hm & altruistic\text{\textunderscore} \\ \text{kill} & algorithm\text{\textunderscore} & altruistic\text{\textunderscore} \end{array} $$

Note that there are several other sequences of transformation operations that transform $\text{algorithm}$ to $\text{altruistic}$.

Each of the transformation operations has an associated cost. The cost of an operation depends on the specific application, but we assume that each operation's cost is a constant that is known to us. We also assume that the individual costs of the copy and replace operations are less than the combined costs of the delete and insert operations; otherwise, the copy and replace operations would not be used. The cost of a given sequence of transformation operations is the sum of the costs of the individual operations in the sequence. For the sequence above, the cost of transforming $\text{algorithm}$ to $\text{altruistic}$ is

$$\text{($3 \cdot$ cost(copy)) + cost(replace) + cost(delete) + ($4 \cdot$ cost(insert)) + cost(twiddle) + cost(kill)}.$$

a. Given two sequences $x[1..m]$ and $y[1..n]$ and set of transformation-operation costs, the edit distance from $x$ to $y$ is the cost of the least expensive operatoin sequence that transforms $x$ to $y$. Describe a dynamic-programming algorithm that finds the edit distance from $x[1..m]$ to $y[1..n]$ and prints an optimal opeartion sequence. Analyze the running time and space requirements of your algorithm.

The edit-distance problem generalizes the problem of aligning two DNA sequences (see, for example, Setubal and Meidanis [310, Section 3.2]). There are several methods for measuring the similarity of two DNA sequences by aligning them. One such method to align two sequences $x$ and $y$ consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting sequences $x'$ and $y'$ have the same length but do not have a space in the same position (i.e., for no position $j$ are both $x'[j]$ and $y'[j]$ a space). Then we assign a "score" to each position. Position $j$ receives a score as follows:

  • $+1$ if $x'[j] = y'[j]$ and neither is a space,
  • $-1$ if $x'[j] \ne y'[j]$ and neither is a space,
  • $-2$ if either $x'[j]$ or $y'[j]$ is a space.

The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences $x = \text{GATCGGCAT}$ and $y = \text{CAATGTGAATC}$, one alignment is

$$ \begin{array}{cccccccccccc} \text G & & \text A & \text T & \text C & \text G & & \text G & \text C & \text A & \text T & \\ \text C & \text A & \text A & \text T & & \text G & \text T & \text G & \text A & \text A & \text T & \text C \\ - & * & + & + & * & + & * & + & - & + & + & * \end{array} $$

A $+$ under a position indicates a score of $+1$ for that position, a $-$ indicates a score of $-1$, and a $*$ indicates a score of $-2$, so that this alignment has a total score of $6 \cdot -2 \cdot 1 - 4 \cdot 2 = -4$.

b. Explain how to cast the problem of finding an optimal alignment as an edit distance problem using a subset of the transformation operations copy, replace, delete, insert, twiddle, and kill.

a. Dynamic programming is the ticket. This problem is slightly similar to the longest-common-subsequence problem. In fact, we'll define the notational conveniences $X_i$ and $Y_j$ in the similar manner as we did for the LCS problem: $X_i = x[1..i]$ and $Y_j = y[1..j]$.

Our subproblems will be determining an optimal sequence of operations that converts $X_i$ to $Y_j$, for $0 \le i \le m$ and $0 \le j \le n$. We'll call this the "$X_i \to Y_j$ problem." The original problem is the $X_m \to Y_n$ problem.

Let's suppose for the moment that we know what was the last operation used to convert $X_i$ to $Y_j$. There are six possibilities. We denote by $c[i, j]$ the cost of an optimal solution to the $X_i \to Y_j$ problem.

  • If the last operation was a copy, then we must have had $x[i] = y[j]$. The subproblem that remains is converting $X_{i - 1}$ to $Y_{j - 1}$. And an optimal solution to the $X_i \to Y_j$ problem must include an optimal solution to the $X_{i - 1} \to Y_{j - 1}$ problem. The cut-and-paste argument applies. Thus, assuming that the last operation was a copy, we have

    $$c[i, j] = c[i - 1, j - 1] + \text{cost(copy)}.$$

  • If it was a replace, then we must have had $x[i] \ne y[j]$. (Here, we assume that we cannot replace a character with itself. It is a straightforward modification if we allow replacement of a character with itself.) We have the same optimal substructure argument as for copy, and assuming that the last operation was a replace, we have

    $$c[i, j] = c[i - 1, j - 1] + \text{cost(replace)}.$$

  • If it was a twiddle, then we must have had both $x[i] = y[j - 1]$ and $x[i - 1] = y[j]$, along with the implicit assumption that $i, j \ge 2$. Now our subproblem is $X_{i - 2} \to Y_{j - 2}$ and, assuming that the last operation was a twiddle, we have

    $$c[i, j] = c[i - 2, j - 2] + \text{cost(twiddle)}.$$

  • If it was a delete, then we have no restrictions on $x$ or $y$. Since we can view delete as removing a character from $X_i$ and leaving $Y_j$ alone, our subproblem is $X_{i - 1} \to Y_j$. Assuming that the last operation was a delete, we have

    $$c[i, j] = c[i - 1, j] + \text{cost(delete)}.$$

  • If it was an insert, then we have no restrictions on $x$ or $y$. Our subproblem is $X_ i \to Y_{j - 1}$. Assuming that the last operation was an insert, we have

    $$c[i, j] = c[i, j - 1] + \text{cost(insert)}.$$

  • If it was a kill, then we had to have completed converting $X_m$ to $Y_n$, so that the current problem must be the $X_m \to Y_n$ problem. In other words, we must have $i = m$ and $j = n$. If we think of a kill as a multiple delete, we can get any $X_i \to Y_n$, where $0 \le i < m$, as a subproblem. We pick the best one, and so assuming that the last operation was a kill, we have

    $$c[m, n] = \min_{0 \le i < m}{c[i, n]} + \text{cost(kill)}.$$

We have not handled the base cases, in which $i = 0$ or $j = 0$. These are easy. $X_0$ and $Y_0$ are the empty strings. We convert an empty string into $Y_j$ by a sequence of $j$ inserts, so that $c[0, j] = j \cdot \text{cost(insert)}$. Similarly, we convert $X_i$ into $Y_0$ by a sequence of $i$ deletes, so that $c[i, 0] = i \cdot \text{cost(delete)}$. When $i = j = 0$, either formula gives us $c[0, 0] = 0$, which makes sense, since there's no cost to convert the empty string to the empty string.

For $i, j > 0$, our recursive formulation for $c[i, j]$ applies the above formulas in the situations in which they hold:

$$ c[i, j] = \min \begin{cases} c[i - 1, j - 1] + \text{cost(copy)} & \text{if $x[i] = y[j]$}, \\ c[i - 1, j - 1] + \text{cost(replace)} & \text{if $x[i] \ne y[j]$}, \\ c[i - 2, j - 2] + \text{cost(twiddle)} & \text{if $i, j \ge 2$}, \\ & x[i] = y[j - 1], \\ & \text{and $x[i - 1] = y[j]$}, \\ c[i - 1, j] + \text{cost(delete)} & \text{always}, \\ c[i, j] = c[i, j - 1] + \text{cost(insert)} & \text{always}, \\ \min\limits_{0 \le i < m} {c[i, n]} + \text{cost(kill)} & \text{if $i = m$ and $j = n$}. \end{cases} $$

Like we did for LCS, our pseudocode fills in the table in row-major order, i.e., row-by-row from top to bottom, and left to right within each row. Columnmajor order (column-by-column from left to right, and top to bottom within each column) would also work. Along with the $c[i, j]$ table, we fill in the table $op[i, j]$, holding which operation was used.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
EDIT-DISTANCE(x, y, m, n)
    let c[0..m, 0..n] and op[0..m, 0..n] be new arrays
    for i = 0 to m
        c[i, 0] = i * cost(delete)
        op[i, 0] = DELETE
    for j = 0 to n
        c[0, j] = j * cost(insert)
        op[0, j] = INSERT
    for i = 1 to m
        for j = 1 to n
            c[i, j] = 
            if x[i] == y[j]
                c[i, j] = c[i - 1, j - 1] + cost(copy)
                op[i, j] = COPY
            if x[i] != y[j] and c[i - 1, j - 1] + cost(replace) < c[i, j]
                c[i, j] = c[i - 1, j - 1] + cost(replace)
                op[i, j] = REPLACE(by y[j])
            if i  2 and j  2 and x[i] == y[j - 1] and x[i - 1] == y[j] and c[i - 2, j - 2] + cost(twiddle) < c[i, j]
                c[i, j] = c[i - 2, j - 2] + cost(twiddle)
                op[i, j] = TWIDDLE
            if c[i - 1, j] + cost(delete) < c[i, j]
                c[i, j] = c[i - 1, j] + cost(delete)
                op[i, j] = DELETE
            if c[i, j - 1] + cost(insert) < c[i, j]
                c[i, j] = c[i, j - 1] + cost(insert)
                op[i, j] = INSERT(y[j])
    for i = 0 to m - 1
        if c[i, n] + cost(kill) < c[m, n]
            c[m, n] = c[i, n] + cost(kill)
            op[m, n] = KILL i
    return c and op

The time and space are both $\Theta(mn)$. If we store a $\text{KILL}$ operation in $op[m, n]$, we also include the index $i$ after which we killed, to help us reconstruct the optimal sequence of operations. (We don't need to store $y[i]$ in the $op$ table for replace or insert operations.)

To reconstruct this sequence, we use the $op$ table returned by $\text{EDIT-DISTANCE}$. The procedure $\text{OP-SEQUENCE}(op, i, j)$ reconstructs the optimal operation sequence that we found to transform $X_i$ into $Y_j$. The base case is when $i = j = 0$. The first call is $\text{OP-SEQUENCE}(op, m, n)$.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
OP-SEQUENCE(op, i, j)
    if i == 0 and j = 0
        return
    if op[i, j] == COPY or op[i, j] = REPLACE
        i' = i - 1
        j' = j - 1
    else if op[i, j] == TWIDDLE
        i' = i - 2
        j' = j - 2
    else if op[i, j] == DELETE
        i' = i - 1
        j' = j
    else if op[i, j] == INSERT      // don't care yet what character is inserted
        i' = i
        j' = j - 1
    else                            // must be KILL, and must have i = m, and j = n
        let op[i, j] == KILL k
        i' = k
        j' = j
    OP-SEQUENCE(op, i', j')
    print op[i, j]

This procedure determines which subproblem we used, recurses on it, and then prints its own last operation.

b. The DNA-alignment problem is just the edit-distance problem, with

$$ \begin{array}{rcl} \text{cost(copy)} & = & -1, \\ \text{cost(replace)} & = & +1, \\ \text{cost(delete)} & = & +2, \\ \text{cost(insert)} & = & +2, \end{array} $$

and the twiddle and kill operations are not permitted.

The score that we are trying to maximize in the DNA-alignment problem is precisely the negative of the cost we are trying to minimize in the edit-distance problem. The negative cost of copy is not an impediment, since we can only apply the copy operation when the characters are equal.