15-5 Edit distance

In order to transform one source string of text $x[1..m]$ to a target string $y[1..n]$, we can perform various transformation operations. Our goal is, given $x$ and $y$, to produce a series of transformations that change $x$ to $y$. We use an array $z$—assumed to be large enough to hold all the characters it will need—to hold the intermediate results. Initially, $z$ is empty, and at termination, we should have $z[j] = y[j]$ for $j = 1, 2, \ldots, n$. We maintain current indices $i$ into $x$ and $j$ into $z$, and the operations are allowed to alter $z$ and these indices. Initially, $i = j = 1$. We are required to examine every character in $x$ during the transformation, which means that at the end of the sequence of transformation operations, we must have $i = m + 1$.

We may choose from among six transformation operations:

Copy a character from $x$ to $z$ by setting $z[j] = x[i]$ and then incrementing both $i$ and $j$. This operation examines $x[i]$.

Replace a character from $x$ by another character $c$, by setting $z[j] = c$, and then incrementing both $i$ and $j$. This operation examines $x[i]$.

Delete a character from $x$ by incrementing $i$ but leaving $j$ alone. This operation examines $x[i]$.

Insert the character $c$ into $z$ by setting $z[j] = c$ and then incrementing $j$, but leaving $i$ alone. This operation examines no characters of $x$.

Twiddle (i.e., exchange) the next two characters by copying them from $x$ to $z$ but in the opposite order; we do so by setting $z[j] = x[i + 1]$ and $z[j + 1] = x[i]$ and then setting $i = i + 2$ and $j = j + 2$. This operation examines $x[i]$ and $x[i + 1]$.

Kill the remainder of $x$ by setting $i = m + 1$. This operation examines all characters in $x$ that have not yet been examined. This operation, if performed, must be the final operation.

As an example, one way to transform the source string $\text{algorithm}$ to the target string $\text{altruistic}$ is to use the following sequence of operations, where the underlined characters are $x[i]$ and $z[j]$ after the operation:

$$ \begin{array}{lll} \text{Operation} & x & z \\ \hline \textit{initial strings} & \underline algorithm & \text{\textunderscore} \\ \text{copy} & a\underline lgorithm & a\text{\textunderscore} \\ \text{copy} & al\underline gorithm & al\text{\textunderscore} \\ \text{replace by $t$} & alg\underline orithm & alt\text{\textunderscore} \\ \text{delete} & algo\underline rithm & alt\text{\textunderscore} \\ \text{copy} & algor\underline ithm & altr\text{\textunderscore} \\ \text{insert $u$} & algor\underline ithm & altru\text{\textunderscore} \\ \text{insert $i$} & algor\underline ithm & altrui\text{\textunderscore} \\ \text{insert $s$} & algor\underline ithm & altruis\text{\textunderscore} \\ \text{twiddle} & algorit\underline hm & altruisti\text{\textunderscore} \\ \text{insert $c$} & algorit\underline hm & altruistic\text{\textunderscore} \\ \text{kill} & algorithm\text{\textunderscore} & altruistic\text{\textunderscore} \end{array} $$

Note that there are several other sequences of transformation operations that transform $\text{algorithm}$ to $\text{altruistic}$.

Each of the transformation operations has an associated cost. The cost of an operation depends on the specific application, but we assume that each operation's cost is a constant that is known to us. We also assume that the individual costs of the copy and replace operations are less than the combined costs of the delete and insert operations; otherwise, the copy and replace operations would not be used. The cost of a given sequence of transformation operations is the sum of the costs of the individual operations in the sequence. For the sequence above, the cost of transforming $\text{algorithm}$ to $\text{altruistic}$ is

$$\text{($3 \cdot$ cost(copy)) + cost(replace) + cost(delete) + ($4 \cdot$ cost(insert)) + cost(twiddle) + cost(kill)}.$$

a. Given two sequences $x[1..m]$ and $y[1..n]$ and set of transformation-operation costs, the edit distance from $x$ to $y$ is the cost of the least expensive operatoin sequence that transforms $x$ to $y$. Describe a dynamic-programming algorithm that finds the edit distance from $x[1..m]$ to $y[1..n]$ and prints an optimal opeartion sequence. Analyze the running time and space requirements of your algorithm.

The edit-distance problem generalizes the problem of aligning two DNA sequences (see, for example, Setubal and Meidanis [310, Section 3.2]). There are several methods for measuring the similarity of two DNA sequences by aligning them. One such method to align two sequences $x$ and $y$ consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting sequences $x'$ and $y'$ have the same length but do not have a space in the same position (i.e., for no position $j$ are both $x'[j]$ and $y'[j]$ a space). Then we assign a "score" to each position. Position $j$ receives a score as follows:

  • $+1$ if $x'[j] = y'[j]$ and neither is a space,
  • $-1$ if $x'[j] \ne y'[j]$ and neither is a space,
  • $-2$ if either $x'[j]$ or $y'[j]$ is a space.

The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences $x = \text{GATCGGCAT}$ and $y = \text{CAATGTGAATC}$, one alignment is

$$ \begin{array}{cccccccccccc} \text G & & \text A & \text T & \text C & \text G & & \text G & \text C & \text A & \text T & \\ \text C & \text A & \text A & \text T & & \text G & \text T & \text G & \text A & \text A & \text T & \text C \\ - & * & + & + & * & + & * & + & - & + & + & * \end{array} $$

A $+$ under a position indicates a score of $+1$ for that position, a $-$ indicates a score of $-1$, and a $*$ indicates a score of $-2$, so that this alignment has a total score of $6 \cdot -2 \cdot 1 - 4 \cdot 2 = -4$.

b. Explain how to cast the problem of finding an optimal alignment as an edit distance problem using a subset of the transformation operations copy, replace, delete, insert, twiddle, and kill.

a. We will index our subproblems by two integers, $1 \le i \le m$ and $1 \le j \le n$. We will let $i$ indicate the rightmost element of $x$ we have not processed and $j$ indicate the rightmost element of $y$ we have not yet found matches for. For a solution, we call $\text{EDIT}(x, y, i, j)$.

b. We will set

$$\text{cost(delete)} = \text{cost(insert)} = 2,$$

$$\text{cost(copy)} = −1,$$

$$\text{cost(replace)} = 1,$$

and

$$\text{cost(twiddle)} = \text{cost(kill)} = \infty.$$

Then a minimum cost translation of the first string into the second corresponds to an alignment.

We view

  • a $\text{copy}$ or a $\text{replace}$ as incrementing a pointer for both strings,
  • a $\text{insert}$ as putting a space at the current position of the pointer in the first string, and
  • a $\text{delete}$ operation means putting a space in the current position in the second string.

Since $\text{twiddle}$s and $\text{kill}$s have infinite costs, we will have neither of them in a minimal cost solution. The final value for the alignment will be the negative of the minimum cost sequence of edits.

EDIT(x, y, i, j)
    let m = x.length
    let n = y.length
    if i == m
        return (n - j)cost(insert)
    if j == n
        return min{(m - i)cost(delete), cost(kill)}
    initialize o1, ..., o5 to 
    if x[i] == y[j]
        o1 = cost(copy) + EDIT(x, y, i + 1, j + 1)
    o2 = cost(replace) + EDIT(x, y, i + 1, j + 1)
    o3 = cost(delete) + EDIT(x, y, i + 1, j)
    o4 = cost(insert) + EDIT(x, y, i, j + 1)
    if i < m - 1 and j < n - 1
        if x[i] == y[j + 1] and x[i + 1] == y[j]
            o5 = cost(twiddle) + EDIT(x, y, i + 2, j + 2)
    return min_{i  [5]}{o_i}