# Levenshtein Distance

A few weeks ago I posted about One Way Changes, which
was one of the coding challenges in Cracking the Coding
Interview. A *One Way Change*
check tests whether two words, such as “flood” and “floor”, are one change
away from each other, where a change is removing, inserting, or substituting
one character for another.
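For context, a one-way-change check can be sketched like this (a rough sketch of the idea, not the book's code; the `one_away` name is my own):

```python
def one_away(a, b):
    # Words more than one length apart can never be one edit away.
    if abs(len(a) - len(b)) > 1:
        return False
    # Make a the shorter (or equal-length) word.
    if len(a) > len(b):
        a, b = b, a
    i = j = 0
    seen_difference = False
    while i < len(a) and j < len(b):
        if a[i] != b[j]:
            if seen_difference:
                return False
            seen_difference = True
            # Same length: a substitution, so advance both pointers;
            # different length: an insertion, so only advance the longer word.
            if len(a) == len(b):
                i += 1
            j += 1
        else:
            i += 1
            j += 1
    return True
```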

I’ve been thinking a lot about search recently, and learning about where the
industry is at with NLP. One of the first things I ran into was an exercise
for calculating how many inserted, removed, or substituted characters it takes
to get from one word to another. This is the *minimum edit distance* or the
*Levenshtein distance* between two words, and it seemed strange to me that CtCI
didn’t include this problem in the Recursion and Dynamic Programming chapter,
given that it superficially feels really close to the *One Way Change*
exercise.

Surprisingly though, the easiest way to calculate the Levenshtein distance
doesn’t involve actually inserting, removing, or substituting characters in
your first string at all. I guess that shouldn’t be so surprising given how many
possible characters you could potentially try, but my first
intuition was to think about how I could reuse my solution to the *One
Way Changes* problem.

A far better solution involves calculating how close each prefix of one string is to each prefix of the other. The algorithm I ended up implementing is essentially the Wagner-Fischer algorithm, which iteratively builds up a matrix of distances between all possible prefixes. Here was my solution:

```
def lev_distance(s1, s2, lev_cost=1):
    def lev_sub(c1, c2):
        # Substitution cost: 0 if the characters match, lev_cost otherwise.
        if c1 != c2:
            return lev_cost
        return 0

    # Pad each string so the matrix has a row and column for the empty prefix.
    s1 = " " + s1
    s2 = " " + s2
    n = len(s1)
    m = len(s2)
    matrix = [[0 for x in range(n)] for y in range(m)]
    # The first row and column hold the distance from the empty string.
    for x in range(n):
        matrix[0][x] = x
    for y in range(m):
        matrix[y][0] = y
    for y in range(1, m):
        for x in range(1, n):
            matrix[y][x] = min(
                matrix[y][x-1] + 1,                        # insertion
                matrix[y-1][x] + 1,                        # removal
                matrix[y-1][x-1] + lev_sub(s1[x], s2[y]))  # substitution
    return matrix[m-1][n-1]
```
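As a sanity check on the matrix version, the same recurrence can be written top-down with memoization (my own sketch; the `lev_recursive` name is made up):

```python
from functools import lru_cache

def lev_recursive(s1, s2):
    # d(i, j) is the distance between the first i characters of s1
    # and the first j characters of s2 -- the same recurrence as the matrix.
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j  # j insertions to build the prefix of s2
        if j == 0:
            return i  # i removals to empty the prefix of s1
        return min(
            d(i, j - 1) + 1,                             # insertion
            d(i - 1, j) + 1,                             # removal
            d(i - 1, j - 1) + (s1[i - 1] != s2[j - 1]))  # substitution
    return d(len(s1), len(s2))
```

Both versions agree: `lev_recursive("bitten", "knitting")` gives the same answer as `lev_distance("bitten", "knitting")`.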

The first part of the function:

```
matrix = [[0 for x in range(n)] for y in range(m)]
for x in range(n):
    matrix[0][x] = x
for y in range(m):
    matrix[y][0] = y
```

creates a zero matrix and then writes the character indices of each word into
the first row and first column of the matrix. If we had the two words
**bitten** and **knitting**, we would start with:

```
[[0, 1, 2, 3, 4, 5, 6],
 [1, 0, 0, 0, 0, 0, 0],
 [2, 0, 0, 0, 0, 0, 0],
 [3, 0, 0, 0, 0, 0, 0],
 [4, 0, 0, 0, 0, 0, 0],
 [5, 0, 0, 0, 0, 0, 0],
 [6, 0, 0, 0, 0, 0, 0],
 [7, 0, 0, 0, 0, 0, 0],
 [8, 0, 0, 0, 0, 0, 0]]
```

We now walk through each of the cells and take the minimum of three potential values:

- The cell to the left in this row plus one
- The cell above in this column plus one
- The cell diagonally up and to the left of this cell, plus the Levenshtein substitution cost for the two characters at this position

This is done with the code:

```
matrix[y][x] = min(
    matrix[y][x-1]+1,
    matrix[y-1][x]+1,
    matrix[y-1][x-1] + lev_sub(s1[x], s2[y]))
```

In the case of [1, 1], we look at [0, 1] and [1, 0]; adding one to each gives
us 2 and 2 respectively. Then we look at [0, 0] and add whatever `lev_sub()`
returns, which in this case is 1, since ‘b’ and ‘k’ differ. The minimum of
(2, 2, 1) is 1, which we then put back into our matrix at [1, 1].

Once we’ve gone through each character in both strings, we’ll end up with a matrix which looks like:

```
[[0, 1, 2, 3, 4, 5, 6],
 [1, 1, 2, 3, 4, 5, 6],
 [2, 2, 2, 3, 4, 5, 5],
 [3, 3, 2, 3, 4, 5, 6],
 [4, 4, 3, 2, 3, 4, 5],
 [5, 5, 4, 3, 2, 3, 4],
 [6, 6, 5, 4, 3, 3, 4],
 [7, 7, 6, 5, 4, 4, 3],
 [8, 8, 7, 6, 5, 5, 4]]
```
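One nice property of this matrix is that each row depends only on the row directly above it, so if you only need the final distance rather than the whole table, two rows are enough. This is my own sketch of that space optimization, not part of the original solution:

```python
def lev_two_rows(s1, s2):
    # Keep only the previous row and the row being built.
    prev = list(range(len(s1) + 1))
    for y, c2 in enumerate(s2, start=1):
        curr = [y]  # distance from the empty prefix of s1
        for x, c1 in enumerate(s1, start=1):
            curr.append(min(
                curr[x - 1] + 1,              # insertion
                prev[x] + 1,                  # removal
                prev[x - 1] + (c1 != c2)))    # substitution
        prev = curr
    return prev[-1]
```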

Back to our example of changing **bitten** to **knitting**, the shortest way to
transform it is in four steps:

- **b**itten -> **k**bitten *(insert a ‘k’)*
- k**b**itten -> k**n**itten *(substitute ‘n’ for ‘b’)*
- knitt**e**n -> knitt**i**n *(substitute ‘i’ for ‘e’)*
- knittin -> knittin**g** *(insert a ‘g’)*

and if we look in our matrix at the last position, [8, 6], we get 4: the minimum number of steps, matching the four edits above.
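Those steps can also be recovered mechanically: rebuild the matrix, then walk back from the bottom-right corner, at each cell choosing a move that could have produced its value. A sketch of that backtracking (the `lev_ops` name and message format are my own):

```python
def lev_ops(s1, s2):
    # Build the distance matrix (same recurrence as before, unit costs).
    n, m = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for x in range(n + 1):
        d[0][x] = x
    for y in range(m + 1):
        d[y][0] = y
    for y in range(1, m + 1):
        for x in range(1, n + 1):
            d[y][x] = min(d[y][x - 1] + 1,
                          d[y - 1][x] + 1,
                          d[y - 1][x - 1] + (s1[x - 1] != s2[y - 1]))
    # Walk back from the corner, recording each cost-1 move.
    ops = []
    y, x = m, n
    while y > 0 or x > 0:
        if y > 0 and x > 0 and d[y][x] == d[y - 1][x - 1] + (s1[x - 1] != s2[y - 1]):
            if s1[x - 1] != s2[y - 1]:
                ops.append(f"substitute {s2[y - 1]!r} for {s1[x - 1]!r}")
            y, x = y - 1, x - 1
        elif y > 0 and d[y][x] == d[y - 1][x] + 1:
            ops.append(f"insert {s2[y - 1]!r}")
            y -= 1
        else:
            ops.append(f"remove {s1[x - 1]!r}")
            x -= 1
    return list(reversed(ops))
```

For **bitten** to **knitting** this recovers an optimal four-edit sequence (there can be more than one shortest path through the matrix, so the exact steps may differ from the ones listed above).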