IMMENSE.LY

IT’S THAT BIG

Levenshtein Distance

A few weeks ago I posted about One Way Changes, which was one of the coding challenges in Cracking the Coding Interview. A One Way Change would check to see if two words, such as “flood” and “floor” are one change away from each other by removing, inserting, or substituting one character for another one.

I’ve been thinking a lot about search recently, and learning about where the industry is at with NLP. One of the first things I ran in to was an exercise for calculating how many inserted, removed, or substituted characters it takes to get from one word to another. This is the minimum edit distance or the Levenshtein distance between two words, and it seemed strange to me that CtCI didn’t include this problem in the Recursion and Dynamic Programming chapter, given that it superficially feels really close to the One Way Change exercise.

Surprisingly though, the easiest way you can calculate the Levenshtein distance is unrelated to actually inserting, removing, or substituting characters into your first string. I guess that shouldn’t be so suprising given how many possible characters there are that you could potentially try, but my first intuition was to try to think about how I could reuse my solution to the One Way Changes problem.

A far better solution involves calculating how close each potential substring of each string is to each other substring. The algorithm I ended up implementing was essentially the Wagner-Fisher algorithm which iteratively builds up a matrix of all possible substrings. Here was my solution:

def lev_distance(s1, s2, lev_cost=1):
    def lev_sub(c1, c2):
        if c1 != c2:
            return lev_cost
        return 0

    s1 = " " + s1
    s2 = " " + s2

    n = len(s1)
    m = len(s2)

    matrix = [[0 for x in range(n)] for y in range(m)]

    for x in range(n):
        matrix[0][x] = x
    for y in range(m):
        matrix[y][0] = y

    for y in range(1, m):
        for x in range(1, n):
            matrix[y][x] = min(
	        matrix[y][x-1]+1,
		matrix[y-1][x]+1,
		matrix[y-1][x-1] + lev_sub(s1[x], s2[y]))

    return matrix[m-1][n-1]

The first part of the function:

    matrix = [[0 for x in range(n)] for y in range(m)]

    for x in range(n):
        matrix[0][x] = x
    for y in range(m):
        matrix[y][0] = y

creates a zero matrix and then adds an index of both words into the first row and column of the matrix. If we had the two words bitten and knitting, we would end up with:

[[0, 1, 2, 3, 4, 5, 6],
 [1, 0, 0, 0, 0, 0, 0],
 [2, 0, 0, 0, 0, 0, 0],
 [3, 0, 0, 0, 0, 0, 0],
 [4, 0, 0, 0, 0, 0, 0],
 [5, 0, 0, 0, 0, 0, 0],
 [6, 0, 0, 0, 0, 0, 0]]
 [7, 0, 0, 0, 0, 0, 0]]
 [8, 0, 0, 0, 0, 0, 0]]

We now walk through each of the cells and take the minimum of three potential values:

  1. The cell to the left in this row plus one
  2. The cell above in this column plus one
  3. The cell to the left and above this cell plus the value from our levenshtein cost for the two characters at this place

This is done with the code:

    matrix[y][x] = min(
        matrix[y][x-1]+1,
        matrix[y-1][x]+1,
        matrix[y-1][x-1] + lev_sub(s1[x], s2[y]))

In the case of [1,1], we look at [0,1] and [1,0] we get a value of 2 and 2 for each respectively, and we look at [0,0] and add whatever lev_sub() returns, which in this case is a 1. The minimum of (2, 2, 1) is 1, which we then put back into our matrix at [1, 1].

Once we’ve gone through each character in both strings, we’ll end up with a matrix which looks like:

[[0, 1, 2, 3, 4, 5, 6],
 [1, 1, 2, 3, 4, 5, 6],
 [2, 2, 2, 3, 4, 5, 5],
 [3, 3, 2, 3, 4, 5, 6],
 [4, 4, 3, 2, 3, 4, 5],
 [5, 5, 4, 3, 2, 3, 4],
 [6, 6, 5, 4, 3, 3, 4],
 [7, 7, 6, 5, 4, 4, 3],
 [8, 8, 7, 6, 5, 5, 4]]

Back to our example of changing bitten to knitting, the shortest way to transform it is in four steps:

  1. bitten -> kbitten (insert a ‘k’)
  2. kbitten -> knitten (substitute ‘n’ for ‘b’)
  3. knitten -> knittin (substitute ‘i’ for ‘e’)
  4. knittin -> knitting (insert a ‘g’)

and if we look in our matrix in the last position [8,6], we get our result for the shortest number of steps.