Testing Text Similarity

Subsequence

A subsequence of string X=x0x1x2...xn-1 is a string xi1xi2...xik such that ij < ij+1.  So the sequence of characters are not necessarily sequential but are in order.

Illustrate

Longest Common Subsequence

The longest common subsequence (LCS) is as implied by the phase is a subsequence of two strings; X=x0x1x2...xn-1 and Y=y0y1...ym-1 such it is the longest.  Note the LCS is not necessarily unique.

A current research topic in bio-informatics.  DNA are represented as strings of the small alphabet, Sigma = {A, C, G, T}.    DNA strands/strings can be changed by mutation by insertion of a character  into the string.  A longest common subsequence of two strings can represent the common ancestry of the two strings.

Finding the LCS

Brute Force Algorithm

We enumerate all the subsequences in X and then start with the longest check if it is a subsequence of Y.  Because a character in X can either be a member or not a member of the subsequence the number of subsequence is 2n.  Checking if the sequence is a subsequence of Y takes O(m).  So worst case running time is O(m2n).  This is called exponential time, to be avoided at all cost even at the cost of more space.

Dynamic Programing 

A technique for solving optimization problem that are exponential time using brute force algorithmic problems, which is polynomial time.  Dynamic programing is similar to the greedy algorithm in that we globally optimize by locally optimizing a sub problem.  Dynamic programing is different from the greedy algorithm in that it uses an additional data structure, a table or multidimensional array.  Each entry in the table represents optimal value for a particular sub problem.  The indices represent a particular sub problem.  So the sub problem must be representable by integers given the global problem.

Dynamic Programing for LCS

L[i, j] is our table representing the length of the LCS for strings X[0..i] and Y[0..j].  Where i represent the character in string X and j in Y.  The goal is to construct the table using the rules:

The table is built either row wise or column wise.  L[i, j] represents the length of the LCS for x0x1...xi and y0y1...yj, the sub problem.    Note that the LCS is built adding one match at a time search both the X or Y. In the finish table L[n, m] is the length of the LCS.

Algorithm LCS(X, Y)

Input: Strings X and Y with n and m elements
Output: The table L[i, j]  the LCS for the sub problems x0x1...xi and y0y1...yj,
for i = -1 to n-1 do
    L[i, -1] = 0
for j = 0 to m -1 do
    L[-1, j] = 0
for i = 0 to n-1 do
    for j = 0 to m-1 do
if xi == yj then
    L[i, j]  = L[i-1, j-1] +1
else
    L[i, j] = max{ L[i-1, j], L[i, j-1]}
return L

Cost

The running time is clearly O(nm) due to the nested for loops. The additional space is also O(nm) for the table L.  The table L is only necessary if we want to construct the LCS.  If we only want to find the length of the LCS then we only need to keep the previous row and column of L.  So additional storage is O(n + m).

Constructing the LCS from L

We build the LCS from back to front. Starting at L[n, m] we move through L towards one of the sides.  If counter corner value of L is one less then that means in constructing L we found a match and xi == yj = c so c is the next to last character of a LCS.  (Recall we are building the sequence in reverse).  If the counter corner is equal then we move up the column or down row depending which has the largest L value.  If the row and column have the same value this means that the LCS is not unique and it does not matter which direction we go.  Note we record a character of the LCS only when we move counter corner.  The running time is O(n+m).