Space and Time Tradeoffs

Storage space can save calculation time. Tables can have common functional evaluations to reduce recursive calls. Another technique is to store preprocess calculations in an array to be used during the final calculations, this is called input enhancement. We consider one famous example for string matching, Boyer-Moore algorithm. First we introduce Horspool's Algorithm which is a simpler example.

 

Input Enhancement in String Matching

Recall the string matching problem, a pattern P[0...m-1] is searched for in a text T[0...n-1]. The brute force algorithm in worst case makes m(n-m+1) comparisons, so the cost is Θ(nm). But on average only a few comparisons are made before shifting the pattern, so the cost is Θ(n). We consider two algorithms that also achieve this cost.

Horspool's Algorithm

Horspool's algorithm shifts the pattern by looking up shift value in the character of the text aligned with the last character of the pattern in table made during the initialization of the algorithm. The pattern is check with the text from right to left and progresses left to right through the text.

 

Let c be the character in the text that aligns with the last character of the pattern. If the pattern does not match there are 4 cases to consider.

 

The mismatch occurs at the last character of the pattern:

Case 1: c does not exist in the pattern (Not the mismatch occurred here) then shift pattern right the size of the pattern.

 

T[0] ...      S      ... T[n-1]

              |

         LEADER

                                     LEADER

 

Case 2: The mismatch happens at the last character of the pattern and c does exist in the pattern then the shift should be to the right most c in the m-1 remaining characters of the pattern.

 

T[0] ...      A      ... T[n-1]

              |

         LEADER

                              LEADER

 

The mismatch happens in the middle of the pattern:

Case 3: The mismatch happens in the middle (therefore c is in pattern) and there are no other c in the pattern then the shift should be the pattern length.

 

 

T[0] ...    MER      ... T[n-1]

            |

         LEADER

                                     LEADER

 

 

Case 4: The mismatch happens in the middle of the pattern but there is other c in pattern then the shift should be the right most c in the m-1 remaining characters of the pattern.

 

T[0] ...   EDER      ... T[n-1]

           |

         LEADER

                 LEADER

 

 

The table of shift values, table(c), is a table of the entire alphabet of the text and should give

t(c) = m if c is not in the first m-1 characters of the pattern

t(c) = distance of the right most c in the first m-1 characters of the pattern

 

To make the shift table we initialize the table to m and then scan the pattern left to right writing m-1-j in for the j character in the pattern.

 

Algorithm ShiftTable(P[0...m-1])

// output Table the size of the alphabet

// and gives the amount to shift the table.

initialize all entries of Table to m

for j ← 0 to m-2 do Table[P[j]] ← m-1-j // character position from end of pattern

return Table

 

 

The line

for j ← 0 to m-2 do Table[P[j]] ← m-1-j

Finds the right most location in the pattern of the character

 

What is the size of the extra storage? What is the cost of making the shift table? Not m, but rather the size of the alphabet.

 

Now we can use Table in Horspol's algorithm.

 

 

 

Algorithm HorspoolMatching(P[0...m-1], T[0...n-1])

TableShiftTable(P[0...m-1])

im-1 // index for text aligned with the last character of the pattern

while in-1 do

k ← 0 // matching index

while km-1 and P[m-1-k]==T[i-k] do k++ // check matching

if k = m then return i-m+1 // Match

else ii + Table[T[i]] // Shift

return -1

 

 

Example

 

Text: JIMY_HAILED_THE_LEADER_TO_STOP

Pattern: LEADER

 

JIMY_RAN_AND_HAILED_THE_LEADER_TO_STOP

    ||     | |     |     ||  |

LEADER     | |     |     ||  |

      LEADER |     |     ||  |

        LEADER     |     ||  |

              LEADER     ||  |

                    LEADER|  |

                     LEADER  |

                        LEADER

 

 

The worst case cost is Θ(nm), but for random text is Θ(n).

 

Boyer-Moore Algorithm

Horspool's algorithm only used the character value of the text aligned with last character of the pattern to determine the shift. Boyer-Moore algorithm also uses the location and the character mismatch to calculate the shift. In addition it uses the occurrences of the suffixes in the pattern to determine the shift. 

 

Boyer-Moore algorithm calculates two shift values.

 

Bad-symbol shift: c represents the character of the text at the mismatch

The four case to consider are the same cases as in Horspool's algorithm. Bad-symbol shift also uses the Horspool's table, but c is referenced to the character in the text where the mismatch occurred instead of the last character of the pattern. For cases 1 and 2, c  is the same as for Horspool's

 

Mismatch happens at the end of the pattern:

Case 1 and 2: If the mismatch is the first character checked (rightmost in the pattern) Boyer-Moore shifts the same as Horspool's algorithm.

 

Case 1 shifts the pattern m and Case 2 shits to the right most occurrence of c in the pattern, t1(c), where t1 is Hoorspool's shift table.

 

Mismatch happens in the middle of the pattern:

Let k be number of correct matches since the last shift, 0 < k < m. And c is now the character value of the mismatch (recall that in Horspool's algorithm c was the text character value aligned with the last character of the pattern.

 

Case 3: Mismatch happens after k matches and no c in the pattern. Then the shift is

t1(c)-k

where t1 is the shift table used in Horspool's algorithm.

Boyer-Moore should shifts the pattern m characters. Note:

t1(c)  = m-1-j,

where j is the right most position of c. This is where the mismatch occurred. Therefore k -1 = j. (Recall that k is the number of good matches.)

t1(c) - k = m-1-j - (j+1) = m.

 

Note that Horspool's would shift either to the second right most occurrence of the last character in the pattern (<m) or m if there were none.

 

Case 4: Mismatch happens after k matches and there is a c in the pattern.

We should shift to the right most occurrence of c in the pattern (= t1(c)) if it is to the left of the mismatch location, meaning t1(c)-k > 0.

 

If the right most occurrence of c is to the right mismatch location, meaning t1(c)-k ≤ 0, then we can only shift by 1.

 

If t1(c)-k > 0 then it is a valid shift, else (t1(c)-k ≤ 0) it is an invalid shift so use the brute force shift. Then Boyer-Moore algorithm shifts by only 1, brute force.

 

Note that Horspool's algorithm would shift to second right most occurrence of c in the pattern, this is not always the same as Boyer-Moore's shift, and can be more,

For example

 

T[0] ...    AER      ... T[n-1]

            |||

         BARBER 

           BARBER  // Boyer-Moore shift

            BARBER // Horspool shift is more

 

 

Summarize the bad-symbol shift:

Cases 1, 2 are the same as Horspool's shift, t1(c), in these cases k = 0 so the shifts are t1(c) - k.

Case 3 shifts t1(c)-k = m.

Case 4 shifts max(t1(c)-k, 1)

 

The shift amount d1 = max(t1(c)-k, 1)

 

Good-suffix shift: occurrences of suffix in the pattern

Let suff(k) the string representing the matching suffix.

 

Case 1: If there is another occurrence of suff(k) in the pattern then we should shift d2, the distance between suff(k) and the right most repeated occurrences. This distance is the measured between the start of both suffix occurrences.

 

Case 2: If there is not a second occurrence of suff(k) in the pattern then we might expect that we should shift the length of the pattern. This is not quite correct because there could be substring in suff(k) that matches a prefix of the pattern. The shift should align the prefix with matched suff(k) in the text.

 

So if there is not a right most of occurrence of suff(k) then the shift, d2, should be the distance between the largest prefix matching a substring in suff(k).

 

Case 2 example:

 

T[0] ...   ABAB      ... T[n-1]

           ||||

         ABCBAB  // k=3

            ABCBAB  // Shift by k, wrong

           ABCBAB  // Shift by d2, correct

 

Good-suffix shit table example:

 

k

pattern

d2

reason

1

ABCBAB

2

Second occurrence

2

ABCBAB

4

Second occurrence

3

ABCBAB

4

Suffix has prefix

4

ABCBAB

4

Suffix has prefix

5

ABCBAB

4

Suffix has prefix

 

Bold represents the suffix and italic represents either the matching second occurrence of the suffix or matching prefix.

 

 

Boyer-Moore Algorithm outline

1. Using the pattern and the text, construct the bad-symbol shift table as for Horspool's algorithm

2. Using the pattern, construct the good-suffix shift table

3. Align the pattern against the beginning of the text

4. Repeat until match occurs or the pattern reaches the end of the text.

Starting with the last character of the pattern and check the matching.

Track the number of matches, k.

Calculate d1 = max(t1(c)-k, 1).

If k > 0 also get d2 from the good suffix shift table,

shift d = max(d1, d2).

 

 

The worst case cost is linear in text size

 

Example

 

Pattern: BAOBAB

Text: BESS_KNEW_ABOUT_BAOBABS

 

Make the bad shift table, t1

c

A

B

C

D

...

O

...

t1(c)

1

2

6

6

6

3

6

 

 

Make good suffix shift table, t2

k

pattern

d2

1

BAOBAB

2

2

BAOBAB

5

3

BAOBAB

5

4

BAOBAB

5

5

BAOBAB

5

 

 

BESS_KNEW_ABOUT_BAOBABS

     |   |||   |||||||

BAOBAB   |||   |||||||      // k = 0, t1(c) = 6, d1= max(6-0, 1) = 6, (k = 0, no need to check t2), d = 6

         |||   |||||||

      BAOBAB   |||||||     // k = 2, t1(c) = 6, d1= max(6-2, 1) = 4, t2 = 5, d = 5

               |||||||

           BAOBAB|||||     // k = 1, t1(c) = 6, d1= max(6-1, 1) = 5, t2 = 2, d = 5

                ||||||

                BAOBAB     // match