Storage space can save calculation time. Tables can have common functional evaluations to reduce recursive calls. Another technique is to store preprocess calculations in an array to be used during the final calculations, this is called input enhancement. We consider one famous example for string matching, Boyer-Moore algorithm. First we introduce Horspool's Algorithm which is a simpler example.
Recall the string matching problem, a pattern P[0...m-1] is searched for in a text T[0...n-1]. The brute force algorithm in worst case makes m(n-m+1) comparisons, so the cost is Θ(nm). But on average only a few comparisons are made before shifting the pattern, so the cost is Θ(n). We consider two algorithms that also achieve this cost.
Horspool's algorithm shifts the pattern by looking up shift value in the character of the text aligned with the last character of the pattern in table made during the initialization of the algorithm. The pattern is check with the text from right to left and progresses left to right through the text.
Let c be the character in the text that aligns with the last character of the pattern. If the pattern does not match there are 4 cases to consider.
The mismatch occurs at the last character of the pattern:
Case 1: c does not exist in the pattern (Not the mismatch occurred here) then shift pattern right the size of the pattern.
T[0] ...
S ... T[n-1]
|
LEADER
LEADER
Case 2: The mismatch happens at the last character of the pattern and c does exist in the pattern then the shift should be to the right most c in the m-1 remaining characters of the pattern.
T[0] ...
A ... T[n-1]
|
LEADER
LEADER
The mismatch happens in the middle of the pattern:
Case 3: The mismatch happens in the middle (therefore c is in pattern) and there are no other c in the pattern then the shift should be the pattern length.
T[0] ...
MER ... T[n-1]
|
LEADER
LEADER
Case 4: The mismatch happens in the middle of the pattern but there is other c in pattern then the shift should be the right most c in the m-1 remaining characters of the pattern.
T[0] ...
|
LEADER
LEADER
The table of shift values, table(c), is a table of the entire alphabet of the text and should give
t(c) = m if c is not in the first m-1 characters of the pattern
t(c) = distance of the right most c in the first m-1 characters of the pattern
To make the shift table we initialize the table to m and then scan the pattern left to right writing m-1-j in for the j character in the pattern.
Algorithm ShiftTable(P[0...m-1])
// output Table the size of the alphabet
// and gives the amount to shift the table.
initialize all entries of Table to m
for j ← 0 to m-2 do Table[P[j]] ← m-1-j // character position from end of pattern
return Table
The line
for j ← 0 to m-2 do Table[P[j]] ← m-1-j
Finds the right most location in the pattern of the character
What is the size of the extra storage? What is the cost of making the shift table? Not m, but rather the size of the alphabet.
Now we can use Table in Horspol's algorithm.
Algorithm HorspoolMatching(P[0...m-1], T[0...n-1])
Table ← ShiftTable(P[0...m-1])
i ← m-1 // index for text aligned with the last character of the pattern
while i ≤ n-1 do
k ← 0 // matching index
while k ≤ m-1 and P[m-1-k]==T[i-k] do k++ // check matching
if k = m then return i-m+1 // Match
else i ← i + Table[T[i]] // Shift
return -1
Example
Text: JIMY_HAILED_THE_LEADER_TO_STOP
Pattern: LEADER
JIMY_RAN_AND_HAILED_THE_LEADER_TO_STOP
||
| | | ||
|
LEADER | |
| || |
LEADER | |
|| |
LEADER |
|| |
LEADER || |
LEADER| |
LEADER |
LEADER
The worst case cost is Θ(nm), but for random text is Θ(n).
Horspool's algorithm only used the character value of the text aligned with last character of the pattern to determine the shift. Boyer-Moore algorithm also uses the location and the character mismatch to calculate the shift. In addition it uses the occurrences of the suffixes in the pattern to determine the shift.
Boyer-Moore algorithm calculates two shift values.
Bad-symbol shift: c represents the character of the text at the mismatch
The four case to consider are the same cases as in Horspool's algorithm. Bad-symbol shift also uses the Horspool's table, but c is referenced to the character in the text where the mismatch occurred instead of the last character of the pattern. For cases 1 and 2, c is the same as for Horspool's
Mismatch happens at
the end of the pattern:
Case 1 and 2: If the mismatch is the first character checked (rightmost in the pattern) Boyer-Moore shifts the same as Horspool's algorithm.
Case 1 shifts the pattern m and Case 2 shits to the right most occurrence of c in the pattern, t1(c), where t1 is Hoorspool's shift table.
Mismatch happens in
the middle of the pattern:
Let k be number of correct matches since the last shift, 0 < k < m. And c is now the character value of the mismatch (recall that in Horspool's algorithm c was the text character value aligned with the last character of the pattern.
Case 3: Mismatch happens after k matches and no c in the pattern. Then the shift is
t1(c)-k
where t1 is the shift table used in Horspool's algorithm.
Boyer-Moore should shifts the pattern m characters. Note:
t1(c) = m-1-j,
where j is the right most position of c. This is where the mismatch occurred. Therefore k -1 = j. (Recall that k is the number of good matches.)
t1(c) - k = m-1-j - (j+1) = m.
Note that Horspool's would shift either to the second right most occurrence of the last character in the pattern (<m) or m if there were none.
Case 4: Mismatch happens after k matches and there is a c in the pattern.
We should shift to the right most occurrence of c in the pattern (= t1(c)) if it is to the left of the mismatch location, meaning t1(c)-k > 0.
If the right most occurrence of c is to the right mismatch location, meaning t1(c)-k ≤ 0, then we can only shift by 1.
If t1(c)-k > 0 then it is a valid shift, else (t1(c)-k ≤ 0) it is an invalid shift so use the brute force shift. Then Boyer-Moore algorithm shifts by only 1, brute force.
Note that Horspool's algorithm would shift to second right most occurrence of c in the pattern, this is not always the same as Boyer-Moore's shift, and can be more,
For example
T[0] ...
AER ... T[n-1]
|||
BARBER
BARBER // Boyer-Moore shift
BARBER // Horspool shift is more
Summarize the bad-symbol shift:
Cases 1, 2 are the same as Horspool's shift, t1(c), in these cases k = 0 so the shifts are t1(c) - k.
Case 3 shifts t1(c)-k = m.
Case 4 shifts max(t1(c)-k, 1)
The shift amount d1 = max(t1(c)-k, 1)
Good-suffix shift: occurrences of suffix in the pattern
Let suff(k) the string representing the matching suffix.
Case 1: If there is another occurrence of suff(k) in the pattern then we should shift d2, the distance between suff(k) and the right most repeated occurrences. This distance is the measured between the start of both suffix occurrences.
Case 2: If there is not a second occurrence of suff(k) in the pattern then we might expect that we should shift the length of the pattern. This is not quite correct because there could be substring in suff(k) that matches a prefix of the pattern. The shift should align the prefix with matched suff(k) in the text.
So if there is not a right most of occurrence of suff(k) then the shift, d2, should be the distance between the largest prefix matching a substring in suff(k).
Case 2 example:
T[0] ...
ABAB ... T[n-1]
||||
ABCBAB // k=3
ABCBAB // Shift by k, wrong
ABCBAB // Shift by d2, correct
Good-suffix shit table example:
k |
pattern |
d2 |
reason |
1 |
ABCBAB |
2 |
Second occurrence |
2 |
ABCBAB |
4 |
Second occurrence |
3 |
ABCBAB |
4 |
Suffix has prefix |
4 |
ABCBAB |
4 |
Suffix has prefix |
5 |
ABCBAB |
4 |
Suffix has prefix |
Bold represents the suffix and italic represents either the matching second occurrence of the suffix or matching prefix.
Boyer-Moore Algorithm outline
1. Using the pattern and the text, construct the bad-symbol shift table as for Horspool's algorithm
2. Using the pattern, construct the good-suffix shift table
3. Align the pattern against the beginning of the text
4. Repeat until match occurs or the pattern reaches the end of the text.
Starting with the last character of the pattern and check the matching.
Track the number of matches, k.
Calculate d1 = max(t1(c)-k, 1).
If k > 0 also get d2 from the good suffix shift table,
shift d = max(d1, d2).
The worst case cost is linear in text size
Example
Pattern: BAOBAB
Text: BESS_KNEW_ABOUT_BAOBABS
Make the bad shift table, t1
c |
A |
B |
C |
D |
... |
O |
... |
t1(c) |
1 |
2 |
6 |
6 |
6 |
3 |
6 |
Make good suffix shift table, t2
k |
pattern |
d2 |
1 |
BAOBAB |
2 |
2 |
BAOBAB |
5 |
3 |
BAOBAB |
5 |
4 |
BAOBAB |
5 |
5 |
BAOBAB |
5 |
BESS_KNEW_ABOUT_BAOBABS
|
||| |||||||
BAOBAB ||| ||||||| // k = 0, t1(c) = 6, d1= max(6-0, 1) = 6, (k = 0, no need to check t2), d = 6
|||
|||||||
BAOBAB ||||||| // k = 2, t1(c) = 6, d1= max(6-2, 1) = 4, t2 = 5, d = 5
|||||||
BAOBAB||||| // k = 1, t1(c) = 6, d1= max(6-1, 1) = 5, t2 = 2, d = 5
||||||
BAOBAB // match