Lecture 21: Pattern Matching, Part II

COSC 225: Algorithms and Visualization

Spring, 2023

Announcement

Final Project, Preliminary Presentations

Monday, May 1 in class
includes a substantial component of your project
demos in small groups
give/receive constructive feedback on projects

Last Time

Input.

a large text, TEXT
a smaller text, PATTERN

Output.

“yes” if TEXT contains PATTERN as a substring
or, starting index of first instance of PATTERN in TEXT
- -1 if PATTERN does not appear

Naive Pattern Matching

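idx = 0;      // current alignment of PATTERN within TEXT
matches = 0;  // number of PATTERN characters matched at this alignment
while (idx <= TEXT.length - PATTERN.length) {
  if (matches == PATTERN.length) return idx;  // full match starting at idx
  if (TEXT[idx + matches] == PATTERN[matches]) {
    matches++;
  } else {
    idx++;        // mismatch: shift by one and start over
    matches = 0;
  }
}
return -1;  // PATTERN does not appear in TEXT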

Illustration of Procedure

lec21-naive-pattern-matching.zip

Credits

Display: Emi, Megan, Sarah
Input: Ramisa, Mariam, Sulagna
Algo: Max, Neville (modified by Will)

Unforeseen Problem

Resetting the state of the algorithm

after updating text or pattern

Method should be called by update buttons (Input Group), but implementation should be handled by algorithm (Algo Group)

resetState method

A Better Design?

Search Efficiency

Throughout:

$n$ is the length of the TEXT
$m$ is the length of the PATTERN

Last Time, we showed worst-case running time is $\Theta(n \cdot m)$.

example:
- TEXT = 'aaaaaaaaaaaaa...a'
- PATTERN = 'aaaa....ab'
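To make the $\Theta(n \cdot m)$ behavior concrete, here is a minimal sketch that counts character comparisons on this bad example. The function name `naiveSearchCounting` and the sizes `n = 1000`, `m = 100` are assumptions chosen for illustration; they are not taken from the class demo code.

```javascript
// Hypothetical helper: naive search that also counts character comparisons.
function naiveSearchCounting(text, pattern) {
  let idx = 0;         // current alignment of pattern within text
  let matches = 0;     // characters matched at this alignment
  let comparisons = 0; // total character comparisons performed
  while (idx <= text.length - pattern.length) {
    if (matches === pattern.length) return { index: idx, comparisons };
    comparisons++;
    if (text[idx + matches] === pattern[matches]) {
      matches++;
    } else {
      idx++;
      matches = 0;
    }
  }
  return { index: -1, comparisons };
}

// Bad example: TEXT is all 'a', PATTERN is 'a...ab' (sizes are arbitrary).
const n = 1000, m = 100;
const text = "a".repeat(n);
const pattern = "a".repeat(m - 1) + "b";
const result = naiveSearchCounting(text, pattern);
console.log(result.index, result.comparisons); // -1 90100
```

Each of the $n - m + 1$ alignments matches $m - 1$ characters before failing on the last one, so the count is $(n - m + 1) \cdot m$.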

Redundant Work

see illustration of bad example

Question. What redundant/unnecessary work is being done by the algorithm?

Mis/Matches

What does naive pattern matching do?

A Puzzle

Suppose the following string is matched up to index i = 3, and mismatched at index i = 4. What should our next comparison be?

Another Puzzle

Suppose the following string is matched up to index i = 4, and mismatched at index i = 5. What should our next comparison be?

Next Move?

Next Move!

A Challenge

Question. Can we perform pattern matching search in such a way that the textIndex never decreases?

Why should this be possible?

if we’ve matched characters in TEXT, then we know they are the same as the previous characters in PATTERN
we can just read these off of the pattern itself
better yet, we can pre-compute the next offsets for each mis-match

Establishing Notation

Pattern is $P$, length $m$
$P_k = P[0..k]$ is the prefix of length $k+1$
A suffix of length $\ell$ is the last $\ell$ elements of a (sub)pattern

A Shift Condition?

Question. Suppose we’ve matched $P_{k}$ with our text, but $P[k+1]$ is a mismatch. Under what condition can we match $P_i$ with our text?

A Shift Condition!

Question. Suppose we’ve matched $P_{k}$ with our text, but $P[k+1]$ is a mismatch. Under what condition can we match $P_i$ with our text?

Answer. We can match $P_i$ with the text if $P_i$ is a suffix of $P_k$

The Prefix Function

Definition. Given a pattern $P$ of length $m$, the associated prefix function is an array $\pi$ of length $m$ defined as follows:

$\pi[k] = i$ if $P_{i-1}$ is the longest proper prefix of $P$ (that is, $i \leq k$) that is a suffix of $P_k$, and $\pi[k] = 0$ if no such prefix exists
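As a sanity check on the definition, the following sketch computes $\pi$ by brute force, reading "longest prefix" as the longest proper prefix (strictly shorter than $P_k$ itself). The function name `prefixFunctionBruteForce` and the sample pattern `"abacab"` are assumptions made for illustration, and the code is not meant to be efficient.

```javascript
// Hypothetical helper: compute the prefix function directly from the definition.
function prefixFunctionBruteForce(pattern) {
  const m = pattern.length;
  const pi = new Array(m).fill(0);
  for (let k = 0; k < m; k++) {
    const prefixK = pattern.slice(0, k + 1); // P_k = P[0..k]
    // Try proper prefix lengths from longest (k) down to shortest (1).
    for (let len = k; len >= 1; len--) {
      if (prefixK.endsWith(pattern.slice(0, len))) {
        pi[k] = len; // P_{len-1} is a suffix of P_k
        break;
      }
    }
  }
  return pi;
}

console.log(prefixFunctionBruteForce("abacab")); // [ 0, 0, 1, 0, 1, 2 ]
```

For example, $\pi[5] = 2$ because `"ab"` is the longest proper prefix of `"abacab"` that is also a suffix of it.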

Activity

Write the prefix function of this pattern:

Faster Pattern Matching

Question. Given the prefix function $\pi$, how can we compute matches faster?

Idea.

Deal with first character mismatches as in naive strategy
Use matches and $\pi$ to do more efficient shifts:
- if first mis-match at index $k+1$, we know matched up to index $k$
- we know that for $i = \pi[k]$, $P_{i-1}$ is a suffix of $P_k$
  - $\implies P_{i-1}$ also matches the text
- choose next shift to align $P_{i-1}$:
  - amount is an additional $k + 1 - i$

Knuth-Morris-Pratt Search

In Pseudo-code!

T a text of length n, P a pattern of length m, pi the prefix function of P

let matched = 0
for (i from 0 to n - 1):
    while matched > 0 and P[matched] != T[i]
        matched = pi[matched - 1]
    if P[matched] == T[i]
        matched++
    if matched == m
        return i - m + 1
return -1
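The search loop can be written as runnable JavaScript roughly as follows. This is a sketch, not the class demo code: the name `kmpSearch` is an assumption, `pi[k]` is taken to be the length of the longest proper prefix of the pattern that is a suffix of `P[0..k]`, and the function returns the starting index of the first match (or -1), in line with the Output specification.

```javascript
// Hypothetical sketch of KMP search; pi must be the prefix function of pattern.
function kmpSearch(text, pattern, pi) {
  const n = text.length;
  const m = pattern.length;
  if (m === 0) return 0; // the empty pattern matches at index 0
  let matched = 0;       // number of pattern characters currently matched
  for (let i = 0; i < n; i++) {
    // On a mismatch, fall back to the longest prefix that still matches the text.
    while (matched > 0 && pattern[matched] !== text[i]) {
      matched = pi[matched - 1];
    }
    if (pattern[matched] === text[i]) matched++;
    if (matched === m) return i - m + 1; // start index of the match
  }
  return -1;
}

// Prefix function of "abacab", worked out by hand for this example.
const samplePi = [0, 0, 1, 0, 1, 2];
console.log(kmpSearch("abacaabacab", "abacab", samplePi)); // 5
```

Note that the text index `i` only ever moves forward; only `matched` falls back.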

Demo Time!

lec21-kmp-pattern-matching

Running Time

Question. What is the running time of the method?

let matched = 0
for (i from 0 to n - 1):
    while matched > 0 and P[matched] != T[i]
        matched = pi[matched - 1]
    if P[matched] == T[i]
        matched++
    if matched == m
        return i - m + 1
return -1

Amortized Analysis!

Observations.

the while loop does at most matched iterations
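each while loop iteration decreases matched by at least 1, so in order to do k iterations, matched must have been incremented at least k times
each iteration of the for loop increments matched at most once
$\implies$ total number of while loop iterations is $\leq$ number of for loop iterations, which is $n$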
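One way to write the accounting out, as a sketch: let $I$ be the total number of times matched is incremented and $W$ the total number of while loop iterations over the whole run (the symbols $I$ and $W$ are introduced here for illustration). Each while loop iteration decreases matched by at least 1, and matched is never negative, so

$$
0 \leq \text{matched} \leq I - W \implies W \leq I \leq n.
$$

The total work is then proportional to the for loop iterations plus the while loop iterations:

$$
n + W \leq 2n = O(n).
$$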

Still To Do

Computing $\pi$ of $P$ efficiently

easy: $O(m^2)$ time
not as easy: $O(m)$ time
- use dynamic programming (see demo code)
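A common way to get $O(m)$ time is to reuse already computed entries of $\pi$, in effect matching the pattern against itself. The sketch below shows that standard approach; it is not necessarily what the demo code does, and the function name `prefixFunction` is an assumption.

```javascript
// Hypothetical sketch: linear-time prefix function.
// pi[k] = length of the longest proper prefix of pattern that is a suffix of pattern[0..k].
function prefixFunction(pattern) {
  const m = pattern.length;
  const pi = new Array(m).fill(0);
  let matched = 0; // value of pi[k - 1] carried into iteration k
  for (let k = 1; k < m; k++) {
    // Fall back through shorter candidate prefixes until one can be extended.
    while (matched > 0 && pattern[matched] !== pattern[k]) {
      matched = pi[matched - 1];
    }
    if (pattern[matched] === pattern[k]) matched++;
    pi[k] = matched;
  }
  return pi;
}

console.log(prefixFunction("abacab")); // [ 0, 0, 1, 0, 1, 2 ]
```

The same amortized argument as for the search applies: `matched` increases at most once per iteration of the for loop, so the while loop runs at most $m$ times in total.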