13.7 An Extended Example (Searching in Text)

One of the most commonly used functions in word processing (and other text manipulation) is that of searching some body of text (the target) for a particular pattern of letters (the pattern.) For example, the ISO module Strings has

PROCEDURE FindNext(pattern, stringToSearch : ARRAY OF CHAR;
           startIndex : CARDINAL;
           VAR patternFound : BOOLEAN;
           VAR posOfPattern : CARDINAL);

The simplest and most natural algorithm for implementing such a search is to count through the target until the first letter of the pattern is matched, then count on both strings to see if the match continues. If it does for the entire pattern, the whole string matches. If a mismatch is found, one then continues to try to match the first character of the pattern with characters in the target.

Example:

Find the pattern "sip" in the target "Mississippi."

Progress:

Compare "s" to "M" and "i" and, finding no match, advance.

Finding a match at "s" look at "i" which fails to match.

Mississippi
  sip

The next try letter matches "s," again and then also "i", but not "p."

Mississippi
   sip

At the next position there is a mismatch, and then there is another match at "s," but the next letter fails. On the next advance, one has

Mississippi
      sip

and here the desired match is obtained. If m and n denote the lengths of the two strings, this (brute force) algorithm may require up to mn comparisons, that is, it is O(mn). Here is an implementation of this procedure :

PROCEDURE FindNext(pattern, stringToSearch : ARRAY OF CHAR;
           startIndex : CARDINAL;
           VAR patternFound : BOOLEAN;
           VAR posOfPattern : CARDINAL);
VAR
  countSi : CARDINAL; (* initial stringToSearch position count *)
  countP : CARDINAL; (* pattern count *)
  countSr : CARDINAL; (* count to the right of stringToSearch start *)
  endP : CARDINAL; (* last pattern position *)
  lenS : CARDINAL; (* length of stringToSearch *)
  ch : CHAR; (* temp holder to reduce referencing *)
  
BEGIN
  patternFound := FALSE;
  endP := LENGTH (pattern);
  IF endP <> 0 (* if is zero, do nothing *)
    THEN
      DEC(endP); (* last position used is one less than length *)
      lenS := LENGTH(stringToSearch);
      countSi := startIndex;

      IF endP = 0
        THEN (* was one character only *)
          ch := pattern[0]; (* save it *)
          WHILE (countSi < lenS) AND (ch <> stringToSearch[countSi])
            DO (* and go compare *)
              INC(countSi);
            END; (* while count *)
          
          IF countSi < lenS
            THEN (* we exited the loop before end of string *)
              patternFound := TRUE; (* so we got that one match *)
              posOfPattern := countSi;
              RETURN; (* tell the world so *)
            END; (* if countSi < lenS *)

        ELSE (* more than one character to be matched *)
          ch := pattern[endP]; (* start at back end *)
          WHILE countSi + endP < lenS
            DO
              IF ch = stringToSearch[countSi+endP]
                THEN (* got a first character match *)
                  countP := 0; (* do look for more *)
                  countSr := countSi;
                  LOOP
                    IF pattern[countP] <> stringToSearch[countSr]
                      THEN
                        EXIT;
                      END; (* if pattern[countP] *)
                    INC(countP);
                    INC(countSr);
                    IF countP = endP (* match whole thing *)
                      THEN
                        patternFound := TRUE;
                        posOfPattern := countSi;
                        RETURN;
                      END; (* if countP = endP *)
                  END; (* loop *)
                END; (* if ch = stringToSearch *)
            INC(countSi);
          END; (* while *)
        END; (* if endP = 0 *)
    END; (* endP <> 0 *)
END FindNext;

However, a close examination of the progress of this algorithm shows us that some comparisons are not necessary. If for instance, one has matched the first two letters "si" and there is then a mismatch, one already knows that the initial "s" cannot match the second letter "i" and so the search can be advanced by two places rather than one, and one comparison can be saved. Sometimes the potential savings are more dramatic:

Example:

Find the pattern "gead" in the target "geaageabgeacgead"

Progress:

geaageabgeacgead
gead

The first three letters match, but when the fourth letter fails to match, they are known not to match the first letter either, so the pattern can be moved over to:

geaageabgeacgead
    gead

then to

geaageabgeacgead
        gead

and finally to

geaageabgeacgead
            gead

where a match on all four characters is finally obtained. Notice that this particular search turns out to be O(n), where n is the length of the target string. However, the ability to slide the matching process over by three characters depends on the fact that there are no repetitions in the pattern string. If there are repetitions, the amount that the search can be shifted depends on the place where the repetition is found in the pattern.

Example:

Find the pattern "papa" in the target "papuapapyruspapa"

Progress:

papuapapyruspapa
papa

This time, when fails to match, the pattern can be moved over only by two places because of the repetition of the first letter.

papuapapyruspapa
  papa

Similar considerations apply to a later match of "pap" followed by a mismatch.

In order to implement this strategy, therefore, it is necessary to search the pattern within itself looking for repetitions in order to determine how far the pattern can be moved after a failure at each position. Since this means examining the pattern, and then following up by examining the target, this algorithm is O(m + n). It is called the Knuth-Morris-Pratt algorithm, after the three who originated and refined it.

Knuth-Morris-Pratt search =
  scan pattern and construct "backup" vector (table)
  scan the target
    while characters match, advance in both pattern and target
    on mismatch, shift by amount in "backup" vector for that position

Example:

Find the pattern "cashcar" in the target "xcucatcastcashewcashcucashcatcashcart"

Step 1

(construct backup vector): (T = target and P = pattern)

Matched Mismatch@ backup meaning action
pos letter [j]
none 0 c -1 [0] # c next T pos, resume P at 0
c 1 a 0 [1] # a maybe c keep T pos, resume P at 0
ca 2 s 0 [2] # s maybe c keep T pos, resume P at 0
cas 3 h 0 [3] # h maybe c keep T pos, resume P at 0
cash 4 c -1 [4] # c next T pos, resume P at 0
cashc 5 a 0 [5] # a maybe c keep T pos, resume P at 0
cashca 6 r 2 [6] # r, got "ca" next T pos, resume P at 2

The term "backup" refers to the position in the pattern that the search can be backed up to in order to continue examining the target without wasting comparisons. The value -1 will serve to flag that the position in the target is to be incremented for the next comparison. When the scan "backs up" to the position -1, both the pattern and target counters will be incremented; when it backs up to the position 0 (or to any other positive number), only the target position will be incremented before continuing comparisons. Here is the KNP (Knuth-Morris-Pratt) version of the FindNext procedure:

PROCEDURE FindNext (pattern, stringToSearch : ARRAY OF CHAR;
           startIndex : CARDINAL;
           VAR patternFound : BOOLEAN;
           VAR posOfPattern : CARDINAL);

(* Knuth Morris Pratt version 
  adapted by R. Sutcliffe
  last modified 1995 05 04 *)

VAR
  count1, count2, sCount, pCount, patternLen, targetLength: INTEGER;
  backup : Vector;
  
BEGIN 
  targetLength := LENGTH (stringToSearch);
  patternLen := LENGTH (pattern);

(* construct backup vector *)
  count1 := 0;
  count2 := -1;
  backup [0] := -1;
  REPEAT  (* look for match in pattern with itself *)
    IF (count2 = -1) (* starting (over) *)
       OR (pattern [count1] = pattern [count2]) (* (still) matching *)
      THEN
        INC(count1); (* keep going *)
        INC(count2);
        IF pattern [count1] # pattern [count2]  (* no match there *)
          THEN  (* keep same count *)
             backup [count1] := count2;
          ELSE (* keep same backup *)
            backup [count1] := backup [count2];
          END; (* if pattern *)
      ELSE
        count2 := backup[count2];
      END; (* if count2 *)
  UNTIL count1 >= patternLen;

  sCount := startIndex;
  pCount := 0;
  REPEAT
    IF (pCount = -1) (* restart flag *)
        OR (stringToSearch [sCount] = pattern [pCount]) (* matching *)
      THEN
        INC(sCount); (* move to next pos in source *)
        INC(pCount); (* if was -1, now 0 for restart *)
      ELSE  (* take backup no restart flag or match same source pos *)
        pCount := backup [pCount];
      END; (* if pCount *)
  UNTIL (pCount >= patternLen) OR (sCount >= targetLength);
  IF pCount >= patternLen
    THEN  (* got it so report back *)
      posOfPattern := sCount - patternLen;
      patternFound := TRUE;
    ELSE  (* failed *)
      posOfPattern := targetLength;
      patternFound := FALSE;
    END; (* if pCount *)
END FindNext;

Unless the pattern being searched for has considerable repetition, this method may not yield much better results than the brute force method in practice. Yet another method that provides much faster searching capability is the Boyer-Moore algorithm; however, it will not be discussed here as it is rather more complicated.


Contents