13.1 Searching

The type ARRAY has been used extensively in several chapters thus far, to store and manipulate itemized information of the same type. Now suppose one goes beyond mere storage and asks questions about retrieval systems for data that has been organized this way. Specifically, suppose one has an array of some type of data and wishes to discover whether or not the value of some separately held item of that same type is already in the array, and if so, at what position. (i.e. What is its index?)

13.1.1 Linear (Sequential) Searches

The most straightforward solution to this problem is to start with the lowest array index, and compare the value of the separately held item in question systematically to every item in the array to see if it is equal to one of them.

The search of a list of items in consecutive order by examining each item in the list in turn is called a linear or, sequential search.

A linear search does not depend for its success on the items being searched already being in any particular order. The value being sought will either be equal to one of the ones already in the array, or it will not. If it is, the index number of the position at which it is found can be returned. Otherwise, some indication that the search has failed must be passed back. Here is a procedure to do just that:

PROCEDURE Find (item: ItemType;
    offset, num : CARDINAL;
    source: ARRAY OF ItemType;
    VAR found: BOOLEAN; VAR pos : CARDINAL);

(* searches for the item in the array source from offset to end
Pre: offset is the number of positions after the first at which the search is to begin; num is the number of positions after the offset that are to be examined. 
offset + num  <= HIGH ( source)
Post: if the value is found in the array, the cardinal returned is the number of positions after start where it is found and found is set to true.  If the value is not found, the cardinal returned is undefined and found is set to false *)

BEGIN
  pos := offset;
  found := FALSE;
  WHILE (num > 0) AND NOT found
    DO
      IF item = source [pos]
        THEN
          found := TRUE;
        ELSE
          INC (pos); (* check next one *)
          DEC (num); (* decrease number remaining *)
        END (* if *)
    END; (* while *)
END Find;

Because an open array parameter is employed in this procedure, it can be used for any size array. However, since not all the positions of the array may be in use, and to allow some to be skipped at the beginning, the starting point relative to the first item, and the number of items to examine are both passed. If the item value being searched for is found, its relative index from offset will be returned, along with a boolean flag set to true. If the entire specified range of the array is searched without success, the flag is set to false and the value in pos will be invalid. (The logic implemented here will actually result in pos being set to offset + num, but the specifications could have been met in some other manner.)

Notice that this procedure cannot be used as is to find an element of any type whatever in an array of that type, just by changing a few names. It will work as it is presented here only for values of those types for which the relation "=" is meaningful. That is, it works for integer, cardinal, real, character , and boolean types, but not for arrays of these.

However, if the line where the comparison is done is replaced by:

IF Compare (item, source [count]) = equal;

where

TYPE
  CompareResult = (less, equal, greater);

and

PROCEDURE Compare (item1, item2 : ItemType) : CompareResult;

are suitable (possibly imported) procedure for making such comparisons on the data type in question, then this procedure will approach universality. It will need only the specific type name inserted to create a specific instance of it to work in any situation.

To cast this specifically as searching an array of strings to find a specific string, and assuming that the enclosing module contains the lines:

FROM Strings IMPORT
  CompareResult, Compare;

TYPE
  String : ARRAY [0 .. maxSize] OF CHAR;

yields (with the comments removed to avoid repetition:)

PROCEDURE Find (item: String;
    offset, num : CARDINAL;
    source: ARRAY OF String;
    VAR found: BOOLEAN; VAR pos : CARDINAL);

BEGIN
  pos := offset;
  found := FALSE;
  WHILE (num > 0) AND NOT found
    DO
      IF Compare (item, source [pos]) = equal;
        THEN
          found := TRUE;
        ELSE
          INC (pos); (* check next one *)
          DEC (num); (* decrease number remaining *)
        END (* if *)
    END; (* while *)
END Find;

13.1.2 Binary Searches

The linear search algorithm works quite well if one is looking through rather small lists. Assuming that of the total number num of items in the part of the list being searched, each is equally likely to be the target, a linear search, will on average make num/2 comparisons in those cases where the item is found, and num comparisons in those cases where the search fails. The overall average will depend on how often one searches for something that is not there, but will in any case be proportional to the number num of items in the search range of the array.

The idea that the worst-case time taken for an algorithm is proportional to some function f of the size n of the problem is expressed by saying that the algorithm is order f(n). Notation: O(f (n))

A linear search is O(n). One can do very much better than this, however, if the array items through which one is searching have been sorted beforehand. Suppose they are arranged from smallest to largest. One can now make more intelligent comparisons that allow large portions of the array to be ignored as unsuitable for searching without first making comparisons with every item in those portions.

This is done by making the first comparison with the middle item in the search range. If the target value is greater than this, the search need subsequently consider only the half of the range above this point; if it is less than this one, then only the half of the range below this point. In either situation, the algorithm ignores as unnecessary for any further searching half of the items. Of course, if the comparison produced equality, the match has been found and the routine can exit.

Applying the same principle repeatedly to the remaining sub-range not already eliminated, at each step discard half of the rest until either there is a match or nothing remains with which to work. The latter condition exists if the counter on the top of the range is decremented below the counter on the bottom of the range. Since this could mean that the top counter becomes -1, if the last item looked at was item zero, the counters should be integers.

Successive partition of a sorted range of items into halves to find a specified value is called a binary search.

Given an array [0 .. n-1] of items, and a value to search for:

Binary search algorithm:
  bottom = 0
  top = n - 1
  found = false
  repeat
    search position = (top + bottom) DIV 2;
      if value < source [search position] then   (* ignore top half *)
        top = search position - 1;
        (* by setting new top just below middle *)
      elsif value > source [search position] then
           (* ignore bottom half *)
        bottom := search position + 1
        (* set new bottom just above middle *)
      else   (* must be equal *)
        found = true;
      end (* if *)
  until found or (top < bottom);

A numeric example using this algorithm is stepped through below:

Example:

Given the following sorted list of numbers

11, 21, 46, 52, 55, 67, 79, 81, 86, 97, 99

(i) search for the value 81.

Solution:

The algorithm proceeds as follows:

1. The terms are numbered from 0 .. 10. The first search position is 5, at which is found the number 67. The value 81 is compared to it and 81 is greater.
2. Next the items from position 6 through 10 are considered. These are 79, 81, 86, 97, 99. Here, the search position is 8 at which is the number 86, and 81 is less than this.
3. Thus the list is reduced to items 6 through 7. The remaining list is 79, 81. The search position item is the first one holding 79, and 81 is greater than this.
4. All that remains is the list with item 6 or 81 in it. The one wanted is equal to this, and so has been found. If it were not, then at the next step the bottom counter would be greater than the top counter; there would be no list left, and the search would fail.

(ii) search for 50 in the list.

The successive item numbers at the beginning of each step, the lists to be checked, and the search positions and results for each step are:

range	        list	                pos	   result	     new range
1) [0 .. 10]	the whole list;	        5 (67)	   less	             [0 .. 4]
2) [0 .. 4]	11, 21, 46, 52, 55;	2 (46)	   greater	     [3 .. 4]
3) [3 .. 4]	52, 55;	                3 (52)	   less	             [4 .. 3] fails

(ii) search for 10 in the list.

range	        list	                pos	   result	     new range
1) [0 .. 10]	the whole list;	        5 (67)	   less	             [0 .. 4]
2) [0 .. 4]	11, 21, 46, 52, 55;	2 (46)	   less	             [0 .. 1]
3) [0 .. 1]	11, 21;	                0 (11)	   less	             [0 .. -1] fails

Having now hand traced the algorithm to make sure that it works, it is time to translate this into a working procedure. As in the linear search, the binary one returns both the index, and a flag to indicate the validity of the data. To maintain compatibility with the last search procedure, the parameters are set out in the same manner, with an offset at which to begin within the array, and a maximum number of positions to search beyond that point. In addition, the algorithm has been reformulated in a while loop, and to ensure that the condition top < bottom for exit does not mean either top < 0 (integer type needed) or bottom > HIGH (array) (bad index error) the loop is exited as soon as top = bottom and then a last check is done to probe the item at that position.

PROCEDURE FindB (item: ItemType;
    offset, num : CARDINAL;
    source: ARRAY OF ItemType;
    VAR found: BOOLEAN; VAR pos : CARDINAL);

(* searches for the item in the array source from offset to end
Pre: offset is the number of positions after the first at which the search is to begin; num is the number of positions after the offset that are to be examined. 
offset + num  <= HIGH ( source)
Post: if the value is found in the array, the cardinal returned is the number of positions after start where it is found and found is set to true.  If the value is not found, the cardinal returned is undefined and found is set to false*)

VAR
  bottom, top, max : CARDINAL;

BEGIN
  bottom := offset;
  max := bottom + num;
  top := max;
  found := FALSE;

  WHILE (bottom < top) AND NOT found
    DO
      pos := (top + bottom) DIV 2;
        IF (item < source [pos])
          THEN   (* ignore top half *)
            top := pos - 1;
          ELSIF (item > source [pos])  THEN
            bottom := pos + 1   (* ignore bottom half *)
          ELSE  (* item = source [pos]  *)
            found := TRUE;
          END; (* if *)
    IF NOT found (* but bottom = top now *)
    THEN (* have not yet probed at bottom/top *)
      pos := bottom;
    IF item = source [pos] 
      THEN
        found := TRUE;
      END;
    END;
  END; (* while *)

END FindB;

Just how much better is the binary search than the linear one? In the list of eleven items in the range [0 .. 10] a maximum of four comparisons were required, though the actual number could be less if the value is found. In the linear case, the average number of comparisons would be 5.5 and the maximum number would be eleven. The number of linear comparisons depends on n, the number of items, (it is O(n)) but the number of binary method comparisons depends on the number of times t that one could take half of n before reaching a list length of one. This number is the same as the number of times one would have to double the number one to get n. Writing 2t = n, and taking logarithms with base two on both sides yields t = log2 (n).

A binary search is O(log2 (n)).

For an array of 1000 items, one could be unlucky enough to require 1000 comparisons in a linear search (though the average would be 500), whereas the maximum number of comparisons to search the same data in binary fashion is ten. The savings become truly enormous with large lists. Note however, that the array must be sorted before being searched in binary fashion, and that the sorting will also take time.

However, for smaller arrays, the extra overhead involved in the binary search will be more important than the savings in making fewer comparisons. When the two methods are both used on the same arrays and the results compared, it turns out that a linear search is often more efficient below about 100 items. Since the time ratio between the extra arithmetic on the one hand and the extra comparisons on the other will vary depending on the architecture of the computer, it is not possible to state that this boundary is accurate for all cases. (The student is invited to find a more accurate limit as one of the exercises.)

It is also worth noting that the binary search has two comparisons in the IF statement in the inner loop, and succeeds only when both of those comparisons fail. Moreover, an individual probe at a position pos fails to find the item sought more often than it succeeds. One might therefore try to improve the algorithm by having only one comparison inside the inner loop, ensuring that the search converges to a single item and then exits. The test for equality on that single item can then be done once, outside the loop. These ideas can be expressed as follows:

PROCEDURE FindB1 (item: ItemType;
    offset, num : CARDINAL;
    source: ARRAY OF ItemType;
    VAR found: BOOLEAN; VAR pos : CARDINAL);

(* same predicates *)
VAR
  bottom, top : CARDINAL;

BEGIN
  bottom := offset;
  top := offset + num;
  found := FALSE;

  WHILE (bottom < top)
    DO
      pos := (top + bottom + 1) DIV 2;
        IF item < source [pos]
          THEN   (* ignore top half *)
            top := pos - 1;
          ELSE
            bottom := pos   (* ignore bottom half *)
          END; (* if *)
    END; (* while *)
  pos := bottom;
  IF item = source [pos]
    THEN
      found := TRUE;
    END;
END FindB1;

The improvement in the inner loop that results from having only one comparison instead of two is partially offset by having to use the expression (top + bottom + 1) instead of just (top + bottom). It is also reduced by the fact that the loop will now not exit just because it happens to probe the right place. It only exits when the probe range has been reduced to one item. One might expect that any gains achieved are likely to be less with short lists, where the probability of hitting the item with a probe is higher (if it is there). On the other hand, for unsuccessful searches, the second method is likely to have some speed advantage.

For the sake of completeness, the binary search is also presented below in a more generic form that assumes a Compare procedure is visible in the same scope.

PROCEDURE FindBg (item: ItemType;
    offset, num : CARDINAL;
    source: ARRAY OF ItemType;
    VAR found: BOOLEAN; VAR pos : CARDINAL);

(* same predicates *)
VAR
  bottom, top : CARDINAL;

BEGIN
  bottom := offset;
  top := offset + num;
  found := FALSE;

  WHILE (bottom < top)
    DO
      pos := (top + bottom + 1) DIV 2;
        IF Compare (item, source [pos]) = less
          THEN   (* ignore top half *)
            top := pos - 1;
          ELSE
            bottom := pos   (* ignore bottom half *)
          END; (* if *)
    END; (* while *)
  pos := bottom;
  IF item = source [pos]
    THEN
      found := TRUE;
    END;
END FindBg;

Several variations on the theme are possible. If the user does not mind using integer variables for the indexing and has an extra unused array position available, the indices can be allowed to run past one another without harm being done. In that event, and since a Compare procedure is ternary-valued, a CASE statement in the inner loop could be quite efficient.


Contents