11.5 Efficiency in Large Programs

Students who try running the knight's tour of the last section on some machines with a full eight by eight board will discover that a program does not have to be large to be slow. On the other hand, a program does not have to be slow just because it is large. Substantial speed improvements may be available to the programmer who pays careful attention to detail. This section contains some observations about the running speed of programs that are a little more advanced than those mentioned in passing in chapter 3. Not all the solutions suggested here are possible in every implementation.

11.5.1 Controlling Run-Time Checking

When any assignment to a variable is made, a Modula-2 program must check at run time to ensure that the proposed value is within the legitimate range for that particular entity. If not, a "range error" exception is reported and the program terminates.

By this time, the student should be sophisticated enough to be trapping all keyboard and other input and checking it for the appropriate range before assigning it to variables. This and other methods help to produce idiot-proofed programs, so that in the final running version range errors either cannot occur at all or, if they somehow do anyway, are handled gracefully in an EXCEPT clause.
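For instance, here is a minimal sketch of such input trapping, assuming the ISO library modules STextIO and SWholeIO; the module name GetMark and the subrange are invented for illustration, and the sketch does not attempt to deal with entirely non-numeric input:

MODULE GetMark;

FROM STextIO IMPORT WriteString, SkipLine;
FROM SWholeIO IMPORT ReadCard;

TYPE
  MarkRange = [0 .. 100];

VAR
  num: CARDINAL;
  mark: MarkRange;

BEGIN
  num := 101; (* out of range until a valid read occurs *)
  REPEAT
    WriteString ("Enter a mark in the range 0 to 100: ");
    ReadCard (num);
    SkipLine
  UNTIL num <= 100;
  mark := num (* safe; a range error cannot now occur *)
END GetMark.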

Once this is done, the range-checking code built into the program by the compiler may no longer be necessary; indeed, it slows things down considerably, as far more checks are being done than are required.

Most systems will therefore allow the compiler's generation of range checking code to be turned off. When permitted, this is commonly done in one of two ways:

1. At the outermost level of the system, when invoking the compiler, there may be an option one can choose that disables the generation of range checking code for all compilations done while this setting is in effect. This could be via an easy-to-set menu choice, or it could require some cryptic code like "\r" or "-norange" or "check off" to be given on the same line as the command to compile the target file. This method has the advantage of simplicity, but the disadvantage that it applies to all the code compiled; it does not permit the generation of checking routines to be turned off only for selected portions or only for selected situations.

2. Compiler options (pragmas) may be present that allow range or other checking to be turned off and on for just those speed-sensitive areas of the program where performance is being adversely affected. Such an option takes the form used in the last section. Options could include the means to turn index checking, range checking (for subranges) and overflow checking on and off for selected portions of the code. Some implementations allow conditions to be entered in a pragma and then acted upon in another pragma.

Example:

MODULE Versatile;

<* dangerous := TRUE *> (* at top where easy to change *)

(* code to obtain and check value against range *);

<* RangeCheck := FALSE *>
(* code in which a range error cannot occur *)
<* RangeCheck := TRUE *>

<* PUSH IndexCheck *>  (* save current value *)

(* code to obtain and check index *);

<* IF dangerous
  THEN
    IndexCheck := FALSE
  END *>

(* code in which an index error cannot occur *)

<* PULL IndexCheck *> (* restore old value *)

END Versatile.

11.5.2 Compiling For Specified Environments

Some implementations also allow the compiler to generate code for special circumstances. For instance, a given operating system may run on one of several different kinds of machine. Activities such as numerical computations may be done with generic calls to a common software module that is present on all varieties of machine being compiled for. If the specific target machine being compiled for is known to have a numeric co-processor, the compiler may be able to produce specialized code for that machine. The advantage is that the specialized code will run much faster on the target machine. The disadvantage is that it might not run at all on other machines in the same generic group.

Special cases like these may be handled either with command line directives such as "\80387" or "-coprocessor", or with pragmas similar to:

<* coprocessor := TRUE *>
<* powerPC := TRUE *>

The programmer must decide whether the speed gain is worth the extra trouble and the specialization of the code to only certain models or situations.

11.5.3 Fine Tuning Loops

As observed in chapter 3, much of a program's time is generally spent on the execution of loops. If a programmer knows how to do it, such sections of code may be sped up in one or more of the following ways:

(i) by replacing loops with closed forms where possible

This possibility was discussed in section 3.8, and the material there should be reviewed.
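For instance (a sketch only; the procedure names SumLoop and SumClosed are invented here), a loop that sums the integers from 1 to n can be replaced by the closed form n(n+1)/2, which takes the same time no matter how large n is:

PROCEDURE SumLoop (n: CARDINAL): CARDINAL;
VAR
  sum, count: CARDINAL;
BEGIN
  sum := 0;
  FOR count := 1 TO n
    DO
      sum := sum + count
    END;
  RETURN sum
END SumLoop;

PROCEDURE SumClosed (n: CARDINAL): CARDINAL;
(* same answer in one step; overflow for very large n is ignored here *)
BEGIN
  RETURN (n * (n + 1)) DIV 2
END SumClosed;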

(ii) by reducing the number of iterations through the loop

For instance, consider the simple problem of raising a real number r to a CARDINAL power e. This could be programmed as:

PROCEDURE Power (r: REAL; e: CARDINAL): REAL;
VAR
  ans: REAL;
BEGIN
  ans := 1.0;
  WHILE e > 0 
    DO
      ans := ans * r;
      DEC (e)
    END;
  RETURN ans
END Power;

This perfectly sensible solution steps through the loop e times to produce the answer, but one could do much better. Observe that once one has r^2, one can find r^4 as r^2 * r^2 in one fewer step than by finding r^3 first. Then r^8 is r^4 * r^4 in one step rather than four. The key is that e is (mostly) halved at each step, rather than reduced by one each time, so that far fewer steps are needed. Here is a more efficient alternative:

PROCEDURE Pwr (r: REAL; e: CARDINAL): REAL;
BEGIN
  IF e = 0
    THEN 
      RETURN 1.0
    ELSIF e = 1 THEN
      RETURN r
    ELSIF ODD (e) THEN
      RETURN r * Pwr (r, e - 1)
    ELSE
      RETURN Pwr (r * r, e DIV 2)
    END
END Pwr;

Notice that this is formulated recursively. To trace the operation of the procedure Pwr, note what happens on each entry to and exit from Pwr, starting, say, with

r = 2.0 and e = 11

Level    e on entry    returns
  1          11        2.0 * Pwr (2.0, 10)
  2          10        Pwr (4.0, 5)
  3           5        4.0 * Pwr (4.0, 4)
  4           4        Pwr (16.0, 2)
  5           2        Pwr (256.0, 1)
  6           1        256.0

The returns, reading back up from the bottom, are: 256.0, 256.0, 256.0, 1024.0, 1024.0, and 2048.0. The procedure is entered six times instead of the eleven iterations a straightforward loop would take. Consider a second example, say

r = 2.0 and e = 17

Level    e on entry    returns
  1          17        2.0 * Pwr (2.0, 16)
  2          16        Pwr (4.0, 8)
  3           8        Pwr (16.0, 4)
  4           4        Pwr (256.0, 2)
  5           2        Pwr (65536.0, 1)
  6           1        65536.0

Here, the last entry produces 65536.0, which is passed back up the chain to the first step, where it is doubled and 131072.0 is returned as the final result. The savings (6 steps instead of 17) are rather more dramatic, and the difference becomes larger as e increases. For instance, for e equal to 1000, the successive values of e on entry are 1000, 500, 250, 125, 124, 62, 31, 30, 15, 14, 7, 6, 3, 2, and 1, for a total of 15 steps instead of 1000.
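Since each recursive call carries some overhead of its own, the same halving idea may also be written as a loop. Here is a sketch (the name PwrIter is invented here):

PROCEDURE PwrIter (r: REAL; e: CARDINAL): REAL;
VAR
  ans: REAL;
BEGIN
  ans := 1.0;
  WHILE e > 0
    DO
      IF ODD (e)
        THEN (* peel off one factor of r *)
          ans := ans * r;
          DEC (e)
        ELSE (* square r and halve e *)
          r := r * r;
          e := e DIV 2
        END
    END;
  RETURN ans
END PwrIter;

The number of times through this loop is essentially the same as the number of entries to Pwr, but without the cost of the procedure calls.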

(iii) by replacing loops written in Modula-2 with machine language instructions.

This solution is not encouraged in many versions of Modula-2, as the language is designed with enough low level facilities to make machine language intervention unnecessary (supposedly) in most cases. Neither is there any standard way to incorporate machine language or assembler instructions directly in a Modula-2 program, though a non-standard extension may be available, usually in the form of a CODE procedure or some other means of generating in-line native code. Some programming environments will have such facilities, but their use varies from machine to machine, and those who believe they can write machine language more efficiently than their compiler can will have to consult the implementation manuals, because the ISO committee declined to standardize any such facility.

(iv) by moving code that repeats an identical calculation outside the loop

This was discussed briefly in chapter 3 as well, but deserves a more extended treatment, as this kind of problem is not often caused by a simple statement misplacement, but is usually more subtle.

If code contains a line such as

FOR row := 1 TO HIGH (array)
  DO
    some stuff
  END;

the calculation of the value of HIGH (array) is done only once, at the beginning of the loop, and the current value of row is compared to the previously computed value of HIGH (array) on each iteration. Thus, there is no advantage to computing HIGH (array) and saving it in a variable before entering the loop. However, it is easy to forget that there is such an advantage when this loop is nested inside another one. Thus:

FOR row := 1 TO HIGH (array)
  DO
    FOR col := 1 TO HIGH (array [0])
      DO
       some stuff
      END;
  END;

is not efficient, because the calculation of HIGH (array [0]) for the inner loop is done on each iteration of the outer loop. This code should be replaced by:

colMax := HIGH (array [0]);
FOR row := 1 TO HIGH (array)
  DO
    FOR col := 1 TO colMax
      DO
       some stuff
      END;
  END;

Similarly, code such as

FOR count := min TO max
  DO
    IF num = LENGTH (str)
      THEN
        do something
      END;
  END;

causes the (possibly lengthy) computation of LENGTH (str) on each pass and should be replaced by:

len := LENGTH (str);
FOR count := min TO max
  DO
    IF num = len
      THEN
        do something
      END;
  END;

Similar considerations may apply to the results of any procedure call (whether a pervasive procedure or one defined by the user) contained inside a loop. Moving the call to a point prior to the loop may save a great deal of unnecessary machine time.
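For instance (a sketch; Expensive, bound, and item are invented names), a call whose result cannot change while the loop runs belongs before it:

bound := Expensive (item); (* called once, outside the loop *)
FOR count := min TO max
  DO
    do something with bound
  END;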

(v) by using the most efficient language construction available

Most of the examples of this were noted previously. One is the passing of a large data structure as a VAR parameter rather than as a value parameter, even when the procedure will not change it. While not usually recommended, this might be the best approach when the parameter is a large array, for the copying needed to use value parameters might consume an excessive amount of running time.
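Here is a minimal sketch (the name Mean and its argument are invented); the open array is declared VAR purely so that only a reference is passed, even though the procedure never modifies it:

PROCEDURE Mean (VAR data: ARRAY OF REAL): REAL;
(* VAR: no copy of the possibly large array is made *)
VAR
  sum: REAL;
  index: CARDINAL;
BEGIN
  sum := 0.0;
  FOR index := 0 TO HIGH (data)
    DO
      sum := sum + data [index]
    END;
  RETURN sum / FLOAT (HIGH (data) + 1)
END Mean;

The price of doing this is the loss of the protection that value parameters give the caller, for a careless assignment to data would now change the original array.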

(vi) by unrolling some loops

A loop such as

WHILE count < 1000
  DO
    statement sequence;
    INC (count)
  END;

may in some cases run much faster if the loop is unrolled somewhat by putting two steps inside the loop:

WHILE count < 1000
  DO
    statement sequence;
    INC (count);
    IF count < 1000
      THEN
        statement sequence; (* same one as above *)
        INC (count)
      END
  END;

because the IF inside the loop may execute more efficiently than the code for re-entering the loop itself, particularly if the statement sequence is long, the number of iterations high, or the loop condition complex. Whether this is so or not is highly implementation dependent, but it is worth trying when it is necessary to squeeze out every drop of speed.

(vii) by buying an optimizing compiler

Some compilers are capable of examining the code the programmer has written for ways to improve its efficiency, and of generating better code accordingly. While this approach can produce some gains, the programmer should not rely on it to turn bad code into good, for the results are likely to be mixed. (One may get fast code that is still bad.)

11.5.4 Linking, Program Libraries and Speed

Implementations follow a variety of strategies when it comes to preparing the final running executable code. Usually, a file containing only the code from the program module is generated when the compiler is run. In some systems, there is an option to run this code at that point, and when this is done, all the segments of code from the various libraries imported from must be loaded before the program is run. This may be rather slow. If the programmer is prepared to take a little more time initially to link those code segments into a single file (called an executable file), each run may proceed more quickly. In some systems, no execution is possible until this linking step is performed.

However, there may also be a facility called a library manager that allows separate library modules to be collected into a single file known as a library. Implementations vary greatly, but where library managers exist, they often allow the linking step to be performed more quickly than otherwise, because only one file needs to be searched for the various modules a program needs, rather than several.

A collection of the code of two or more modules into a single file for use by one or more programs is called a program library.

Sometimes the concept is extended to incorporate the program module itself into the program library; in that case the library manager essentially serves the role of linker, and the finished program library is the executable file.

If an option is available, it may be wise not to link until necessary, for linking duplicates the code from the various libraries into the executable file. However, the final program needs these libraries, and if it is to be distributed to others who may not have them, it will probably be necessary to link it eventually.

Also, note that some linkers are better than others. The least efficient will incorporate all the code from each imported library into the final product, whether it is all used or not. The more efficient can not only decline to copy unnecessary code, but can even arrange the various parts it does use for optimal efficiency. The best of such tools are known as optimizing linkers.

11.5.5 Efficiency--A Summary

A budding programmer might object that some of these considerations and techniques are too abstract and rather full of trickery. The only reasonable response to such an objection would be to point out that the programmer is supposed to be smarter than the machine--if short cuts exist, it is the more intelligent of the two partners who is going to discover them.

While on the subject of efficiency, one should note the following rather useful maxim:

Do not become overly concerned with efficiency because correctness is more important than speed.

--a special case of the more general law:

If it works, don't fix it.

The illustrations provided here show that a little innovation can go a long way if speed is a problem. A minimal amount of effort may improve program speed by 50% or more. However, at some point there will be diminishing returns--a further doubling of programming time may produce no more than a 10% gain in efficiency. So, a programmer needs to be somewhat concerned with efficiency, but also aware of the trade-off between programming time and running time. Some effort may produce very large gains, but much additional effort only a small gain.

