Instruction Scheduling

April 2021

The execution time of a sequence of operations depends heavily on the order in which they execute. Instruction scheduling is the process of reordering the operations to improve running time. Instruction scheduling is hard; as discussed below, the underlying problem is NP-complete.

The instruction scheduling problem is defined over the dependence graph D of a basic block. The nodes in D correspond to operations, and the edges represent the flow of values between them. Each node has two attributes: optype, which specifies the functional unit on which the operation executes, and delay, the number of cycles the operation takes to complete.

Given a dependence graph, a schedule S maps each node in D to a nonnegative integer that denotes the cycle in which it should be issued, assuming that the first operation issues in cycle 1. The i-th instruction is just the set of operations that issue in cycle i.

A schedule must meet three constraints:

  1. No operation can be issued before execution starts.
  2. An operation cannot issue until its operands are available, that is, until the operations that define them have completed.
  3. Each instruction contains no more operations of each type t than the target machine can issue in a cycle.
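The three constraints can be checked mechanically. Below is a minimal sketch in Python under an assumed (hypothetical) representation: each operation is a dict with an optype and a delay, the dependence graph maps each operation to the operations whose results it uses, and a schedule maps each operation to its issue cycle.

```python
def is_valid_schedule(schedule, ops, uses, issue_width):
    """Check the three scheduling constraints.

    schedule    : op name -> issue cycle (the first cycle is 1)
    ops         : op name -> {'optype': ..., 'delay': ...}
    uses        : op name -> list of op names whose values it uses
    issue_width : optype -> max operations of that type per cycle
    """
    # 1. No operation issues before execution starts (cycle 1).
    if any(cycle < 1 for cycle in schedule.values()):
        return False
    # 2. An operation cannot issue until its operands are available.
    for op, cycle in schedule.items():
        for src in uses[op]:
            if cycle < schedule[src] + ops[src]['delay']:
                return False
    # 3. No instruction exceeds the per-cycle issue limit for any optype.
    per_cycle = {}
    for op, cycle in schedule.items():
        t = ops[op]['optype']
        per_cycle[(cycle, t)] = per_cycle.get((cycle, t), 0) + 1
        if per_cycle[(cycle, t)] > issue_width[t]:
            return False
    return True
```

For example, if a load issues in cycle 1 with a delay of 3, an operation using its result cannot issue before cycle 4.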

An operation y is antidependent on x if x precedes y and y defines a value that x uses; issuing y too early would overwrite a value x still needs. Antidependences are handled either by respecting them in the final schedule, which limits reordering, or by renaming values to eliminate them, which may increase register demand and cause spills.
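Renaming can be sketched as a single pass that gives every definition a fresh name and rewrites later uses accordingly. The representation below is an assumption for illustration: a block is a list of (target, opcode, sources) triples.

```python
def rename(block):
    """Rename values so no target is ever redefined, removing antidependences.

    block : list of (target, opcode, sources) triples, in execution order.
    Returns an equivalent block in which every definition gets a fresh name.
    """
    counter = {}   # original name -> number of definitions seen so far
    current = {}   # original name -> freshest renamed version
    out = []
    for target, opcode, sources in block:
        # Rewrite each source to the most recent definition of that name.
        srcs = [current.get(s, s) for s in sources]
        counter[target] = counter.get(target, 0) + 1
        fresh = f"{target}.{counter[target]}"
        current[target] = fresh
        out.append((fresh, opcode, srcs))
    return out
```

After renaming, a second load that reused r1 becomes a definition of a new name (say r1.2), so it no longer has to wait for every earlier use of r1; the cost is one more live value competing for registers.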

Instruction scheduling is NP-complete. In practice, approximate solutions to scheduling problems are generated using greedy heuristics. Most scheduling algorithms are based on one family of heuristic techniques called list scheduling.

List Scheduling

List scheduling is a greedy, heuristic approach to scheduling the operations in a basic block (it can be extended to multi-block sequences). The basic outline of the algorithm is:

  1. Rename values to avoid antidependences.
  2. Build a dependence graph.
  3. Assign priorities to each operation.
  4. Iteratively select an operation and schedule it.

The algorithm uses a cycle counter and two lists to track operations. The Ready list holds all the operations that can execute in the current cycle. The Active list contains operations that were issued earlier but have not yet finished.

At each step, the algorithm accounts for any operations that completed in the previous cycle by examining Active; any of their successors in D whose operands are now all available move to Ready. It then schedules one operation from Ready for the current cycle and increments the cycle counter (assuming a processor with a single functional unit).
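The loop above can be sketched in Python for the single-functional-unit case. The representation is an assumption: operations map to their delays, the dependence graph is given as a uses map, and a caller-supplied priority function ranks the Ready list.

```python
import heapq

def list_schedule(ops, uses, priority):
    """Greedy list scheduling for one basic block, single functional unit.

    ops      : op -> delay in cycles
    uses     : op -> list of ops whose results it needs (dependence graph)
    priority : op -> rank; higher-ranked operations are scheduled first
    Returns op -> issue cycle, with the first operation issuing in cycle 1.
    """
    remaining = {op: len(uses[op]) for op in ops}   # unsatisfied operand count
    dependents = {op: [] for op in ops}             # reverse edges of D
    for op, srcs in uses.items():
        for s in srcs:
            dependents[s].append(op)

    # Ready holds issuable ops as (-rank, op) so heapq pops the highest rank.
    ready = [(-priority(op), op) for op in ops if remaining[op] == 0]
    heapq.heapify(ready)
    active = []                                     # (completion cycle, op)
    schedule, cycle = {}, 1
    while ready or active:
        # Retire operations that finish by this cycle; wake their dependents.
        while active and active[0][0] <= cycle:
            _, done = heapq.heappop(active)
            for d in dependents[done]:
                remaining[d] -= 1
                if remaining[d] == 0:
                    heapq.heappush(ready, (-priority(d), d))
        if ready:
            _, op = heapq.heappop(ready)
            schedule[op] = cycle
            heapq.heappush(active, (cycle + ops[op], op))
        cycle += 1
    return schedule
```

When Ready is empty but Active is not, the scheduler simply idles for a cycle, which is where long-latency operations show up as stalls in the final schedule.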

The quality of the schedule produced depends primarily on the mechanism used to pick an operation from Ready. Several priority schemes are used in practice; none dominates the others in overall schedule quality.

  1. Latency-weighted path length: operations on the critical path are scheduled first.
  2. Total number of descendants: operations that compute values needed by many other nodes are scheduled first.
  3. Delay: long-latency operations are scheduled first.
  4. Number of operands for which this is the last use: scheduling such operations early moves last uses closer to their definitions, reducing demand for registers.
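The first scheme above can be computed by a backward traversal of the dependence graph. A minimal sketch, reusing the assumed ops/uses representation, where each operation's rank is its delay plus the longest latency-weighted path through its dependents:

```python
def critical_path_priority(ops, uses):
    """Latency-weighted path length from each operation to the block's end.

    ops  : op -> delay in cycles
    uses : op -> list of ops whose results it needs (dependence graph)
    Returns op -> path length; higher values are more critical.
    """
    dependents = {op: [] for op in ops}             # reverse edges of D
    for op, srcs in uses.items():
        for s in srcs:
            dependents[s].append(op)

    rank = {}
    def visit(op):
        # The graph of a basic block is a DAG, so the recursion terminates.
        if op not in rank:
            rank[op] = ops[op] + max(
                (visit(d) for d in dependents[op]), default=0)
        return rank[op]
    for op in ops:
        visit(op)
    return rank
```

For a load with delay 3 whose result feeds a one-cycle add at the end of the block, the load's rank is 4 and the add's is 1, so the load is issued as early as possible.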