Nicholas Merriam, Author of Embedded Answers

Tail-call optimization (1)

Tail-call optimization has some features that make it particularly interesting to discuss. Most software developers

  • have heard of tail-call optimization,
  • do not really understand the benefits of tail-call optimization, and
  • do not at all appreciate how widely it can be used.

The objective of these three articles on tail-call optimization is to help you judge when its benefits are relevant to your project and to recognize opportunities to apply it.

What is tail-call optimization?

Imagine we have three functions, a, b and c, where a calls b and the last action of function b is to call function c. The un-optimized code for b would call c, c would return to b, and b would then clean up its stack frame and return to a. With tail-call optimization, b cleans up its stack frame first and then jumps to c with the Link Register (or equivalent) still pointing back to a, so that c returns not to b but directly to a.

Note that GCC refers to tail-call optimization as sibling call optimization because the behaviour is close to what you would get if b and c were both called directly by (children of) a.
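
To make this concrete, here is a minimal C sketch of the pattern (the functions a, b and c are placeholders and the arithmetic is arbitrary). With GCC, sibling call optimization is normally enabled at -O2 and above via -foptimize-sibling-calls, so the call to c inside b can be compiled as a jump rather than a call; the exact code generated depends, of course, on the target and the compiler version.

    #include <stdio.h>

    static int c(int x)
    {
        return x * 2;
    }

    static int b(int x)
    {
        int y = x + 1;   /* some work done by b */
        /* Tail position: nothing is left for b to do after c, so
         * the call below can be turned into a jump and c will
         * return directly to b's caller, a.                      */
        return c(y);
    }

    static int a(void)
    {
        return b(20);    /* with the optimization, c returns here */
    }

    int main(void)
    {
        printf("%d\n", a());   /* prints 42 */
        return 0;
    }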

What are the benefits of tail-call optimization?

There are two quite separate benefits of tail-call optimization.

  1. The total stack memory is reduced, because the stack frame of c does not nest on top of the stack frame of b. Writing a, b and c for the frame sizes, the total reduces from a + b + c to a + max(b, c).
  2. The total execution time is reduced, because on return from c, rather than jumping from c to b and then from b to a, c returns directly to a. The benefit here is larger than you might expect for saving one jump instruction because jumps are relatively expensive for modern microprocessors.

Embedded non-functional requirements may forbid recursion, in which case the following is of mostly academic interest. Nevertheless, a special case of tail-call optimization occurs with tail-call recursion, where the function's tail-call is to itself. When the tail-call is replaced with a jump, the generated code behaves exactly like a loop. A corollary is that tail-call recursion can also be replaced with a loop at the source code level: code that is naturally constructed using recursion can be mechanically transformed into a loop to allow its inclusion in an embedded project.
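
As an illustration of that mechanical transformation (a minimal sketch; the function and its names are my own, not taken from any particular project), here is a tail-recursive sum over an array alongside the loop it can be rewritten as:

    #include <stddef.h>

    /* Tail-recursive form: the recursive call is the last action,
     * so a compiler performing tail-call optimization replaces it
     * with a jump back to the start of the function.             */
    static long sum_recursive(const int *data, size_t n, long acc)
    {
        if (n == 0u) {
            return acc;
        }
        return sum_recursive(data + 1, n - 1u, acc + data[0]);
    }

    /* The same computation after the source-level transformation:
     * the tail-call becomes an update of the parameters followed
     * by a jump back to the top, i.e. a loop.                    */
    static long sum_loop(const int *data, size_t n, long acc)
    {
        while (n != 0u) {
            acc += data[0];
            data += 1;
            n -= 1u;
        }
        return acc;
    }

Both versions use constant stack space once the tail-call is eliminated; the loop version simply guarantees this regardless of compiler settings.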

How can I exploit tail-call optimization?

Very often, our application code does not fit the simple tail-call model, where b does some computation and then finishes by calling c. In the following articles, we will look at ways to profit from tail-call optimization in a wider range of code. For now, we will just look at an example.

I believe the most famous example of tail-call recursion to be quick sort. Quick sort has long been replaced by other algorithms that are objectively better, but at one time it was the preferred way to sort data. Quick sort relies on picking a “pivot” element and partitioning the data into two parts, one smaller than the pivot and one greater, before recursively sorting the two partitions. It has a terrible worst case, which occurs when the chosen pivot repeatedly has no smaller elements or no greater elements. In this situation, only the pivot element is removed from the data set at each level of recursion, so the depth of the recursion equals the size of the entire data set. With careful pivot choice, you can try to make this a rare case: if you choose the best of n candidate pivots, the worst-case recursion depth is the data set size divided by n, but it remains linear in the amount of data. Maybe this case is rare and you could tolerate the sorting occasionally taking a very long time, but you cannot tolerate the stack overflow caused by the recursion depth.

As quick sort has two recursive calls, only one of them can be the tail-call. The trick is to compare the sizes of the two partitions and to ensure that the tail-call (which will be replaced with looping) gets the larger partition and the other call (which will be implemented with a real, recursive call) gets the smaller partition. This implementation ensures that the recursion depth is at most O(log2(n)).
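
The following C sketch shows one way to arrange this (the identifiers and the simple last-element pivot choice are illustrative, not taken from any particular library). The recursive call always takes the smaller partition, so each level of recursion at least halves the remaining range, while the larger partition is handled by the while loop, which plays the role of the optimized tail-call.

    #include <stddef.h>

    static void swap_int(int *x, int *y)
    {
        int t = *x;
        *x = *y;
        *y = t;
    }

    /* Lomuto partition around the last element; returns the final
     * index of the pivot. A production version would choose the
     * pivot more carefully to make the bad cases rarer.          */
    static size_t partition(int *data, size_t lo, size_t hi)
    {
        int pivot = data[hi];
        size_t i = lo;
        size_t j;
        for (j = lo; j < hi; j++) {
            if (data[j] < pivot) {
                swap_int(&data[i], &data[j]);
                i++;
            }
        }
        swap_int(&data[i], &data[hi]);
        return i;
    }

    /* Sorts data[lo..hi] inclusive. The recursion always takes the
     * smaller partition, so the depth is bounded by log2(n); the
     * larger partition is handled by the loop.                    */
    void quick_sort(int *data, size_t lo, size_t hi)
    {
        while (lo < hi) {
            size_t p = partition(data, lo, hi);
            if (p - lo < hi - p) {           /* left side is smaller  */
                if (p > lo) {
                    quick_sort(data, lo, p - 1u);
                }
                lo = p + 1u;                 /* loop on the right side */
            } else {                         /* right side is smaller */
                quick_sort(data, p + 1u, hi);
                if (p == lo) {               /* defensive: avoids size_t
                                                wrap-around when p == 0 */
                    break;
                }
                hi = p - 1u;
            }
        }
    }

For an array of n > 0 elements, the call would be quick_sort(arr, 0u, n - 1u). Passing the range as inclusive indices keeps the comparison of partition sizes simple, at the cost of a little care with unsigned arithmetic.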