Architecting systems with modern C++

Work Contracts: A high performance, lock free, asynchronous task management system
August 25th, 2023



TL;DR: Work Contracts is an alternative to MPMC queues for processing asynchronous tasks. It offers superior performance when compared to leading lock-free MPMC queue implementations - especially under high contention.

In the beginning:

Many years ago I came up with a technique for managing asynchronous tasks which, to the best of my knowledge, was - and still is - very different from existing approaches. The typical approach involves pushing tasks into a queue and executing each task once it reaches the front of the queue. In general it is desirable to have many threads consume tasks from the queue concurrently, thereby allowing multiple tasks to be executed either concurrently or in parallel on different CPUs. For the sake of efficiency, the queue typically takes the form of a multi-producer/multi-consumer queue (MPMC queue) and is usually a "lock free" design, which improves throughput by reducing the overhead required to synchronize access to the queue during push/pop operations. Creating an efficient lock free MPMC queue is a non-trivial task and can be prone to subtle race conditions if not implemented with exacting attention to detail. Fortunately, there are many implementations of lock free MPMC queues which are widely available. Some are more efficient than others, and each has its own caveats and trade-offs, such as reduced throughput under heavy contention, weak ordering (tasks can be consumed in a different order than they were produced), fixed capacity, dynamic capacity, etc.
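To make that baseline concrete, here is the typical shape of this pattern using one widely available lock free MPMC queue, MoodyCamel's ConcurrentQueue (the enqueue/try_dequeue calls are the library's documented API; the surrounding scaffolding is a minimal sketch of my own):

```cpp
#include <concurrentqueue.h>   // moodycamel::ConcurrentQueue
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

int main()
{
    moodycamel::ConcurrentQueue<std::function<void()>> taskQueue;
    std::atomic<bool> done{false};

    // consumers: repeatedly pop tasks from the queue and execute them
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&]
        {
            std::function<void()> task;
            while (!done.load(std::memory_order_relaxed))
                if (taskQueue.try_dequeue(task))
                    task();
        });

    // producer: push tasks into the queue
    for (int i = 0; i < 1000; ++i)
        taskQueue.enqueue([] { /* do some work */ });

    done = true;
    for (auto & worker : workers)
        worker.join();
}
```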

Obviously, it is desirable to reduce the overhead of producing to and consuming from an MPMC queue as much as possible, since each task consumed has its overall latency penalized by the time it takes to synchronize its removal from the queue. In low latency environments it is critical to reduce this penalty as much as possible. However, by their very nature, all MPMC queues require synchronization, so there is always some latency penalty to be paid. And, as it turns out, synchronizing access to a task queue is almost never the end of the synchronization woes. But more on this later.

What are we (usually) trying to solve?

Over the decades I have concluded that the following conditions are generally true where asynchronous tasks are concerned:

In a lot of instances callbacks are a far better choice for handling highly deterministic events than task queues are. However, managing callbacks is notorious for producing software which is particularly difficult to navigate, especially in an asynchronous environment. As such, it's little wonder that task queues are often preferred over callbacks for handling tasks and events asynchronously. Conversely, a generic task/event queue approach can all too easily forfeit performance in situations where the nature of the asynchronous task/event is known up front.

For example, consider a common situation where software polls an interface for incoming packets and then tasks sockets with receiving those packets. Since our interest here is to accomplish this behavior asynchronously, we will have to perform these socket receives with threads other than the polling thread.

With a task queue based architecture it becomes necessary to create and queue a task which will eventually call receive for the socket. Presumably this task, being part of a general task queue system, will be abstracted in some fashion such that it can be customized for our specific task, i.e. to schedule receive for the socket. This might be done by overriding virtual functions on some base task class, by use of std::function or some similar approach, or by any number of other approaches entirely. Furthermore, the specifics of which socket to receive on must somehow be encapsulated within that task for use later, at the time when receive is actually scheduled. Further complications include how that task is allocated. Is it from a pool? Is it directly copy constructed into a slot within a fixed length circular queue? How big is this allocation? Is it sufficiently sized to accommodate a task of any size? Are all tasks the same size regardless of their nature? Let's pause to picture what such a system might look like:
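A minimal sketch, assuming the virtual base class approach (all names here are illustrative, not from any particular library):

```cpp
#include <memory>
#include <sys/socket.h>

// generic task abstraction: every queued task is type-erased behind
// a virtual interface and allocated somewhere (heap, pool, queue slot)
struct task
{
    virtual ~task() = default;
    virtual void execute() = 0;
};

// the socket to operate on must be captured inside the task itself
struct receive_task : task
{
    explicit receive_task(int socketFd) : socketFd_(socketFd) {}

    void execute() override
    {
        char buffer[2048];
        ::recv(socketFd_, buffer, sizeof(buffer), 0);
        // ... process buffer ...
    }

    int socketFd_;
};

// the polling thread then does something like:
//     taskQueue.enqueue(std::make_unique<receive_task>(selectedSocketFd));
// and some worker thread, later, pops the task and calls execute().
```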

Additionally, suppose a second task were in the queue to operate on the same socket. How does this system prevent a second thread from processing this second task while the initial receive task might still be underway?

So many complications for a situation where the task (call receive on the socket) is deterministically the next logical task once the socket has been polled. If only there were a way to get the benefits of callbacks in such situations. After all, when a socket has been selected by the polling thread, we know that the next task is to schedule receive and we know which socket is involved.

Work contracts: Pros, cons and caveats:

Caveats:

Pros:

Cons:

Queues? We don't need no stinking queues!

In contrast to the classic "Task Queue" approach, the "Work Contract" approach eschews queues entirely. Instead, "tasks" (referred to as "work contracts") are stored as leaves of binary heaps which support lock free manipulation. Although work contracts accomplish the same basic goals as tasks, they are sufficiently different that it is worth establishing some basic terminology before continuing. (A simplified sketch of such a heap follows the terminology below.)

work contract: an object defining a specific work function (a task) as well as an optional release function (a kind of task destructor)
work contract group: a container for a group of work contracts (effectively the lock free binary heap)
work function: a repeatable function that will be executed asynchronously after the work contract has been scheduled
release function: an optional function which will be executed asynchronously once after the work contract has been released
scheduled: when a work contract is scheduled it is flagged as ready for asynchronous execution of its work function (the equivalent of enqueue for task queues)
executed: the act of executing the work function via a worker thread asynchronously (the equivalent of dequeue for task queues)
released: the final state of a work contract prior to its destruction. When a work contract is released it is flagged for destruction and scheduled one final time. However, because it is flagged for destruction, when it is asynchronously executed for the final time the release function is executed rather than the work function. It is fair to think of the release function as akin to a destructor for the work contract. Clever use of the release function can produce safer and more efficient code. See the source code for examples.
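Here is the promised sketch of the kind of lock free binary heap alluded to above. This is my illustration of the general technique, not the library's actual implementation: a complete binary tree of atomic counters in which scheduling a contract increments the counters on the path from its leaf up to the root, and a worker thread claims a scheduled contract by decrementing counters from the root down to a leaf:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Simplified "signal tree". Leaves map one-to-one to work contracts.
// leafCount is assumed to be a power of two.
class signal_tree
{
public:
    explicit signal_tree(std::size_t leafCount)
        : firstLeaf_(leafCount - 1), nodes_(2 * leafCount - 1) {}

    // schedule: increment counters from the leaf up to the root.
    // because the root is incremented last, a successful claim at the
    // root guarantees the rest of the path is already visible.
    void set(std::size_t leaf)
    {
        for (auto n = firstLeaf_ + leaf;; n = (n - 1) / 2)
        {
            nodes_[n].fetch_add(1);
            if (n == 0)
                break;
        }
    }

    // claim one scheduled leaf, or return -1 if none are pending
    std::int64_t select()
    {
        if (!try_decrement(0))
            return -1;                        // nothing scheduled
        std::size_t n = 0;
        while (n < firstLeaf_)                // descend toward a signaled leaf
        {
            auto left = 2 * n + 1;
            if (try_decrement(left)) { n = left; continue; }
            auto right = left + 1;
            if (try_decrement(right)) { n = right; continue; }
            // transient race with an in-flight set(): retry this level
        }
        return static_cast<std::int64_t>(n - firstLeaf_);
    }

private:
    // decrement the counter if (and only if) it is non-zero
    bool try_decrement(std::size_t n)
    {
        auto expected = nodes_[n].load();
        while (expected > 0)
            if (nodes_[n].compare_exchange_weak(expected, expected - 1))
                return true;
        return false;
    }

    std::size_t firstLeaf_;
    std::vector<std::atomic<std::uint64_t>> nodes_;
};
```

Note that there is no FIFO ordering here: select() simply locates some scheduled leaf, which is part of why a structure like this can avoid the contention hot spot that the head and tail of a queue represent.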

The basic life cycle of a work contract is: create the contract, schedule it, execute its work function asynchronously (repeating schedule/execute as many times as needed), and finally release it.

Work contracts come with some additional guarantees which are very important to note and which can be exploited to ensure higher quality, safer code.
Work contracts are typically positioned within an object upon which their work function shall operate. Returning to our poller and socket example, the equivalent illustration for work contracts would be:
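Here is a hypothetical code sketch of that arrangement. The type and member names (work_contract, work_contract_group, create_contract, schedule) are assumptions made for illustration; see part two and the repository for the actual API:

```cpp
#include <sys/socket.h>
#include <unistd.h>

// the work contract lives inside the object its work function operates on
struct connected_socket
{
    connected_socket(int fd, work_contract_group & group)
        : fd_(fd),
          contract_(group.create_contract(
                [this]() { this->receive(); },   // work function
                [this]() { ::close(fd_); }))     // release function (cleanup)
    {
    }

    void receive()
    {
        char buffer[2048];
        ::recv(fd_, buffer, sizeof(buffer), 0);
        // ... process buffer ...
    }

    int fd_;
    work_contract contract_;
};

// polling thread: no task object is created, nothing is allocated and
// nothing is pushed into a queue - the contract is simply flagged:
//
//     if (/* poll reports fd_ readable */)
//         socket.contract_.schedule();
```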

In the above illustration a work contract is created and its work function is assigned to execute receive on the associated socket. When the polling thread selects the given socket it schedules the work contract associated with that socket which flags the work contract as ready for asynchronous processing. When the worker thread identifies a scheduled work contract it executes the work contract's work function, which is to call receive on that socket. Here we stop to note the following:

Benchmarks: "Show me the numb-ahs!"

Ignoring my Boston accent ... for the sake of those who want to see performance numbers before investing time into reading code examples, we now skip right to the fun part - the benchmarking. It is difficult to draw a direct comparison between work contracts and any given MPMC queue, because work contracts is intended to solve a particular use case where MPMC queues are typically used, whereas any given MPMC queue implementation likely isn't intended exclusively for use as an asynchronous task queue. With this in mind I have chosen to compare work contracts against two of the most efficient MPMC queues: MoodyCamel's ConcurrentQueue and TBB's concurrent_queue. MoodyCamel's ConcurrentQueue has several modes of operation, and it is beyond the scope of this benchmark to do any fine tuning to identify the best mode of operation here. Consequently, this benchmark uses ConcurrentQueue's default, "out of the box" mode.

The test machines:
Tests were run on two different machines. In each case only the machine's physical cores are utilized (no hyperthreading) and each worker thread has its affinity set to a different physical core.

OS: Ubuntu 18.04 for Intel tests, Ubuntu 23.04 for AMD tests
Compiler: GCC 12.3.0

The benchmark tests:
Low Contention: Task is to calculate the number of primes up to 1000 using the sieve of Eratosthenes
Medium Contention: Task is to calculate the number of primes up to 300 using the sieve of Eratosthenes
High Contention: Task is to calculate the number of primes up to 100 using the sieve of Eratosthenes
Maximum Contention: Task is an empty function which does nothing

The process:
Create a simple task which consumes a modest amount of CPU when executed, and repeatedly execute that task asynchronously for some duration of time (10 seconds in this case). This task is specifically crafted so as to avoid contention for resources across threads.
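The task itself is straightforward; my reconstruction of the described task looks something like this (note that it touches only its own local memory, so running many copies in parallel contends only on the queue or work contract group):

```cpp
#include <cstddef>
#include <vector>

// count the primes up to `limit` using a sieve of Eratosthenes
std::size_t count_primes(std::size_t limit)
{
    std::vector<bool> composite(limit + 1, false);
    std::size_t count = 0;
    for (std::size_t n = 2; n <= limit; ++n)
    {
        if (composite[n])
            continue;
        ++count;                                  // n is prime
        for (auto multiple = n * n; multiple <= limit; multiple += n)
            composite[multiple] = true;           // mark composites
    }
    return count;
}
```

The benchmark tiers then differ only in the limit passed in: 1000, 300, 100, or no work at all.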
The test for both Moody Camel and TBB is as follows:

The test for the Work Contracts is as follows:

Finally, after the 10 seconds elapse the worker threads are shut down and several metrics are gathered. These include the overall number of tasks executed, the average number of tasks per thread per second, the standard deviation of each individual task's execution count, and the standard deviation of how many tasks each thread executed.
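For clarity, the coefficient of variation reported below is simply the standard deviation divided by the mean. A sketch of the computation (my own, for illustration):

```cpp
#include <cmath>
#include <vector>

// coefficient of variation (sigma / mu) of a set of execution counts
double coefficient_of_variation(std::vector<double> const & counts)
{
    double mean = 0;
    for (auto count : counts)
        mean += count;
    mean /= counts.size();

    double variance = 0;
    for (auto count : counts)
        variance += (count - mean) * (count - mean);
    variance /= counts.size();

    return std::sqrt(variance) / mean;
}

// usage:
//   coefficient_of_variation(perTaskExecutionCounts)   -> "fairness" table
//   coefficient_of_variation(perThreadExecutionCounts) -> "balance" table
```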

Low Contention: Task is to calculate the number of primes up to 1000 using the sieve of Eratosthenes

This task is sufficiently complex that the worker threads spend more time executing the task than competing for new tasks, which greatly reduces the contention on the task queue or work contract group. Even so, it is clear that the MPMC queues begin to falter as the number of worker threads increases. Results on Intel appear to be improved for all three implementations, and work contracts are shown to be almost unaffected by the increase in threads.

Medium Contention: Task is to calculate the number of primes up to 300 using the sieve of Eratosthenes

Same as the "low contention" test but with reduced task execution time, thereby increasing the contention on the task queue or work contract group. Again, it is clear that the MPMC queues begin to faulter as the number of worker threads increases. And again, the results on Intel appear to be improved for all three implementations, and work contract is shown to be almost unaffected by the increase in threads.

High Contention: Task is to calculate the number of primes up to 100 using the sieve of Eratosthenes

Same as the "medium contention" test but with a much reduced task execution time, thereby increasing the contention on the task queue or work contract group to nearly maximum. As contention increases all solutions begin to perform worse as threads increase. However, the degradation in performance for work contracts is far less than it is for MPMC queues.

Maximum Contention: Task is an empty function which does nothing

By eliminating any work within the task, this test creates the maximum contention on the task queue or work contract group. With all work removed, every solution is heavily affected by the constant contention. However, work contracts consistently outperforms the MPMC based solutions, and at this point TBB's performance has degraded markedly.

Coefficient of Variation for execution of tasks (High contention at 10 threads):

This measures how evenly tasks were executed overall. With no prioritization we should expect that all tasks have the same chance of being executed. A low value indicates that tasks were treated fairly; a high value suggests that there is a bias in the task selection mechanism. This measurement indicates that the mechanisms in the MoodyCamel implementation which (I believe) are the source of its speed cause a clear skew in which tasks get processed over others. This suggests that MoodyCamel's ConcurrentQueue might not be a good selection for a task queue if "fairness" is required.
        TBB       Moody Camel   Work Contract
Intel   < 0.001   0.759         0.001
AMD     < 0.001   0.706         0.001


Coefficient of Variation for work executed by each thread (High contention at 10 threads):

This measures how evenly tasks were distributed across threads overall. It is generally assumed that all worker threads will do approximately the same amount of work throughout the test. A low value indicates that tasks were executed as expected - evenly across all threads. A high value suggests a bias which causes some threads to do less work than others. This measurement indicates that TBB has difficulty balancing work evenly across threads when contention is high (particularly on Intel), and that some of its worker threads do far less work than others.
        TBB     Moody Camel   Work Contract
Intel   0.882   0.081         0.104
AMD     0.153   0.052         0.096

Usage guide:

How to use work contracts, along with code examples, can be found in part two.