Details and comparison of miRack and VCV Rack multithreading implementation and performance

I implemented multithreading for the miRack audio engine back in 2018, when it was a project targeting single-board computers: with slower CPUs, multithreading is essentially a requirement for running any decently sized patch. The implementation (available here, and used in the miRack app with minor modifications) is based on the idea of keeping arrays of input and output values for each module, together with a lock-free concurrent work queue implementation by Cameron Desrochers.

For each rendering cycle, say 512 samples (steps), all rack modules are pushed into a work queue together with the start and end steps to process (so initially that is 1 to 512), then the worker threads are woken up. The worker threads dequeue modules from the work queue and check that values for the step being processed are present for all module inputs (for disconnected inputs this is always true). If all values are available, the module is processed and its output values are saved straight into the input arrays of the modules connected to each output, for the next step number. Processing of that module continues until the end step is reached or until any of the input values is not available, in which case the module is pushed back into the work queue (with an updated start step if needed) and another module is pulled from the work queue. Once there are no more modules in the work queue, the workers pause and the rendering cycle completes. This implementation ensures that workers don't wait unless they have to.
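The scheduling loop described above can be sketched as follows. This is a minimal single-worker model in Python, not the actual miRack code (which is C++ and uses a lock-free queue shared by several threads). The `Module` class, `connect`, `render`, the one-output-per-module restriction, and the assumption that the first step reads a value of 0.0 carried over from the previous cycle are all hypothetical simplifications for illustration.

```python
from collections import deque
from itertools import count


class Module:
    """A rack module with one output and `n_inputs` inputs (hypothetical model)."""

    def __init__(self, name, fn, n_inputs=0):
        self.name = name
        self.fn = fn                        # maps a list of input samples to one output sample
        self.sources = [None] * n_inputs    # upstream module per input, None if unplugged
        self.targets = []                   # (module, input index) pairs fed by this output
        self.inputs = []                    # per-input sample arrays, rebuilt every cycle
        self.out = []                       # produced samples, kept only for inspection


def connect(src, dst, index):
    dst.sources[index] = src
    src.targets.append((dst, index))


def render(modules, steps):
    """Process steps 1..steps for every module; returns how often a module was re-queued."""
    for m in modules:
        m.out = []
        m.inputs = []
        for src in m.sources:
            if src is None:
                buf = [0.0] * (steps + 1)   # unplugged input: always available
            else:
                buf = [None] * (steps + 1)  # filled in by the upstream module
                buf[1] = 0.0                # assumed value carried over from the previous cycle
            m.inputs.append(buf)            # index 0 is unused, steps are 1-based

    queue = deque((m, 1) for m in modules)  # (module, start step); the end step is `steps`
    requeues = 0
    while queue:
        m, step = queue.popleft()
        # Run ahead for as long as every input has a value for the current step.
        while step <= steps and all(buf[step] is not None for buf in m.inputs):
            y = m.fn([buf[step] for buf in m.inputs])
            m.out.append(y)
            if step < steps:
                for dst, i in m.targets:
                    dst.inputs[i][step + 1] = y   # written straight into the consumer's array
            step += 1
        if step <= steps:
            queue.append((m, step))         # blocked: retry later from the updated start step
            requeues += 1
    return requeues


# Usage: a counter feeding a gain module, with the consumer queued first on purpose.
ticks = count(1)
src = Module("src", lambda _: float(next(ticks)))
gain = Module("gain", lambda x: 2.0 * x[0], n_inputs=1)
connect(src, gain, 0)
requeues = render([gain, src], steps=4)
print(src.out, gain.out, requeues)
```

In this example the source module has no inputs, so it runs through all four steps in one go. The gain module is re-queued once, after its first step, and then also runs to the end without further interruption.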

Until recently I had never looked at the multithreading implementation that later appeared in VCV Rack (available here), but I had been meaning to run some benchmarks at some point.

During normal operation, the VCV Rack implementation uses spinlocks only. For each step in a rendering cycle, workers process only that single step for each module. Once there are no more modules for a worker to pick up, it spin-waits until all workers have finished; then values are transferred from outputs to connected inputs, and the workers are woken up to process the next single step. As a result, the workers spend a lot of time waiting when they could be processing later steps for some modules.
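The lockstep structure described above can be sketched as follows. This is a single-threaded Python model and not VCV Rack's actual code: the `render_lockstep` function, the dictionary-based module and cable representation, and the `barriers` counter are hypothetical names used only to show where the per-step synchronization point falls.

```python
from itertools import count


def render_lockstep(modules, cables, steps):
    """modules: name -> (fn, number of inputs); cables: (src, dst, input index)."""
    inputs = {name: [0.0] * n for name, (_, n) in modules.items()}
    outputs = {name: 0.0 for name in modules}
    history = {name: [] for name in modules}
    barriers = 0
    for _ in range(steps):
        # Phase 1: every module is processed for this one step only.
        # In a threaded engine the workers would share this loop and then
        # spin-wait until the slowest of them has finished.
        for name, (fn, _n) in modules.items():
            outputs[name] = fn(inputs[name])
            history[name].append(outputs[name])
        barriers += 1                       # one synchronization point per step
        # Phase 2: transfer outputs to connected inputs for the next step.
        for src, dst, index in cables:
            inputs[dst][index] = outputs[src]
    return history, barriers


# Usage: a counter feeding a gain module, rendered for 4 steps.
ticks = count(1)
modules = {
    "src": (lambda _: float(next(ticks)), 0),
    "gain": (lambda x: 2.0 * x[0], 1),
}
cables = [("src", "gain", 0)]
history, barriers = render_lockstep(modules, cables, steps=4)
print(history, barriers)
```

The number of synchronization points grows with the number of steps, regardless of how the modules are connected, which is where the waiting comes from.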

Now to the benchmark. I used the current miRack code and the latest VCV Rack code. All graphics rendering was disabled, as well as audio output. For VCV Rack, updating port lights was also disabled - it involves a lot of computations that substantially affect the results while not being related to audio processing.

The audio engines were told to process 1024 samples (steps) as fast as they could, and this was repeated 1000 times with a single thread, then with 2, 3, and 4 worker threads. The tests were performed on a CPU with 4 physical cores. The following patches (by VCV Rack Ideas) were used:

Patch 1

Patch 2
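The measurement procedure described above amounts to a simple timing loop. The sketch below is in Python and only illustrates the structure. The `benchmark` function and the `render_cycle(steps, threads)` callable are hypothetical stand-ins for whatever entry point the real (C++) engines expose; they are not part of either project.

```python
import time


def benchmark(render_cycle, thread_counts=(1, 2, 3, 4), steps=1024, repeats=1000):
    """Return total wall-clock time in milliseconds for each thread count."""
    results = {}
    for threads in thread_counts:
        start = time.perf_counter()
        for _ in range(repeats):
            render_cycle(steps, threads)    # hypothetical: render `steps` samples as fast as possible
        results[threads] = (time.perf_counter() - start) * 1000.0
    return results
```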

1st Patch Results

Threads	miRack Time	miRack %	VCV Rack Time	VCV Rack %	Ideal %
1	4242ms	100.00%	6313ms	100.00%	100.00%
2	2236ms	52.71%	5179ms	82.04%	50.00%
3	1620ms	38.19%	4604ms	72.93%	33.33%
4	1341ms	31.61%	4312ms	68.30%	25.00%

2nd Patch Results

Threads	miRack Time	miRack %	VCV Rack Time	VCV Rack %	Ideal %
1	4904ms	100.00%	6203ms	100.00%	100.00%
2	3054ms	62.28%	4944ms	79.70%	50.00%
3	2575ms	52.51%	4578ms	73.80%	33.33%
4	2357ms	48.06%	4455ms	71.82%	25.00%

The "%" columns show the time relative to the single-threaded case, and "Ideal %" shows the best theoretically achievable result, that is, an N-fold improvement for N threads.
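For reference, the derived columns can be recomputed from the measured times as in the small Python sketch below. The `relative_times` helper is a hypothetical name, and the speed-up factor in the last position is an extra figure not shown in the tables.

```python
def relative_times(times_ms):
    """times_ms[i] is the measured time with i + 1 threads.

    Returns rows of (threads, % of single-threaded time, ideal %, speed-up factor).
    """
    base = times_ms[0]
    rows = []
    for threads, t in enumerate(times_ms, start=1):
        percent = round(100.0 * t / base, 2)    # the "%" column
        ideal = round(100.0 / threads, 2)       # the "Ideal %" column
        speedup = round(base / t, 2)            # how many times faster than one thread
        rows.append((threads, percent, ideal, speedup))
    return rows


# miRack times for the 1st patch, taken from the table above.
for row in relative_times([4242, 2236, 1620, 1341]):
    print(row)
```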

I should also note that the original goal was to compare the speed-up from multithreading, not absolute values (if only because miRack and VCV Rack use different versions of some of the patch modules), but the absolute values turned out to be quite interesting as well. As mentioned above, the port light update code adds about another second to the VCV Rack results.