Performance

Audio systems that run in realtime can be tricky to optimize for performance. In general, they have the following characteristics:

  • Bounded Latency: We must deliver samples to the sound card at regular intervals. When running in realtime, we cannot process ahead of time, so as to allow for changes to the audio graph (adds/removes) via user interaction and/or scheduled events. If we do not deliver to the sound card in time, we get buffer underruns and the sound breaks up, which is a failure in the system. Backpressure, which regulates our production rate to realtime, is supplied by the audio driver. For example, in OpenJDK, the OSX audio driver for JavaSound uses CoreAudio and has mutex locking protecting reads from and writes to a ring buffer. It is this locking mechanism that prevents us from processing too far ahead of time.
  • Low Latency: The amount of latency is determined by the size of the buffer we regularly send to the sound card. The larger the buffer, the higher the latency between when audio is produced and when it is heard. The smaller the buffer, the shorter the latency, but also the shorter the time we have in which to produce audio. A buffer of 64 samples at a 44100 sampling rate (~1.45 ms) is an ideal target, though larger sizes are more practical (e.g. 256 samples at 44100, ~5.8 ms). (Pink is configured to use 256 by default except on Windows, which uses 1024, as the JDK there uses the older DirectSound API, which does not support low latency.)
  • Throughput: In general, for signal-processing systems, we schedule ongoing processes as part of the audio graph or control graph. Each process adds computation. For the control graph, each function is processed once per buffer. For the audio graph, each function must generate a buffer-size's worth of samples. Ultimately the amount of processing scales with the sample rate (i.e. 44100 samples per second). Each part of the audio graph therefore has a high cost, as its operations are multiplied by the sample rate. In some ways, we can measure throughput by how deeply the audio graph can be extended before running into CPU limits.
  • CPU-Bound: For the most part, operations in audio systems are CPU-bound. Network audio, systems streaming large amounts of samples from disk (e.g. multi-gigabyte sample sets), and those writing a large number of channels to disk may become IO-bound.
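The latency figures above follow directly from buffer size and sample rate. A minimal sketch of the arithmetic (in Java; the class and method names are ours, not part of Pink):

```java
// Latency is the time spanned by one buffer: bufferSize / sampleRate.
public class BufferLatency {
    static double latencyMs(int bufferSize, double sampleRate) {
        return bufferSize / sampleRate * 1000.0;
    }

    public static void main(String[] args) {
        System.out.printf("64 @ 44100  -> %.2f ms%n", latencyMs(64, 44100.0));  // ~1.45 ms
        System.out.printf("256 @ 44100 -> %.2f ms%n", latencyMs(256, 44100.0)); // ~5.80 ms
    }
}
```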

The following sections discuss aspects of performance that have been addressed in Pink.

Numeric Performance

For numeric performance, Pink audio code is written with the goal of performing as closely as possible to equivalent code written natively in Java. For Clojure code, Pink uses type hints wherever necessary to ensure that math operations are done using Java primitives (i.e. double and long, rather than Double and Long). Regularly checking the generated bytecode using “javap -c” on compiled classes is recommended when developing audio code in Pink, to ensure no hidden auto-boxing is occurring in the generated invoke or invokePrim methods. The no.disassemble library is also an excellent tool for checking bytecode.

In general, development of Pink is done with *warn-on-reflection* set to true and *unchecked-math* set to :warn-on-boxed. This helps a great deal in finding most places where reflection or boxing might occur. However, some operations, such as = and not= for comparing numbers, do box primitives and are not covered by those flags, so checking bytecode after developing audio code is still recommended to double-check for performance hotspots. Pink does not, however, depend on end-users enabling those flags, as users may want the normal Clojure numeric tower for their own musical calculations.
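To make the boxing cost concrete, here is a Java sketch of what hidden auto-boxing amounts to; un-hinted Clojure math can compile down to something like the boxed version. The names are illustrative only:

```java
// Summing with a primitive accumulator vs. a boxed one. Both return the
// same value, but in the boxed loop each += unboxes, adds, and re-boxes,
// allocating a new Double per iteration -- exactly the allocation pressure
// that type hints are meant to avoid.
public class BoxingDemo {
    static double sumPrimitive(double[] xs) {
        double total = 0.0;            // primitive: stays in a stack slot
        for (double x : xs) total += x;
        return total;
    }

    static double sumBoxed(double[] xs) {
        Double total = 0.0;            // boxed: one Double allocated per +=
        for (double x : xs) total += x;
        return total;
    }
}
```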

Also, as array indexing in Java only supports the use of ints, Pink’s generator macro does a single coercion of the long indx variable into int-indx, so that all operations within the audio code’s loop can share the single int. This results in fewer l2i bytecode instructions being generated.
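The pattern can be sketched in Java terms (hypothetical names; Pink's generator macro emits the equivalent in Clojure):

```java
// Coerce the long index once, outside the loop, so the loop body indexes
// the array with a plain int and no l2i conversion is needed per iteration.
public class IndexCoercion {
    static double sumFrom(double[] buf, long indx) {
        int intIndx = (int) indx;          // single long->int coercion
        double total = 0.0;
        for (int i = intIndx; i < buf.length; i++) {
            total += buf[i];               // int-indexed access
        }
        return total;
    }
}
```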

Memory Allocation and Garbage Collection

Due to the low-latency requirements of audio, special care is required in dealing with the JVM’s memory system. In general, having a garbage-collected model offers tremendous benefits for audio programming. Memory allocations are very cheap, often noted as “just a pointer bump” (reference to be inserted here). This alleviates a large headache: having to be careful about where and when to allocate memory in audio code. Because this is built into the JVM, we do not have to worry about implementing our own GC or realtime-safe memory allocators for our audio systems. This is a tremendous win.

On the other hand, using a garbage collector does have a cost. Usually this cost is amortized in general programming situations where latency requirements are unbounded or not so low. For audio systems, however, the cost of GC is something to be very much aware of.

In general, using G1, there are two collectors to consider: the young-generation collector and the old-generation collector. Regardless of the GC model used on the JVM, all current collectors have stop-the-world pauses. While these systems have developed a great deal over time, they generally do not target the kinds of latency requirements necessary for realtime audio systems, though they can work out fine if the object allocation rate is managed.

For Pink, the system design is optimized against the costs of the young-generation collector. For audio, most calculations are transient: we generate one buffer of audio, send it to the sound card, and repeat. After delivering a buffer of audio to the card, we no longer use that information and move on in time to the next buffer of audio.

For the young-generation collector, the cost of collection is proportional to the amount of live objects, as only live objects are copied to the to-space. Dead objects are simply ignored, and their memory is implicitly free for reuse. This works tremendously well, as most objects die young and not a great deal of time is spent during collection.

Tuning the young-generation collector can be tricky. We can reduce the portion of the heap allocated to the eden space, which lowers the possible pause time of the collector, but at the expense of triggering it more frequently. More frequent collection also has the side effect of making objects age faster and move into the old generation, as their age is measured by the number of collections they survive.

In Pink, unit generators reuse buffers (double arrays) between calls. When a unit generator is called, it returns new values using the pre-existing buffer. (This is done within a processing model that has a context of time, allowing for the safe use of mutable objects by promoting a single-writer, multiple-readers discipline.) This is a trade-off. If new buffers were generated on each call, the object allocation rate would be very high. For example, with a buffer size of 64 samples, each mono-signal unit generator would allocate a 64-length double array per call, amounting to 64 * 8 bytes = 512 bytes per call (plus additional memory for the object header). Every unit generator in use would therefore allocate ~360K per second, or ~22M per minute. As the audio graph can be very deep, with many unit generators in use, the allocation rate can become very high very quickly, triggering GC pauses at a very high rate. The downside of holding on to a buffer is that the object can become part of the old generation. In general, the current design favors lowering the GC frequency in the young generation. This works well in short tests of under 10 minutes, but long-running tests are required to gauge the viability of this design for uses such as long-running installations.
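A sketch of the buffer-reuse pattern in Java (hypothetical names, not Pink's actual API): the generator allocates its output array once and overwrites it on every call, in keeping with the single-writer, multiple-readers discipline described above.

```java
// A mono sine unit generator that reuses one output buffer across calls
// instead of allocating 64 * 8 = 512 bytes on every call.
public class SineGen {
    private final double[] out;   // allocated once, reused every call
    private final double incr;    // phase increment per sample
    private double phase = 0.0;

    SineGen(int bufferSize, double freq, double sampleRate) {
        this.out = new double[bufferSize];
        this.incr = 2.0 * Math.PI * freq / sampleRate;
    }

    // Overwrites and returns the same array each call; readers must consume
    // it before the next call, which the engine's time context guarantees.
    double[] generate() {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.sin(phase);
            phase += incr;
        }
        return out;
    }
}
```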

To note, the attention to memory and GC in Pink is given with the assumption that higher-level parts of a music system’s architecture (i.e. user code building on Pink) will have greater freedom to develop without such high concern for these details. Benchmarking and performance monitoring are still an ongoing research area for Pink, and the above assumptions and approaches represent the current state of thinking and design.

GC Calculations

Currently, development of Pink is done with the following calculations for what is “safe” with regard to the garbage collector. To achieve the desired behavior, where GC will not interfere with the audio engine, a Pink project should aim to render n cycles of the audio engine within the time represented by the samples delivered per trip to the soundcard, minus the maximum observed stop-the-world time of the young-generation GC on the system.

For example, with Pink’s default values on OSX, the hardware buffer size is 256 samples, the internal block size is 64 samples, and the sample rate is 44100 samples per second. At this rate, the engine runs four internal render loops to fill the external buffer before sending it off to the sound card. 256 samples at 44100 Hz is equivalent to ~5.8 ms. On my laptop (Macbook Pro 2011 13", Core i7), stop-the-world times of around 2.2 ms were observed for a small project. That leaves roughly 3.6 ms to generate the 4 internal buffers, or one external buffer of time.
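The budget in this example can be sketched as follows (illustrative names; the 2.2 ms pause is the measured value from above):

```java
// Render window = time represented by the hardware buffer, minus the
// worst observed young-generation stop-the-world pause.
public class GcBudget {
    static double renderWindowMs(int hwBufferSize, double sampleRate,
                                 double maxGcPauseMs) {
        double bufferMs = hwBufferSize / sampleRate * 1000.0;
        return bufferMs - maxGcPauseMs;
    }

    public static void main(String[] args) {
        // 256 samples at 44100 Hz ~= 5.8 ms, minus a 2.2 ms pause ~= 3.6 ms
        System.out.printf("%.2f ms%n", renderWindowMs(256, 44100.0, 2.2));
    }
}
```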

Using a window of time equal to the external buffer size minus the stop-the-world time gives the maximum time in which the engine can calculate, be interrupted at any point by the GC, and still meet the deadline to send the buffer to the soundcard. Note that these calculations assume the frequency of the GC is reduced such that it will not trigger more than once per delivery window (i.e. GC won’t happen twice within ~5.8 ms in the example above).

Because there are a number of variables involved, users who do run into issues where the GC interrupts the audio can try increasing the hardware buffer size for the Pink engine. This gives a larger window of time to deliver to the soundcard, at the expense of increased latency. Another approach is to lower the CPU cost of the project, whether by optimizing calculations or by using less CPU-intensive ones (e.g. linear interpolation rather than cubic interpolation). Also of note: because the GC calculation is dependent upon the CPU, a project’s performance will vary with the hardware it runs on. This should be accounted for when moving a project from one machine to another.