At Butterfly we develop whole-body ultrasound devices that clinicians use to perform abdominal scans, cardiac exams, vascular access, and everything in between. The Butterfly iQ consists of an ultrasound transducer that plugs into a mobile device (iOS or Android).
As a medical device startup trying to reinvent ultrasound in a competitive field, Butterfly emphasizes moving fast: building prototypes, incorporating feedback from our clinical team and the market, and shipping to our users.
Unfortunately, iterating quickly on a mobile medical imaging device requires overcoming two major hurdles:
- The computational resources are limited.
- Scientists and engineers must be able to develop and iterate on prototypes that can swiftly turn into a product.
To overcome these challenges, Butterfly builds upon general-purpose computing on graphics processing units (GPGPU) - using graphics processing units for non-graphics computation. It may seem surprising that this technique accelerates development at Butterfly: GPU programming has a reputation for being laborious and requiring expert training. So, how do GPUs allow Butterfly to move fast?
Using GPUs to overcome the compute resource limitations of mobile devices should not come as a surprise. In ultrasound, and in medical imaging in general, the computational problems often require applying identical mathematical operations to many bins, samples, or pixels: one usually gathers data from a sensor or an array of sensors and wants to compute an image or video. In ultrasound, this process is called “beamforming” - or more generally “image reconstruction”.
For the most part, beamforming consists of computing distances, interpolating, filtering, and summing. It turns out that is exactly what video-game engines do as well! So we can use the graphics processors designed for video games to create ultrasound images - GPUs are really good at that.
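To make this concrete, here is a toy delay-and-sum sketch in C++ - our own simplification for illustration, not Butterfly's actual pipeline. For one image point, it computes the round-trip distance to each transducer element (assuming a plane-wave transmit), linearly interpolates the recorded channel data at the corresponding sample delay, and sums the contributions:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Toy delay-and-sum for a single image point. Every image point can be
// computed independently with the same operations, which is exactly the
// data-parallel pattern that GPUs excel at.
float delay_and_sum(const std::vector<std::vector<float>>& rf, // [element][sample]
                    const std::vector<float>& elem_x,          // element x positions (m)
                    float px, float pz,                        // image point (m)
                    float c, float fs)                         // speed of sound (m/s), sample rate (Hz)
{
    float sum = 0.0f;
    for (std::size_t e = 0; e < rf.size(); ++e) {
        float dx = px - elem_x[e];
        // Plane-wave transmit path (pz) plus receive path back to element e.
        float dist = pz + std::sqrt(dx * dx + pz * pz);
        float t = dist / c * fs;                       // delay in samples
        std::size_t i0 = static_cast<std::size_t>(t);
        if (i0 + 1 >= rf[e].size()) continue;          // delay outside recording
        float frac = t - static_cast<float>(i0);
        // Linear interpolation between the two neighboring samples.
        sum += (1.0f - frac) * rf[e][i0] + frac * rf[e][i0 + 1];
    }
    return sum;
}
```

In a GPU implementation, one kernel instance would execute this loop for each image point.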
From Prototype to Product
With the computational power problem overcome, we face a new challenge. Developing beamforming algorithms on a mobile device can be painful: iteration time is slow (the modify-code --> recompile --> test cycle can take a couple of minutes per iteration), and debugging on the device can be tricky in general - especially when an ultrasound sensor is plugged into the USB port.
Consequently, our idea is to prototype algorithms not on a mobile device but on a desktop computer where scientists can iterate quickly. This is not unusual - the practice of building prototypes in Matlab or Python is very common in medical imaging.
The key now is to involve GPGPU as early as possible in the prototyping process. The advantages are twofold: 1) GPU implementations are usually faster, which makes iterating on the details easier, and 2) GPGPU code can be re-used directly in the product.
If we look at the landscape of GPU compute languages available per platform, it becomes clear how this is possible. We observe that the common denominator is C.
GPU compute language review:
- MacOS: OpenCL (C99 if OpenCL < 2.2 else C++11), Metal (C++11)
- Linux: CUDA (C++11) and OpenCL (C99 if OpenCL < 2.2 else C++11)
- iOS device: Metal (C++11)
So we came up with the following scheme to go from Prototype to Production:
- A prototype is developed in Python, using Numpy, Scipy, and any other Python library that might be useful - this allows us to implement new imaging modes and features in hours vs. many days or even weeks
- Once the prototype looks promising, the core operations (the compute-heavy steps) are ported to OpenCL using PyOpenCL - when writing the kernels, we already consider that we will later want to run those kernels on a mobile device (see section C-based Kernel Language below)
- Once clinical feedback is positive, the orchestration code, i.e. the code that calls kernels and copies memory, is ported to C++ using our GPGPU library Aura (see section Cross Platform Memory and Kernel Orchestration)
- Finally, to go to the mobile platform we only have to cross-compile the code.
The following illustration depicts these 4 steps, what happens at each step and how one transitions from step to step.
Note that the two implementations from steps 1 (Numpy) and 2 (OpenCL) can be used to trivially implement unit or integration tests: the Python prototype acts as a behavioral model that can be compared against the accelerated implementation. This increases confidence in the code and can prevent regressions.
C-based Kernel Language
We mentioned above that C is the common denominator for our beamforming code. There are differences in kernel calling conventions between OpenCL and Metal, but we handle those with the C preprocessor. We explicitly chose the C preprocessor, instead of a custom code generator, as it is ubiquitous and the transformations we require to support OpenCL and Metal are not complex.
With the help of the preprocessor, we can write one kernel that works for both OpenCL and Metal. A shared header simplifies this by providing polyglot implementations of most functions and types.
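A minimal sketch of what such a shared header could look like - the AURA_* macro names are our own invention for illustration, and Metal buffer-binding attributes are omitted for brevity:

```cpp
// Each backend spells "kernel entry point", "global memory pointer", and
// "global thread index" differently; a shared header maps one set of names
// onto whichever backend is being compiled.
#if defined(AURA_BACKEND_OPENCL)
  #define AURA_KERNEL __kernel
  #define AURA_GLOBAL __global
  #define AURA_ID_ARG               /* OpenCL: id comes from get_global_id */
  #define AURA_ID (get_global_id(0))
#elif defined(AURA_BACKEND_METAL)
  #define AURA_KERNEL kernel
  #define AURA_GLOBAL device
  #define AURA_ID_ARG , uint aura_id [[thread_position_in_grid]]
  #define AURA_ID (aura_id)
#else // host fallback: the same source compiles as plain C++ for testing
  #define AURA_KERNEL
  #define AURA_GLOBAL
  #define AURA_ID_ARG , unsigned aura_id
  #define AURA_ID (aura_id)
#endif

// One kernel source shared by OpenCL, Metal, and the host fallback.
AURA_KERNEL void add(AURA_GLOBAL const float* a,
                     AURA_GLOBAL const float* b,
                     AURA_GLOBAL float* dst AURA_ID_ARG)
{
    dst[AURA_ID] = a[AURA_ID] + b[AURA_ID];
}
```

Compiled with AURA_BACKEND_OPENCL or AURA_BACKEND_METAL defined, the preprocessor emits backend-flavored kernel source; with neither defined, the kernel becomes an ordinary C++ function that a behavioral test can call directly.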
Cross Platform Memory and Kernel Orchestration
Now that we have found a simple way to share kernels between OpenCL and Metal - and thus between Python and C++, on MacOS, Linux, and iOS - we must discuss the kernel orchestration code: the code that performs the required memory copies to and from GPUs, launches kernels, and waits for them to complete.
We developed a library called Aura that abstracts the kernel, memory, and synchronization facilities that the different GPU backends expose into a consistent API. Aura is a header-only C++ library - the main design goals are for it to be user-friendly for developers and to vaporize after the compiler is done with it, i.e. to add zero overhead over using the backend API directly.
Listing 3 shows the canonical example of the Aura host API: orchestrating memory allocation, kernel invocation, and synchronization. First, an aura environment object and a device are created. A device represents compute hardware that can execute kernels asynchronously.
Next, we create a feed for the device. You can think of a feed as a queue of kernel invocations and memory copy commands that assures in-order execution of the commands added to it. Furthermore, it allows synchronization (ensuring that all commands on the feed have finished).
We then load a library - a set of kernels we want to compile and execute. Creating the library compiles the kernels if necessary. From the library, we select a single kernel (add) and create a kernel object.
Now, we prepare memory in the calling (host) memory space (std::vector) and in the device memory space (aura::device_array), and copy the host memory to the device memory. Notice that the copy command requires a feed object: this indicates that the copy is asynchronous - it is not guaranteed to have completed when the function call returns.
To call a kernel, we need to bind execution information to it: how many instances of the kernel should be launched and which feed it should be launched on. The number of instances is defined by the mesh (the total number of kernel instances to be dispatched) and the bundle (a group of kernel instances that can share high-speed local memory and be synchronized together).
The result of this binding step is a kernel_instance that can now be invoked.
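The steps above can be sketched end-to-end with a host-only mock. All names here are reconstructed from this walkthrough and are illustrative only - the real Aura API may differ, and a real feed queues commands asynchronously rather than running them immediately:

```cpp
#include <cstddef>
#include <vector>

namespace toy {

struct device {};                          // handle to compute hardware

struct feed {                              // in-order command queue on a device
    explicit feed(device&) {}
    void synchronize() {}                  // wait until all commands have finished
};

template <typename T>
struct device_array {                      // memory in the device memory space
    device_array(std::size_t n, device&) : data(n) {}
    std::vector<T> data;
};

// Copies take a feed because they are asynchronous on a real device;
// in this mock they complete immediately.
template <typename T>
void copy(const std::vector<T>& src, device_array<T>& dst, feed&) { dst.data = src; }
template <typename T>
void copy(const device_array<T>& src, std::vector<T>& dst, feed&) { dst = src.data; }

struct mesh { std::size_t size; };         // total kernel instances to dispatch
struct bundle { std::size_t size; };       // instances sharing local memory

} // namespace toy

// The "add" kernel: each instance handles one element.
static void add_kernel(std::size_t id, const float* a, const float* b, float* dst) {
    dst[id] = a[id] + b[id];
}

std::vector<float> run_add_example() {
    toy::device d;                         // 1. create a device
    toy::feed f(d);                        // 2. create a feed on the device

    std::vector<float> ha = {1, 2, 3, 4};  // host memory
    std::vector<float> hb = {10, 20, 30, 40};
    std::vector<float> hr(4);

    toy::device_array<float> da(4, d);     // device memory
    toy::device_array<float> db(4, d);
    toy::device_array<float> dr(4, d);

    toy::copy(ha, da, f);                  // 3. host -> device copies (async)
    toy::copy(hb, db, f);

    toy::mesh m{4};                        // 4. launch 4 kernel instances
    toy::bundle b{2}; (void)b;             //    (bundling ignored by this mock)
    for (std::size_t id = 0; id < m.size; ++id)
        add_kernel(id, da.data.data(), db.data.data(), dr.data.data());

    toy::copy(dr, hr, f);                  // 5. device -> host copy
    f.synchronize();                       // 6. wait for completion
    return hr;
}
```

The loop stands in for invoking a kernel_instance; on a real backend, the mesh and bundle sizes would be handed to the GPU runtime instead.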
Listing 3 runs untouched on all platforms (MacOS, Linux, iOS, Android) and all backends (OpenCL, Metal, CUDA, and Vulkan) that Aura supports. The backend is selected with a preprocessor directive.
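For illustration, the selection might look like the following - the macro names and the platform mapping are our assumptions, not necessarily Aura's actual spelling:

```cpp
// Pick exactly one backend before including the library headers
// (macro names illustrative):
//   AURA_BACKEND_OPENCL  - MacOS, Linux
//   AURA_BACKEND_METAL   - MacOS, iOS
//   AURA_BACKEND_CUDA    - Linux
//   AURA_BACKEND_VULKAN  - Linux, Android
#define AURA_BACKEND_METAL
```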
We discussed how Butterfly uses GPGPU to develop image processing features and to move quickly from a prototype to a feature that can be shipped to a customer (after proper validation and verification). We make use of the extensive Python ecosystem to iterate quickly on ideas, then port to a Python-based OpenCL implementation. We are able to use the essence of the OpenCL implementation - the compute kernels - in a C++ prototype as well as in the mobile-device production software. We also introduced a GPGPU orchestration library called Aura that simplifies writing GPGPU orchestration code for the prevalent backends Metal, CUDA, Vulkan, and OpenCL.
We have glossed over some things here - one of the big questions you might have is: Android doesn't support any of OpenCL, Metal, or CUDA, so how do we do this on Android? Stay tuned for that in our next article.