Batch- vs Stream-Processing: Distributed Computing for Biology
At Recursion, we’re finding cures for rare diseases by testing drug compounds against human cells, en masse. Through machine learning approaches, our data scientists figure out which drugs are effective. But before they can get their hands on the data, the engineering team has to shuttle raw data from the lab into the cloud and process it along the way.
Every week, our biologists run experiments on 384-well glass-bottom plates full of cells, and then we snap high-resolution images of the cells under a microscope.
Now, suppose you have a bunch of cells, and you want to quantify their geometric features: their roundness, the thickness of their membranes, the shape of their mitochondria, and so forth. Data scientists need these cellular features to figure out which drugs are working.
“No problem,” you think. You work with a bunch of talented microscopists who add several stains, molecules which adhere to certain features of the cell and reflect distinct wavelengths of light. As you run the cells through the microscope, you can snap several photos at several different wavelengths.
This gives you a set of images representing the same set of cells. Now you need to extract features from those images. We use two approaches for this; the first is a more algorithmic, feature-based approach using CellProfiler, an open-source program designed to, well, profile cells. The second is our deep-learning approach, in which we train neural nets to discover cellular features humans might not have considered.
Now, both of these approaches are very CPU- and memory-intensive. If you’re running a relatively small experiment, that might not be such a problem. But our lab can produce up to 11TB of images per week (and has plans to scale bigger). So it’s time for a little distributed computing.
Now, originally, each experiment we ran required slightly different configurations for CellProfiler. We finished an experiment, looked at a random sample of the images we’d acquired, and built a bespoke model to work well for that particular experiment. So effectively, our approach was:
“I am going to upload all of the images to the cloud and then process them all at once!”
As our model has gotten better, we no longer see significant benefit to these adjustments. Now we just run a standard suite to profile our cells. Moreover, we process an order of magnitude more images. So this approach ends up being a little too naïve for what we need to do.
Our initial implementation looked something like this:
We run a daemon on a machine attached to the microscope. It watches the filesystem and when it sees a new image appear, it passes it to an image microservice.
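At its core, that daemon is just a loop watching a directory for new files. Here's a minimal polling sketch; the real service and its image-microservice client are internal, so the function names and the `handle_image` callback here are hypothetical:

```python
import os
import time

def watch_for_images(directory, handle_image, poll_interval=1.0, max_polls=None):
    """Poll `directory` for new files and pass each new path to `handle_image`.

    Illustrative only: the production daemon is internal to our pipeline.
    `max_polls` exists purely so the loop can be exercised in a test.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for entry in os.scandir(directory):
            if entry.is_file() and entry.path not in seen:
                seen.add(entry.path)
                # In production this would POST the file to the image microservice.
                handle_image(entry.path)
        polls += 1
        time.sleep(poll_interval)
```

A production version would likely use inotify or a library like watchdog rather than polling, but the shape of the loop is the same.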
This microservice stashes the image in Amazon’s S3. It checks the experiment metadata we have stored in our database and determines whether the experiment is “done”, i.e. whether we have all of the images we expected to have, based on the size of the experiment. If so, we add an “experiment done” message to a queue (we use Amazon’s SQS for this).
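The "is the experiment done?" check is simple counting. Below is a hedged sketch of that logic: plain dicts stand in for our database, and the SQS client is optional so the function can be tested offline (`send_message` is the real boto3 SQS call, but everything else here is illustrative):

```python
import json

def record_image_and_maybe_finish(experiment_id, uploaded_counts,
                                  expected_counts, sqs_client=None,
                                  queue_url=None):
    """Increment the uploaded-image count for an experiment and, once every
    expected image has arrived, enqueue an "experiment done" message.

    Sketch only: in the real service the counts live in a database and the
    message goes to Amazon SQS.
    """
    uploaded_counts[experiment_id] = uploaded_counts.get(experiment_id, 0) + 1
    done = uploaded_counts[experiment_id] >= expected_counts[experiment_id]
    if done and sqs_client is not None:
        sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps(
                {"event": "experiment_done", "experiment": experiment_id}
            ),
        )
    return done
```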
A distributor service watches for this message and then spins up machines in our Kubernetes cluster. It computes the number of boxes it will use (up to a max of 500) from the number of images and the historical average per-image processing time, choosing enough boxes that processing finishes in under an hour.
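The sizing arithmetic is straightforward: total work divided by the wall-clock budget gives the parallelism needed. The one-hour target and the 500-box cap are real; the exact formula below is our reconstruction, and the parameter names are illustrative:

```python
import math

def boxes_needed(num_images, avg_seconds_per_image,
                 target_seconds=3600, max_boxes=500):
    """How many worker boxes to spin up so an experiment finishes within
    `target_seconds`, capped at `max_boxes`.

    total work = num_images * avg_seconds_per_image; dividing by the
    wall-clock budget gives the required parallelism.
    """
    total_work_seconds = num_images * avg_seconds_per_image
    return max(1, min(max_boxes, math.ceil(total_work_seconds / target_seconds)))
```

For example, 1,000 images at 36 seconds each is 10 box-hours of work, so ten boxes finish it in an hour; a million such images would want 10,000 boxes and hit the 500-box cap.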
This approach worked well for years, but as our lab scaled, we started to see a number of problems:
1. It’s not as fast as it could be
Since batch processing can’t start until the entire experiment has been imaged, we have hours of idle time wherein nothing is being processed.
2. It can’t cope with failure
Suppose a lab technician drops an experiment plate, or we have networking problems that prevent an image from being uploaded, or the microscope cannot save the image because our file system is full, or our image microservice is being DDoSed by terrorists who don’t want us to cure diseases. All of these things have happened to us at some point (well, except the last one). And in these cases, our image processing microservice can’t register the experiment as “done”, since it’s waiting for a missing image. So, processing never gets kicked off.
Similarly, when we have issues with a particular box, it’s not always easy to rerun just the failing portion of the experiment. Our implementation expected to be invoked for an entire experiment, not a particular image or set of images.
3. It can’t react to experimental mistakes
Sometimes there are issues in the lab that require us to abort or re-run an experiment. For example, there could be contamination in a portion of the experiment, or the microscope could be out-of-focus. We’d like to know about these problems before spending a lot of money on processing, but with our batch implementation, they are invisible until the entire experiment has been completed.
4. It’s expensive
Our pipeline uses cloud infrastructure products from both Amazon and Google. We spin up thousands of high-memory instances for computation and have storage costs for images at rest. This gets expensive quickly. So it behooves us to avoid processing images when we don’t have to, such as when there are microscope focus issues, as mentioned above.
In summary, all of these problems stem from the way we initially thought about the problem: “We’re going to get an experiment and batch-process it in parallel.” What we actually want is to process the experiment as a stream of work units. So we came up with a better way to think about the problem:
“What is the smallest unit of work a delegate can do?”
In our case, the smallest unit of work is not a single image, since some of the processing we’d like to do requires that we have the entire set of images across all wavelengths for a given location in a microplate. But we don’t need to wait for the entire experiment (which consists of many plates) to be imaged; instead, once our image-uploading microservice sees the requisite images for a given site, we can send a message to our queue.
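The site-completeness logic above can be sketched as a small collector that keys images by (plate, well, site) and emits a work unit the moment every expected wavelength channel is present. The names and the closure style here are our own illustration, not the production code:

```python
from collections import defaultdict

def site_collector(expected_channels):
    """Return a function that accepts (plate, well, site, channel) for each
    uploaded image and returns a complete work unit as soon as all expected
    wavelength channels for that site have arrived, or None otherwise.
    """
    pending = defaultdict(set)

    def add_image(plate, well, site, channel):
        key = (plate, well, site)
        pending[key].add(channel)
        if pending[key] >= expected_channels:  # superset check: all channels seen
            del pending[key]
            return key  # complete unit of work, ready to enqueue
        return None

    return add_image
```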
The role of the distributor then becomes much simpler: read a unit of work from the queue and spin up a box to process it. If we’ve hit our maximum, wait until a machine is free before trying to read from the queue.
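A simplified version of that loop might look like the following. This is a sketch under stated assumptions: `spin_up_box` would call the Kubernetes API in the real service, and `active_boxes` (a one-element mutable counter) stands in for querying the number of currently running pods. The sketch stops at capacity or on an empty queue so it can be tested; the real service would block and retry instead:

```python
import queue

def distribute(work_queue, spin_up_box, max_boxes, active_boxes):
    """Drain `work_queue`, launching one box per unit of work, but never
    exceeding `max_boxes` simultaneously running boxes."""
    launched = []
    while True:
        if active_boxes[0] >= max_boxes:
            break  # the real service would wait for a box to free up, then retry
        try:
            unit = work_queue.get_nowait()
        except queue.Empty:
            break  # the real service would keep polling the queue
        spin_up_box(unit)  # hypothetical stand-in for a Kubernetes pod launch
        active_boxes[0] += 1
        launched.append(unit)
    return launched
```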
This can give feedback to our lab technicians earlier in the process; when they see microscope focus issues in one of the wells of the microplate, for example, they can stop processing the rest of the experiment by emptying the queue and re-imaging the experiment. Moreover, we save a considerable amount of money that would have been wasted on processing experiment plates that will need to be re-imaged anyway.
Changing our approach from a batch-processing to a stream-processing model has saved us an enormous amount of time and money. There’s a question I now ask myself explicitly whenever I think about distributed computing: how small can I make the unit of work that can be done independently? And how early can I start it?