The Unforgiving Asymptote of Chasing Machine Learning Gains
1. Executive Summary
Much ink has been spilled on the systemic causes of hidden technical debt in machine learning systems (Sculley et al., 2015). Such issues include: “hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.”
While we agree that such issues pose very real challenges, in this blog we focus on a less systemic (though perhaps just as insidious) source of technical debt: the commonplace hope among machine learning researchers (including this author) that, when tackling novel problems or datasets, significant performance gains are merely a couple of easy experiments away. We provide two case studies illustrating that, all too often, such an attitude is met with little quantifiable improvement. These findings help inform the optimal use of valuable human time and mental capital.
2. Case Study 1: Synthesizing Robustness Introduction
IQT Labs’ Synthesizing Robustness project sought to quantify whether synthetic data can be used to improve the detection of rare objects. For a detailed synopsis of results see our project writeup. In brief, we explored techniques for augmenting sparse datasets with both real and synthetic data to improve the detection of rare aircraft. Few-shot learning (particularly on the unique RarePlanes dataset) is a complex topic and one we won’t focus on in this blog. Instead, we will focus on the experimental results of the Synthesizing Robustness project.
Our goal for Synthesizing Robustness was to maximize the F1-score of a computer vision model designed to detect rare aircraft from satellite imagery. F1-score penalizes both false positives (spurious detections) and false negatives (missed detections), which makes it a good performance metric for object detection. We used the YOLTv4 scalable object detection model for this project. YOLTv4 invests significant effort in pre-processing data for training and post-processing predictions for massive satellite images, but uses the well-known YOLOv4 framework for the deep learning object detection portion.
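To make the trade-off concrete, here is a minimal sketch of how an F1-score can be computed from raw detection counts. The function name and the example counts are hypothetical and are not taken from the project; they only illustrate that extra false positives and extra false negatives pull the score down in the same way.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Compute F1 from detection counts (hypothetical helper, not project code)."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        # No objects present and none predicted: define the score as 0.0 here.
        return 0.0
    # Algebraically equal to 2 * precision * recall / (precision + recall).
    return 2 * true_positives / denominator


# Illustrative counts only (not project results).
balanced = f1_score(80, 20, 20)             # few errors of either kind
many_false_positives = f1_score(80, 80, 20)  # many spurious detections
many_false_negatives = f1_score(80, 20, 80)  # many missed detections

print(balanced, many_false_positives, many_false_negatives)
```

In this sketch, 60 extra spurious detections and 60 extra missed detections lower the score by the same amount, which is the symmetry that makes F1 a convenient single number for detection tasks.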
Our hypothesis for this project was that synthetic data could be used to improve performance, particularly domain adapted (DA) synthetic data. As with any research project, we design a series of experiments to test this hypothesis. Our experiments fall into two camps:
A. Model parameters — We alter deep learning model architecture, as well as configuration parameters such as learning rate, training time, etc.
B. Dataset structure — While the core dataset remains static throughout the project, we nevertheless experiment with various augmentation strategies.
2.1. Rare Aircraft Detection Performance
In Figure 1 we show the evolution of project performance over time. The plot is colored by the seven experiment groups, and the icon indicates the data used for the experiment: Real Only, Real + Synthetic, or Real + DA Synthetic. DA Synthetic refers to synthetic data that has been domain adapted to look more like “real” data than the synthetic data created by the gaming engine. The y-axis is the mean F1-score (mF1) of the predictions on our test set.
The large temporal gap in Figure 1 between Experiment groups 2 and 3 is artificial, as there was a significant work stoppage that March. Figure 2 accounts for this extremely relaxing work stoppage, which better illustrates the pace of progress.
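The exact definition of mF1 is not spelled out here. As a rough sketch, assuming it is the unweighted mean of per-class F1-scores, it could be computed as below. The class names and counts are hypothetical placeholders.

```python
from statistics import mean


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positive, false positive, and false negative counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mean_f1(counts_by_class: dict) -> float:
    """Unweighted mean of per-class F1 (assumed definition of mF1)."""
    if not counts_by_class:
        return 0.0
    return mean(f1_score(*counts) for counts in counts_by_class.values())


# Hypothetical (tp, fp, fn) counts for a common and a rare class.
example_counts = {
    "common_aircraft": (8, 2, 2),  # F1 = 0.8
    "rare_aircraft": (1, 3, 3),    # F1 = 0.25
}
print(mean_f1(example_counts))
```

Under this assumption every class carries equal weight, so a poorly detected rare class lowers the mean as much as a poorly detected common one would.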
We can also estimate the number of hours poured into creating these results, as in Figure 3.
There are some important trends found in Figures 1–3. First, we note a significant 25% increase in score between Experiment #1 and Experiment #2 in only a few days. This jump is due to adding the synthetic data to our models. A brief description of each experiment is viewable by clicking on the plot and hovering over each experiment.
A second crucial point: over the following weeks, we note no significant increase in performance, as Experiments #3, #4, #5, #6 show no improvement over Experiment #2. In this block of Experiments (#3, #4, #5, #6) we undertook several standard techniques to improve model performance: newer and supposedly better deep learning architectures, different training/validation splits, and hyperparameter tuning. Yet these approaches yielded no benefit. In fact, in Experiment group #5, two of our experiments using different deep learning architectures failed completely and yielded no valid predictions (yellow triangles with mF1=0.0 in Figures 1–3).
We only managed a meaningful improvement in Experiment 7 (a 12% improvement on Experiment #2). In Experiments 2–6 the square icons denote models trained with a dataset of 3055 images including both real and DA synthetic data. The red square of Experiment 7 denotes our best model, with a dataset that includes both targeted augmentations of real data and domain-adapted synthetic data (see the project writeup for more details).
2.2. Optimizing Efforts
We can draw a few conclusions from the trends noted in the previous section.
A. Data quality trumps model architecture/hyperparameters. For this project, tuning the model architecture and hyperparameters had little effect, while strategically augmenting our dataset yielded very significant improvements.
B. Performance improvements rapidly asymptote. Our initial experiments yielded a 25% improvement in ~3 days of labor, or ~8% gain per day. Yet gaining an additional 12% improvement required ~30 days of labor, or ~0.4% performance gain per day.
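The arithmetic behind point B can be checked with a few lines. This sketch uses only the approximate figures quoted above (25% in ~3 days, then 12% in ~30 days); the function and variable names are illustrative.

```python
def gain_per_day(gain_percent: float, labor_days: float) -> float:
    """Average performance gain per day of labor."""
    return gain_percent / labor_days


# Approximate figures quoted in the text.
early_rate = gain_per_day(25, 3)   # initial experiments
late_rate = gain_per_day(12, 30)   # later experiments

print(f"early: ~{early_rate:.1f}% per day")
print(f"late:  ~{late_rate:.1f}% per day")
print(f"slowdown: ~{early_rate / late_rate:.0f}x")
```

By these rough numbers, each day of labor in the later phase bought about one twentieth of the improvement that a day bought at the start.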
Regarding point A: many open-source machine learning architectures have become so well-tuned that improving upon the default hyperparameters is a difficult process. In the computer vision domain, these architectures are usually trained on large corpora of mobile phone imagery, very different from the satellite imagery of Synthesizing Robustness. Therefore, the available trained model weights transfer poorly to our domain. Accordingly, focusing on the data (rather than model parameters) proved a better use of time for this project.
Regarding point B: researchers should be prepared for a potentially long slog to eke out a few more percentage points. This is not a new concept of course, but our findings provide another indication that the “magic” of machine learning is still usually accompanied by significant technical debt. There are times when every little bit of performance is crucial, but often labor would be better spent on efforts other than algorithm optimization, such as gathering new volumes or modalities of data. This point is noted with some chagrin, as many data scientists and researchers (the author included) dearly love the challenge of notching performance improvements.
3. Case Study 2: Where in the World
Many of the trends noted above can also be seen in IQT Labs’ Where in the World (WITW) project. WITW focused on deep learning for cross-view image geolocalization (CVIG), which is the task of geolocating a photo by comparing it to satellite imagery. The project, which is described in a three-part blog series, focused on the use of ordinary, real-world photographs instead of panoramas collected by mapping services. The project produced the WITW dataset and WITW model, a new dataset and model implementation for CVIG.
Figure 4 shows model performance (as measured by top-percentile score) versus time during the WITW project. The timespan shown is approximately one year. Each colored line denotes a consistent set of test conditions.
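The top-percentile score is not defined in this post. A common convention in the CVIG literature, assumed here, is the fraction of query photos whose correct satellite image ranks within the top 1% of all candidates. The sketch below follows that assumption; the function name, the 1% default, and the example ranks are hypothetical.

```python
def top_percentile_score(true_match_ranks, num_candidates, percentile=1.0):
    """Fraction of queries whose correct match ranks in the top `percentile` percent.

    `true_match_ranks` holds the 1-based rank of the correct satellite image
    for each query photo (assumed input format, not the project's own).
    """
    if not true_match_ranks:
        raise ValueError("at least one query is required")
    # Always allow at least rank 1, even for very small candidate pools.
    cutoff = max(1, int(num_candidates * percentile / 100))
    hits = sum(1 for rank in true_match_ranks if rank <= cutoff)
    return hits / len(true_match_ranks)


# Hypothetical ranks for five query photos against 1000 candidate images.
example_ranks = [1, 5, 12, 300, 9]
print(top_percentile_score(example_ranks, num_candidates=1000))
```

With 1000 candidates the cutoff is rank 10, so three of the five hypothetical queries count as hits.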
Because an exact timestamp was not preserved for some tests, we use the location of each record in a researcher’s notes as a proxy for time. This captures the chronological order and gets the temporal spacing approximately correct. The varying rate of notetaking is not enough to affect overall trends.
Over the course of the project, test conditions were repeatedly modified to allow us to use the most realistic (and hence difficult) test case that was possible at the time. The first tests used a simple model (our baseline) and a pre-existing dataset (CVUSA). After the introduction of our WITW model, we could accommodate test cases with limited field of view. Once our WITW dataset was ready, we could test on ordinary photographs, which was the ultimate goal of the project. Figure 4 shows the evolution of model performance under the six sets of test conditions that were most commonly used. Some test conditions were revisited for comparison purposes even after they fell out of regular use for day-to-day evaluation of possible improvements to the model and/or dataset. But except for the baseline model tests (the leftmost points in Figure 4, shown in blue), all these tests used the same evolving model and reflect its changes throughout the project.
The introduction of a more challenging set of test conditions often allowed us to focus on needed improvements that had been less evident under previous tests. Many of these initial improvements involved identifying differences between our model and a state-of-the-art open-source architecture on which our model was based. In our subsequent attempts to derive further performance improvements, one successful method was to improve dataset quality by removing irrelevant images. The effect can be seen in the lower plot of Figure 4. The purple curve (labeled “v01-v02”) shows the unimproved version of the dataset, and the brown curve (labeled “v03-v04”) shows the improved dataset, with the latter outperforming the former. Efforts were also made to improve model architecture. One approach was to introduce additional models to pre-generate semantic information about the images, which ultimately did not increase performance. In an echo of point A discussed above, efforts to improve the dataset proved more fruitful than efforts to improve the model architecture once a state-of-the-art open approach had been fully implemented.
Of the six sets of testing conditions, five show the same basic shape: a rapid rise followed by much slower growth. The “baseline model, CVUSA” curve shows that most of the improvement in the baseline model happened shortly after its initial development at the beginning of the project. The next four testing scenarios show the same trend for the WITW model after its initial development midway through the project. The last test scenario misses the rapid rise only because it came into use after that rise had already taken place for the combination of the WITW model and dataset. For each model, rapid initial improvement led to markedly slower improvement later. In this, the WITW project was consistent with point B from above.
4. Conclusions
After reviewing IQT Labs’ Synthesizing Robustness and Where in the World computer vision projects, we conclude that while incremental tuning of hyperparameters and architecture yielded few discernible gains, improvements in dataset quality and quantity yielded significant improvements.
While we acknowledge that the conclusions noted above will not be universal for all applied machine learning projects, they do hint at an encouraging situation: some machine learning architectures (e.g., object detection in computer vision) are very nearly commoditized and can be adopted confidently, provided input data and predictions are appropriately processed. One caveat of our findings is that in some cases where data is abundant and not properly curated, more data can actually cause problems rather than yield performance improvements. We also note that fundamental (not applied) research has a significantly different value proposition than the applied research we discuss here. For applied research, pre- and post-processing steps may not be quite as sexy or analytical as crafting novel deep learning architectures, but they require less expertise and are often more impactful and far quicker to implement. In the end, focusing on data quality as well as refining model outputs offers the potential of a more rapid convergence to a 90% solution, after which careful consideration will determine if the last drops of performance are worth the squeeze.