CAMOLO — Adversarial Camouflage for Overhead Object Detection
In this blog we explore the efficacy and drawbacks of adversarial camouflage in an overhead imagery context. Note: This blog is a summary of our arXiv paper.
1. Executive Summary
While a number of recent papers have demonstrated the ability to reliably fool deep learning classifiers and object detectors with adversarial patches (see our previous blog [link] for further discussion), most of this work has been performed on relatively uniform datasets with only a single object class. In this work we utilize the VisDrone dataset, which spans a large range of perspectives and object sizes, and explore four object classes: bus, car, truck, and van. We build a library of 24 adversarial patches to disguise these objects, and introduce a patch translucency variable. The translucency (or alpha value) of the patches is highly correlated with their efficacy. Further, we show that while adversarial patches may fool object detectors, the presence of such patches is often easily uncovered, with patches on average 24% more detectable than the objects they were meant to hide. This raises the question of whether such patches truly constitute camouflage. Source code is available at https://github.com/IQTLabs/camolo, and full details can be found in our arXiv paper.
2. VisDrone DataSet
For this study we use the VisDrone dataset, specifically the object detection portion (VisDrone2019-DET). This dataset includes 6,471 drone-collected images in the training set and 1,610 images in the test set, each with bounding box annotations for objects of interest.
The altitude, viewing angle, and lighting conditions are highly variable, which complicates analysis of the imagery (see Figure 1). The 353,550 bounding box labels in the training set tend to be relatively small (median extent of 34 pixels), though the size is highly variable (standard deviation of 44 pixels), which is due to the variance in altitude and viewing angle of the drone platform. We tile the VisDrone training imagery into 416×416 pixel windows for ease of ingestion into object detection algorithms. For this study we focus on four object classes (see Figure 2): bus, car, truck, and van.
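As a rough illustration of the tiling step, the sketch below slices a full VisDrone frame into fixed-size windows. The function name, stride handling, and output naming are our own assumptions for illustration rather than the exact preprocessing in the Camolo or YOLTv4 repositories, which provide their own tiling utilities.

```python
import os
from PIL import Image

def tile_image(image_path, out_dir, tile_size=416, overlap=0):
    """Slice a large frame into tile_size x tile_size windows.

    Hypothetical helper for illustration; in practice the bounding box
    labels also need to be shifted/clipped into each window's coordinates.
    """
    im = Image.open(image_path)
    w, h = im.size
    stride = tile_size - overlap
    os.makedirs(out_dir, exist_ok=True)
    for y0 in range(0, max(h - tile_size, 0) + 1, stride):
        for x0 in range(0, max(w - tile_size, 0) + 1, stride):
            window = im.crop((x0, y0, x0 + tile_size, y0 + tile_size))
            window.save(os.path.join(out_dir, f"tile_{y0}_{x0}.png"))
```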
3. Vehicle Detector
We use the YOLTv4 object detection framework to train a 4-class vehicle detector. We use a configuration file with an output feature map of 26 × 26 for improved detection of small objects. Predictions with the trained model are shown in Figure 3. We evaluate YOLTv4 detection performance with the 1,610 images in the VisDrone test set, counting as a true positive any prediction of the correct class with IoU ≥ 0.5. Scores are shown in Table 1; we report 1σ errors calculated via bootstrapping.
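To make the scoring criterion concrete, the sketch below shows how an IoU ≥ 0.5 true-positive check and a bootstrapped 1σ uncertainty might be computed. This is a minimal illustration under our own conventions, not the exact evaluation code behind Table 1.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (xmin, ymin, xmax, ymax) format."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_cls, gt_box, gt_cls, thresh=0.5):
    """A prediction is a true positive if the class matches and IoU >= thresh."""
    return pred_cls == gt_cls and iou(pred_box, gt_box) >= thresh

def bootstrap_sigma(per_image_f1, n_boot=1000, seed=0):
    """1-sigma uncertainty on mean F1 via bootstrap resampling over test images."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_image_f1, dtype=float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return float(np.std(means))
```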
4. Adversarial Patches
To train an adversarial patch, we develop the Camolo codebase, a modification of the adversarial-yolo codebase. The adversarial-yolo codebase takes a trained model and labeled imagery as input, and attempts to create a patch that, when overlaid on objects of interest, will fool the detector. Camolo makes a number of modifications:
- Increased flexibility with input variables (e.g. target patch size)
- Use with more recent versions of YOLO
- Allow patches to be semi-translucent
The most significant change (the third item above) is the method of overlaying patches according to a selected alpha value, which dictates how transparent the patch appears. We postulate that semi-translucency may help camouflage the patch itself. Previous studies have simply overwritten the existing pixels in an image with the desired patch. We instead combine the patch and original image pixels according to a desired alpha value (alpha = 1 corresponds to an opaque patch, while alpha = 0 yields an invisible patch). In Figure 4 we overlay a sample adversarial patch on VisDrone imagery with both the standard fully opaque method and the semi-translucent method.
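The blending rule itself is just a per-pixel convex combination of the patch and the underlying image. The sketch below illustrates it; the function name and tensor conventions are ours, not necessarily the exact Camolo API.

```python
import torch

def overlay_patch(image_crop, patch, alpha=0.5):
    """Alpha-blend an adversarial patch onto an image crop.

    image_crop, patch: float tensors in [0, 1] of shape (C, H, W), with the
    patch already resized to cover the desired fraction of the target box.
    alpha = 1.0 reproduces the usual opaque overwrite; alpha = 0.0 leaves
    the image untouched.
    """
    return alpha * patch + (1.0 - alpha) * image_crop
```

Because the blend is differentiable, gradients can flow through it back into the patch pixels during training, so the optimizer accounts for the chosen translucency.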
4.1. Adversarial Patch Generation
We train a variety of adversarial patches using the Camolo codebase and VisDrone dataset. All experiments use the same initial dataset and model architecture. We vary the starting patch between experiments, trying both legacy patches as well as totally random starting points. Other variables are the allowed colors of the patches and the alpha value (translucency) of the patches. The patch size (as a fraction of the area of the object of interest) and the noise level are also varied. Finally, we select one of three losses for each experiment: object (focus only on minimizing bounding box detections), class (focus on confusing the class prediction of each bounding box), and object × class. Most experiments aim to yield a non-detection (e.g. “1. obj only v0”), though some experiments seek to confuse which object is classified (e.g. “3. class only v0”), such as classifying a car as a truck. See Figure 5 for the trained patches.
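As a rough sketch of how the three loss options could be expressed (our own simplified formulation, not the exact Camolo loss code), consider the detections that overlap the patched objects:

```python
import torch

def adversarial_patch_loss(obj_conf, cls_prob, true_cls, mode="obj"):
    """Simplified loss over detections covering patched objects.

    obj_conf:  (N,) objectness scores of candidate boxes
    cls_prob:  (N, num_classes) class probabilities of those boxes
    true_cls:  index of the class we want suppressed or confused
    mode:      "obj" (suppress detections), "cls" (confuse the class),
               or "obj_cls" (the product of the two)
    """
    if mode == "obj":
        return obj_conf.max()                        # strongest surviving detection
    if mode == "cls":
        return cls_prob[:, true_cls].max()           # confidence in the true class
    return (obj_conf * cls_prob[:, true_cls]).max()  # joint objectness x class term
```

Minimizing the first variant pushes the detector toward reporting nothing at all, while minimizing the class-based variants pushes it toward assigning the wrong label.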
In Figure 6 we show successful examples of patches fooling our trained object detector.
Figure 7 shows how mean F1 (mF1) varies with alpha (translucency) and patch size. This plot shows the percentage reduction in vehicle detection provided by the patches. The Pearson correlation coefficient between vehicle detection mF1 reduction and size is 0.83 and the correlation coefficient between alpha and mF1 reduction is 0.76, indicating that larger and less translucent patches are more effective at hiding vehicles.
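The quoted statistics are standard Pearson coefficients computed over the 24 patches; a minimal version of the calculation (our own helper, matching the correlation value returned by scipy.stats.pearsonr) looks like:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences
    (e.g. per-patch alpha values vs. per-patch mF1 reduction)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum()))

# e.g. pearson(patch_sizes, mf1_reductions) and pearson(alphas, mf1_reductions)
# over the 24 trained patches yield the 0.83 and 0.76 values quoted above.
```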
4.2. Detection of Patches
We showed in Figure 7 that our patches significantly reduce the ability of our trained YOLTv4 model to detect vehicles. In this section, we explore how easy it is to detect the existence of these patches. Recall from our previous blog [link] that the presence of legacy patches is easily detected. To test the detectability of our patches, we build a training set for a generic patch detector by selecting 10 of the patches from Figure 5 and overlaying them on the VisDrone training imagery. We train a YOLTv4 model on this data, then test the generic patch detector by overlaying each of our 24 patches on the test set and scoring how robustly the patches can be detected (see Figure 8).
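The data generation step amounts to pasting (alpha-blended) patches into the labeled vehicle boxes and recording the pasted regions as "patch" labels. The sketch below is an illustrative version under our own assumptions (PIL, RGB tiles, centered patches); it is not the exact Camolo preprocessing.

```python
import random
from PIL import Image

def add_patch_labels(tile, vehicle_boxes, patch_images, alpha=1.0, frac=0.5):
    """Overlay a randomly chosen patch inside each vehicle box of an RGB tile
    and return the new 'patch' bounding boxes for training a patch detector.

    tile:           PIL RGB image (e.g. a 416x416 VisDrone window)
    vehicle_boxes:  list of integer (xmin, ymin, xmax, ymax) vehicle boxes
    patch_images:   list of PIL patch images (e.g. the 10 training patches)
    frac:           linear scale of the patch relative to the box
    """
    patch_boxes = []
    for (x0, y0, x1, y1) in vehicle_boxes:
        patch = random.choice(patch_images).convert("RGB")
        pw, ph = max(1, int((x1 - x0) * frac)), max(1, int((y1 - y0) * frac))
        px, py = x0 + ((x1 - x0) - pw) // 2, y0 + ((y1 - y0) - ph) // 2
        region = tile.crop((px, py, px + pw, py + ph))
        blended = Image.blend(region, patch.resize((pw, ph)), alpha)
        tile.paste(blended, (px, py))
        patch_boxes.append((px, py, px + pw, py + ph))
    return tile, patch_boxes
```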
Performance is illustrated in Figure 9, where the green line denotes the detection performance of the original 4-class model on unperturbed imagery, the blue bars denote the performance of the 4-class detection model with the listed adversarial patch applied, and the orange bars denote the performance of the model trained to detect the existence of patches. The 10 patches used to train the patch detection model are appended with an asterisk (∗) in Figure 9.
5. Analysis
Note in Figure 9 that the orange bars are not significantly higher for asterisked (i.e. training) patches compared to the “unseen” patches. Also note that for most patches, it is easier to detect the existence of the patch than to detect the vehicles themselves (orange bars are higher than blue bars).
If we collapse the performance of the two experiment groups (vehicle detection + patch detection), we are left with Figure 10, which shows a “detection” score: the maximum of the blue and orange bars in Figure 9. This “detection” score provides a measure of the efficacy of the patch, since an easily detected patch is not terribly effective camouflage: the patch itself exposes the presence of the object of interest. Recall that in Figures 9 and 10, lower is better. Note also that most patches provide no aggregate benefit, since they sit above the green baseline. Yet the two black and white patches (obj_only_tiny_gray_v0 and obj_only_tiny_gray_v1) are the most effective. The precise reason for this efficacy is left to later work, but we postulate that the marked difference of these two patches (i.e. grayscale vs. color) from the ten patches in our patch detector training set is largely responsible.
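In code, the aggregation behind Figure 10 is simply the worse of the two giveaways for each patch (a sketch of the scoring convention, under our own naming):

```python
def detection_score(vehicle_f1_with_patch, patch_f1):
    """Aggregate 'detection' score for a patch (Figure 10): the maximum of
    vehicle-detection mF1 (blue bar) and patch-detection mF1 (orange bar).
    Lower is better for the camouflage, since either signal reveals the target.
    """
    return max(vehicle_f1_with_patch, patch_f1)
```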
Figure 11 shows how this aggregate detection score varies with alpha and patch size. Recall from Figure 7 that larger and more opaque (higher alpha) patches are more effective at confusing the vehicle detector. Yet Figure 11 shows that patches with higher alpha and larger sizes are actually less effective in aggregate, since the existence of large, opaque patches is far easier to detect. In fact, in terms of aggregate performance, smaller, more translucent patches are preferred (Pearson correlation between detection and alpha: -0.76; between detection and patch size: -0.83). Ultimately, for true camouflage, one should apparently prioritize stealthiness by making patches small and translucent.
6. Conclusions
Adversarial patches have been shown to be effective in camouflaging objects in relatively homogeneous datasets such as Inria and DOTA. In our previous blog we showed that while patches may be effective in hiding people and aircraft, such patches are trivial to detect. This motivates our study of whether “stealthy” patches can be designed to obfuscate objects in overhead imagery. Using the diverse VisDrone dataset we train a library of 24 adversarial patches with various input parameters. While most of these patches significantly reduce the detection of our objects of interest (buses, cars, trucks, vans), most patches are still easier to detect than the vehicles themselves. Our two black and white patches are poorly detected by our patch detection model, however, likely due to their significant variance from the patch training set. This raises the question: how large and diverse a patch library is required to be truly effective? And how much effort is required on the mitigation side to train a robust patch detection model that will effectively combat adversarial camouflage? We have provided some first hints to these questions, but much work remains to be done. Besides diving into these questions, in future work we hope to introduce false positives into imagery (e.g. can a simple pattern laid out in an empty field trick a computer vision model into “detecting” a full parking lot?).