The Augmented Image Prior:
Distilling 1000 Classes by Extrapolating from a Single Image
Yuki M. Asano*
Aaqib Saeed*

*Equal Contribution
ICLR 2023 Paper

[Paper]
[Code & pretrained models]

Extrapolating from one image. Strongly augmented patches from a single image are used to train a student (S) to distinguish semantic classes, such as those in ImageNet. The student neural network is initialized randomly and learns from a pretrained teacher (T) via KL-divergence. Although almost none of the target categories are present in the image, we find student accuracies of >69% for classifying ImageNet's 1000 classes. In this paper, we develop this single-datum learning framework and investigate it across datasets and domains.
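For concreteness, below is a minimal PyTorch sketch of this kind of single-image distillation step: a randomly initialized student is trained to match a pretrained teacher's soft predictions on augmented crops of one source image. The architectures, augmentation parameters, temperature, and function names here are illustrative assumptions, not the exact configuration used in the paper.

    import torch
    import torch.nn.functional as F
    import torchvision.transforms as T
    from torchvision import models

    # Pretrained teacher (frozen) and randomly initialized student.
    teacher = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
    student = models.resnet18(weights=None)  # trained from scratch

    # Strong augmentations applied to the single source image (illustrative choices).
    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.08, 1.0)),
        T.RandomHorizontalFlip(),
        T.ColorJitter(0.4, 0.4, 0.4, 0.2),
        T.ToTensor(),
    ])

    def distill_step(single_image, optimizer, tau=8.0, batch_size=64):
        # Build a batch purely from augmented views of the single image (a PIL.Image).
        batch = torch.stack([augment(single_image) for _ in range(batch_size)])
        with torch.no_grad():
            t_logits = teacher(batch)
        s_logits = student(batch)
        # Temperature-scaled KL divergence between student and teacher distributions.
        loss = F.kl_div(
            F.log_softmax(s_logits / tau, dim=1),
            F.softmax(t_logits / tau, dim=1),
            reduction="batchmean",
        ) * tau ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Example usage (hypothetical optimizer settings):
    # optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
    # loss = distill_step(single_image, optimizer)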

Key contributions

  • A minimal framework for training neural networks from scratch with a single datum using knowledge distillation.
  • Extensive ablations of the proposed method, covering the dependency on the source image and the choice of augmentations and network architectures.
  • Large-scale empirical evidence of neural networks' ability to extrapolate, on >12 vision and audio datasets.
  • Qualitative insights on what and how neural networks trained with a single image learn.

Abstract

What can neural networks learn about the visual world when provided with only a single image as input? While any image obviously cannot contain the multitudes of all existing objects, scenes and lighting conditions – within the space of all 256^(3×224×224) possible 224-sized square images, it might still provide a strong prior for natural images. To analyze this augmented image prior hypothesis, we develop a simple framework for training neural networks from scratch using a single image and augmentations, via knowledge distillation from a supervised pretrained teacher. With this, we find the answer to the above question to be: surprisingly, a lot. In quantitative terms, we find accuracies of 94%/74% on CIFAR-10/100, 69% on ImageNet, and, by extending this method to video and audio, 51% on Kinetics-400 and 84% on SpeechCommands. In extensive analyses spanning 13 datasets, we disentangle the effect of augmentations, choice of data and network architectures, and also provide qualitative evaluations that include lucid panda neurons in networks that have never even seen one.


Talk



Selected Results

Distilling dataset. 1 image + augmentations ≈ almost 50K in-domain CIFAR-10/100 images.

Distilling source image. Content matters: less dense images do not train as well.

Distilling audio representations. Our approach also generalizes to audio by using 1 audio clip + augmentations.

Larger-scale datasets. Our method scales to larger models using 224×224-sized images.

Analysis of IN-1k model distillations. We vary the distillation dataset and the teacher and student configurations. We achieve 69% top-1 single-crop accuracy on IN-1k. Even with just the argmax (AM) signal, performance remains high at 44%. See the paper for more details.
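To illustrate the difference between the full soft-label signal and the argmax (AM) variant mentioned above, here is a hedged sketch of the two losses (function name and temperature are illustrative, not the paper's exact implementation):

    import torch.nn.functional as F

    def distillation_loss(s_logits, t_logits, mode="kl", tau=8.0):
        # Soft-label distillation: match the teacher's full output distribution.
        if mode == "kl":
            return F.kl_div(
                F.log_softmax(s_logits / tau, dim=1),
                F.softmax(t_logits / tau, dim=1),
                reduction="batchmean",
            ) * tau ** 2
        # Argmax (AM) supervision: keep only the teacher's top-1 prediction
        # and train with ordinary cross-entropy on these hard pseudo-labels.
        hard_labels = t_logits.argmax(dim=1)
        return F.cross_entropy(s_logits, hard_labels)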

Varying student width or depth. We find that wide models benefit more from increased parameter counts and can even reach the teacher's performance on ImageNet.

Visualizing neurons. We find neurons that fire for objects the network has never seen.



Our training data

Training data. We generate a dataset from a single datum and use it for training networks from scratch.
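As a rough illustration of how such a training set can be generated, the sketch below carves a fixed number of random crops out of the single source image and stores them to disk; further augmentation would then be applied on the fly during distillation. Paths, patch counts, crop scales and the output size are assumptions for illustration only.

    from pathlib import Path
    from PIL import Image
    import torchvision.transforms as T

    def generate_patches(image_path, out_dir, num_patches=50_000, out_size=32):
        # Build a static "dataset" of random crops from one image.
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        image = Image.open(image_path).convert("RGB")
        crop = T.RandomResizedCrop(out_size, scale=(0.01, 1.0), ratio=(3 / 4, 4 / 3))
        for i in range(num_patches):
            crop(image).save(out / f"patch_{i:05d}.png")

    # Example usage (hypothetical file names):
    # generate_patches("single_image.jpg", "patches/", num_patches=50_000)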



Paper and Supplementary Material

Y. M. Asano, A. Saeed
The Augmented Image Prior: Distilling 1000 Classes by Extrapolating from a Single Image
(ICLR 2023)


[Bibtex]


Acknowledgements

Y.M.A. is thankful for MLRA funding from AWS. We also thank T. Blankevoort, A.F. Biten and S. Albanie for useful comments on a draft of this paper.