Extrapolating from a Single Image to a Thousand Classes using Distillation
Yuki M. Asano*
Aaqib Saeed*
*Equal Contribution

[Paper]
[Code & pretrained models]

Extrapolating from one image. Strongly augmented patches from a single image are used to train a student (S) to distinguish semantic classes, such as those in ImageNet. The student neural network is initialized randomly and learns from a pretrained teacher (T) via KL-divergence. Although almost none of the target categories are present in the image, we find student accuracies of over 66% when classifying ImageNet's 1000 classes. In this paper, we develop this single-datum learning framework and investigate it across datasets and domains.
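For concreteness, here is a minimal sketch of the distillation step, assuming PyTorch; the function name, the patch loader, and the temperature value are illustrative placeholders, not the paper's exact settings:

    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, patches, optimizer, tau=4.0):
        # One step: match the student to the frozen teacher via KL-divergence.
        with torch.no_grad():
            t_logits = teacher(patches)          # soft targets from the teacher
        s_logits = student(patches)
        # Temperature-scaled KL between teacher and student class distributions.
        loss = F.kl_div(
            F.log_softmax(s_logits / tau, dim=1),
            F.softmax(t_logits / tau, dim=1),
            reduction="batchmean",
        ) * tau * tau
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Here `patches` is a batch of strongly augmented crops taken from the single source image; no labels are needed, since the teacher's outputs serve as the training signal.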

Key contributions

  • A minimal framework for training neural networks with a single datum from scratch using distillation.
  • Extensive ablations of the proposed method, such as its dependency on the source image and the choice of augmentations and network architectures.
  • Large-scale empirical evidence of neural networks' ability to extrapolate on > 12 vision and audio datasets.
  • Qualitative insights on what and how neural networks trained with a single image learn.

Abstract

What can neural networks learn about the visual world from a single image? While a single image obviously cannot contain the multitudes of all existing objects, scenes and lighting conditions, within the space of all 256^(3×224×224) possible 224-sized square images it might still provide a strong prior for natural images. To analyze this hypothesis, we develop a simple framework for training neural networks from scratch using a single image and knowledge distillation from a supervised pretrained teacher. With this, we find the answer to the above question to be: surprisingly, a lot. In quantitative terms, we find accuracies of 94%/74% on CIFAR-10/100, 66% on ImageNet, and, by extending this method to video and audio, 51% on Kinetics-400 and 84% on SpeechCommands. In extensive analyses we disentangle the effect of augmentations, choice of data and network architectures, and also discover "panda neurons" in networks that have never even seen one. This work demonstrates that one image can be used to extrapolate to thousands of object classes and motivates research on the fundamental interplay of augmentations and images.


Selected Results

Distilling dataset. 1 image + augmentations ≈ 50K in-domain CIFAR-10/100 images.

Distilling source image. Content matters: less dense images do not train students as well.

Distilling audio representations. Our approach also generalizes to audio, using 1 audio clip + augmentations.

Analysis of IN-1k model distillations. We vary the distillation dataset and the teacher and student configurations, reaching 66% top-1 single-crop accuracy on IN-1k. See the paper for more details.

Visualizing neurons. We find neurons that fire for objects the network has never seen.
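One simple way to surface such neurons (a sketch, not necessarily the paper's exact procedure) is to rank probe inputs by the mean spatial activation of a chosen feature channel; `model_trunk` below stands for the network truncated at some late layer:

    import torch

    @torch.no_grad()
    def top_activating(model_trunk, inputs, channel, k=5):
        # Feature maps of shape (N, C, H, W) from the truncated network.
        feats = model_trunk(inputs)
        # Mean spatial activation of the chosen channel for each input.
        scores = feats[:, channel].mean(dim=(1, 2))
        idx = scores.topk(k).indices
        return inputs[idx], scores[idx]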



Our training data

Training data. We generate a dataset from a single datum and use it for training networks from scratch.
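As a rough sketch of how such a dataset can be materialized (assuming torchvision; the file name and the exact augmentation recipe are placeholders, since the paper ablates these choices):

    from PIL import Image
    from torchvision import transforms

    # Strong augmentations turn one image into an endless stream of patches.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
        transforms.ToTensor(),
    ])

    source = Image.open("single_image.jpg").convert("RGB")  # the single datum
    patches = [augment(source) for _ in range(50_000)]      # a pseudo-"dataset"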



Paper and Supplementary Material

Y. M. Asano, A. Saeed
Extrapolating from a Single Image to a Thousand Classes using Distillation
(hosted on arXiv)


[Bibtex]


Acknowledgements

Y.M.A. is thankful for MLRA funding from AWS. We also thank T. Blankevoort, A.F. Biten and S. Albanie for useful comments on a draft of this paper.