In this blog series, we summarise our favourite non-conventional machine learning and artificial intelligence papers.
Distilling the Knowledge in a Neural Network
Hinton, Vinyals, & Dean (2015).
In 2015, Google Brain presented distillation, a machine learning technique for reducing model size and complexity. The general motivation is to distil the knowledge contained in a cumbersome model into a lighter one: an ensemble of models, or a very deep neural network, can be compressed into a lightweight model suitable for evaluation tasks. The core insight is that the softmax outputs of a network contain additional information about the predicted labels, such as ambiguity and uncertainty. The predictive distribution over labels that a complicated model learns therefore carries valuable information that other models can exploit.
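The mechanism is a temperature-scaled softmax: raising the temperature softens the teacher's predictive distribution so the student can see the relative similarities between classes. A minimal NumPy sketch (the logits and temperature here are illustrative, not from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over logits z."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for one image: confident about class 0,
# but classes 2 and 3 are plausible confusions.
teacher_logits = np.array([6.0, 1.0, 4.0, 3.5])

hard = softmax(teacher_logits, T=1.0)   # near one-hot
soft = softmax(teacher_logits, T=4.0)   # similarity structure exposed

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -np.sum(p * np.log(q + 1e-12))
```

The student is trained to match `soft` rather than a one-hot label, so it inherits the teacher's knowledge about which classes resemble each other.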
Hinton, Vinyals, and Dean evaluate distillation in the contexts of image classification, speech recognition, and ensemble learning, and also examine it as a regulariser. Applied to MNIST, training a smaller network on the softmax predictions of a larger network (rather than on the hard labels) approximately halved the number of misclassified images. For speech recognition, the averaged predictions of an ensemble of models were used to train a single model; the distilled model outperformed any individual model in the ensemble, and was only slightly worse than the ensemble itself.
Our opinion: We often throw away the ambiguity captured in a model's predictive distribution. Viewing this distribution as a source of information is a key insight.
The Mechanics of n-Player Differentiable Games
Published in June 2018, this DeepMind paper looks at a series of toy problems in game theory through the lens of optimisation. The first key contribution is to note that the dynamics of a game (how players update their strategies) can be seen as travelling on the level sets of a corresponding Hamiltonian, the equivalent of a conservation law in physics. Even more interestingly, the authors notice that finding the minimum of this Hamiltonian is equivalent to finding a Nash equilibrium in the loss space.
Motivated by these observations, the authors use the Helmholtz decomposition to resolve games into two components: potential games and Hamiltonian games. The first is well explored, whilst the second relates to the aforementioned contribution. Building on this, the authors propose a new optimisation method, Symplectic Gradient Adjustment, which is able to find stable fixed points in general games via the aforementioned Hamiltonian.
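As a rough illustration of the idea (a sketch, not the paper's full algorithm), consider the classic Hamiltonian game where one player minimises xy and the other minimises -xy. Plain simultaneous gradient descent merely rotates around the equilibrium at the origin; adjusting the gradient with the antisymmetric part of the game Jacobian pulls the dynamics in towards the fixed point:

```python
import numpy as np

# Two-player Hamiltonian game: player 1 minimises x*y, player 2 minimises -x*y.
# The simultaneous gradient xi = (dL1/dx, dL2/dy) = (y, -x) is a pure rotation.
def xi(v):
    x, y = v
    return np.array([y, -x])

def sga_step(v, lr=0.05, lam=0.5):
    """One descent step along xi + lam * A^T xi, where A is the
    antisymmetric part of the Jacobian of xi (constant for this game)."""
    J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # Jacobian of xi
    A = 0.5 * (J - J.T)                        # antisymmetric part
    adjusted = xi(v) + lam * A.T @ xi(v)
    return v - lr * adjusted

v = np.array([1.0, 1.0])
for _ in range(500):
    v = sga_step(v)
# v has spiralled in towards the Nash equilibrium at the origin
```

With `lam = 0` the iterates circle the origin indefinitely; the adjustment term adds the contraction that makes the fixed point attracting.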
Our opinion: Unlike most related work in this field, the authors produce an optimisation technique applicable to general games, not merely two-player games such as GANs.
Meta-Learning for Semi-Supervised Few-Shot Classification
Few-shot classification aims to move machine learning away from its dependency on large datasets, instead asking models to learn concepts correctly from a handful of examples, much as humans do. In this paper, the authors focus on refining Prototypical Networks. These networks are presented with a small set of labelled training images and produce an embedding for each datapoint. The embedded datapoints are then clustered, and prediction amounts to finding the nearest cluster in the embedding space. A model initialisation, a sampling of training images, and the associated training together constitute a single episode. End-to-end training is performed over many randomised episodes, so the model learns how to few-shot learn efficiently across episodes: a meta-learning strategy.
The authors augment the training set with unlabelled images. These images may belong to the episode's classes, or be distractors: images from entirely new classes. The setting is thus analogous to semi-supervised learning. The authors handle the unlabelled data with three techniques: (1) predicting the classes of unlabelled images and incorporating them into the current clusters via soft k-means, (2) clustering all distractors into a common class, and (3) introducing a small multi-layer perceptron that produces a mask over which to apply soft k-means.
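Technique (1) can be sketched as a single soft k-means refinement step: each unlabelled embedding contributes to every class prototype in proportion to its soft assignment. A minimal NumPy sketch, with illustrative 2-D embeddings rather than real network outputs:

```python
import numpy as np

def soft_kmeans_refine(prototypes, labeled, labels, unlabeled):
    """One soft k-means step: refine class prototypes using unlabelled
    embeddings, each weighted by its soft assignment to the prototypes."""
    # squared distances from each unlabelled point to each prototype
    d = ((unlabeled[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d)
    w = w / w.sum(axis=1, keepdims=True)           # soft assignments
    refined = []
    for k in range(len(prototypes)):
        mask = labels == k
        num = labeled[mask].sum(0) + (w[:, k:k + 1] * unlabeled).sum(0)
        den = mask.sum() + w[:, k].sum()
        refined.append(num / den)
    return np.stack(refined)

# Two 2-D classes with one labelled point each, plus two unlabelled points
# sitting above them: the refined prototypes are pulled towards them.
protos = np.array([[0.0, 0.0], [2.0, 0.0]])
refined = soft_kmeans_refine(protos, protos.copy(),
                             np.array([0, 1]),
                             np.array([[0.0, 1.0], [2.0, 1.0]]))
```

Each prototype ends up between its labelled point and the nearby unlabelled mass, which is exactly the refinement the paper exploits.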
Our opinion: This helps extend models to more realistic situations in which not all data is labelled. However, the measured performance improvements are only marginal.
Adversarially Augmented Adversarial Training
The authors describe a methodology for improving classifier robustness in the face of adversarial attacks designed to deliberately induce misclassifications. They suggest training an auxiliary deep learning network to recognise adversarial noise. Rather than assessing the original input, for instance a facial image, the discriminator learns to recognise and filter out adversarial noise in one of the classifier's hidden layers.
The authors describe a simple experiment in which a classifier and a discriminator are co-trained, the input of the discriminator being a hidden layer of the classifier. The two networks are supplied with clean and perturbed MNIST data, whereupon the discriminator learns which noise to filter out. The robustness of the classifier improves when coupled with this adversarial filter, reducing the number of misclassifications.
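To give a flavour of the discriminator component alone (a toy sketch, not the paper's co-training setup), here are synthetic "hidden layer" activations, some carrying a consistent adversarial perturbation direction, and a logistic-regression discriminator learning to tell them apart:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic hidden activations: clean ones, and ones shifted along a fixed
# "adversarial" direction (an assumption for illustration; the paper taps
# a real classifier's hidden layer instead).
clean = rng.normal(0, 1, size=(500, 16))
delta = rng.normal(0, 1, size=16)
perturbed = rng.normal(0, 1, size=(500, 16)) + 0.8 * delta

X = np.vstack([clean, perturbed])
y = np.r_[np.zeros(500), np.ones(500)]   # 1 = adversarial

# Logistic-regression discriminator trained by gradient descent.
w, b = np.zeros(16), 0.0
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

acc = (((1 / (1 + np.exp(-(X @ w + b)))) > 0.5) == y).mean()
```

Even this linear discriminator separates the two populations well once the perturbation has a consistent signature in the hidden representation, which is the property the authors' filter relies on.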
Our opinion: This is a promising first step in solving the problem of high variance in inputs and potential adversarial attacks in online security, facial recognition and computer vision in general. In their proof of concept, the addition of an adversarial discriminator increased the accuracy from 25% to 96%.
This post was written by Akbir Khan, Sean Billings and Sean Hooper — Research Engineers at Spherical Defence Labs.