Uncertainty in Deep Learning
(PhD Thesis)
October 13th, 2016
Thesis: Uncertainty in Deep Learning
Some of the work in the thesis was previously presented in [Gal, 2015; Gal and Ghahramani, 2015a,b,c,d; Gal et al., 2016], but the thesis contains many new pieces of work as well. The most notable of these are
- some discussions: a discussion of AI safety and model uncertainty ($\S$1.3), a historical survey of Bayesian neural networks ($\S$2.2),
- some theoretical analysis: a theoretical analysis of the variance of the re-parametrisation trick and other Monte Carlo estimators used in variational inference (the re-parametrisation trick is not a universal variance reduction technique! $\S$3.1.1–$\S$3.1.2), a survey of measures of uncertainty in classification tasks ($\S$3.3.1),
- some empirical results: an empirical analysis of different Bayesian neural network priors ($\S$4.1) and posteriors with various approximating distributions ($\S$4.2), new quantitative results comparing dropout to existing techniques ($\S$4.3), tools for heteroscedastic model uncertainty in Bayesian neural networks ($\S$4.6),
- some applications: a survey of recent applications in language, biology, medicine, and computer vision making use of the tools presented in this thesis ($\S$5.1), new applications in active learning with image data ($\S$5.2),
- and more theoretical results: a discussion of what determines what our model uncertainty looks like ($\S$6.1–$\S$6.2), an analytic treatment of the dropout approximating distribution in Bayesian linear regression ($\S$6.3), an analysis of ELBO-test log likelihood correlation ($\S$6.4), and discrete prior models ($\S$6.5), as well as an interpretation of dropout as a proxy posterior in spike and slab prior models ($\S$6.6, relating dropout to works by MacKay, Nowlan, and Hinton from 1992).
- Contents (PDF, 36K)
- Chapter 1: The Importance of Knowing What We Don't Know (PDF, 393K)
- Chapter 2: The Language of Uncertainty (PDF, 136K)
- Chapter 3: Bayesian Deep Learning (PDF, 302K)
- Chapter 4: Uncertainty Quality (PDF, 2.9M)
- Chapter 5: Applications (PDF, 648K)
- Chapter 6: Deep Insights (PDF, 939K)
- Chapter 7: Future Research (PDF, 28K)
- Bibliography (PDF, 72K)
- Appendix A: KL condition (PDF, 71K)
- Appendix B: Figures (PDF, 2M)
- Appendix C: Spike and slab prior KL (PDF, 28K)
One of the nice new practical results, in section $\S$4.1, affects function visualisation. It's a minor change that has gone unnoticed until now, but one that is significant for understanding our functions.
Function visualisation
There are two factors at play when visualising uncertainty in dropout Bayesian neural networks: the dropout masks and the dropout probability of the first layer. Uncertainty depictions in my previous blog posts drew new dropout masks for each test point, which is equivalent to drawing a new prediction from the predictive distribution for each test point $-2 \leq \x \leq 2$. More specifically, for each test point $\x_i$ we drew a set of network parameters from the dropout approximate posterior $\boh_{i} \sim q_\theta(\bo)$, and conditioned on these parameters we drew a prediction from the likelihood $\y_i \sim p(\y | \x_i, \boh_{i})$. Since the predictive distribution satisfies
\begin{align*}
p&(\y_i | \x_i, \X_\train, \Y_\train) \\
&= \int p(\y_i | \x_i, \bo) p(\bo | \X_\train, \Y_\train) \text{d} \bo \\
&\approx \int p(\y_i | \x_i, \bo) q_\theta(\bo) \text{d} \bo \\
&=: q_\theta(\y_i | \x_i)
\end{align*}
we have that $\y_i$ is a draw from an approximation to the predictive distribution.
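As a small worked illustration (not taken from the thesis text), repeating this two-step draw $T$ times for the same test point gives plain Monte Carlo estimates of the first two moments of $q_\theta(\y_i | \x_i)$. The symbols $T$, $\y_{i,t}$, $\hat{\mu}_i$ and $\hat{\sigma}^2_i$ are notation introduced only for this sketch, and a scalar output is assumed:
\begin{align*}
\boh_{t} &\sim q_\theta(\bo), \quad \y_{i,t} \sim p(\y | \x_i, \boh_{t}), \quad t = 1, \dots, T \\
\hat{\mu}_i &= \frac{1}{T} \sum_{t=1}^{T} \y_{i,t} \\
\hat{\sigma}^2_i &= \frac{1}{T} \sum_{t=1}^{T} \left( \y_{i,t} - \hat{\mu}_i \right)^2
\end{align*}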
This process is equivalent to drawing a new function for each test point, which results in extremely erratic depictions that have peaks at different locations (seen in figure A, taken from the previous blog post). Drawing a new function for each test point makes no difference if all we care about is obtaining the predictive mean and predictive variance (actually, for these two quantities this process is preferable to the one I will describe below), but this process does not result in draws from the induced distribution over functions. This is because different network parameters correspond to different functions, and a distribution over the network parameters therefore induces a distribution over functions. Under a Bayesian interpretation, we identify a draw $\boh$ from the approximate posterior over network parameters $q_\theta(\bo)$ with a single function draw.
To get a draw from our induced posterior over functions, we need to sample a single network for all test points, rather than sample a separate prediction for each test point. To visualise our predictive distribution in a more appealing way we can therefore draw a single network for the entire test set. With dropout, this can be done by drawing a single set of masks to be used with all test points. Our induced functions look very different now (seen in figure B, and in the demo below). In the new visualisation the functions are smooth, even though they are drawn from a dropout approximating distribution (which randomly sets whole rows of the weight matrix to zero). Note that to calculate the predictive mean and predictive variance, using different masks is actually preferable since it results in lower variance estimators.
Another important factor affecting visualisation is the dropout probability of the first layer. In the previous posts we depicted scalar functions and set all dropout probabilities to $0.1$. As a result, with probability $0.1$, the sampled functions from the posterior would be identically zero. This is because a zero draw from the Bernoulli distribution in the first layer together with a scalar input leads the model to completely drop its input (explaining the points where the function touches the $x$-axis in figure A). This is a behaviour we might not believe the posterior should exhibit (especially when a single set of masks is drawn for the entire test set), and we could change this by setting a different probability for the first layer. Setting $p_1 = 0$ for example is identical to placing a delta approximating distribution over the first weight layer.
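The difference between the two ways of drawing masks can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code behind the figures or the demo: the network is a small untrained ReLU MLP with random weights, and the helper names (`init_network`, `sample_masks`, `forward`), the layer width, and the omission of the usual dropout rescaling are all choices made here for illustration. Masking a layer's input units is the same as zeroing the corresponding rows of its weight matrix.

```python
import numpy as np

def init_network(rng, n_hidden=50):
    """Random (untrained) weights for a hypothetical 1 -> H -> H -> 1 MLP."""
    scale = 1.0 / np.sqrt(n_hidden)
    return {
        "W1": rng.normal(0.0, 1.0, size=(1, n_hidden)),
        "b1": rng.normal(0.0, 1.0, size=n_hidden),
        "W2": rng.normal(0.0, scale, size=(n_hidden, n_hidden)),
        "b2": np.zeros(n_hidden),
        "W3": rng.normal(0.0, scale, size=(n_hidden, 1)),
        "b3": np.zeros(1),
    }

def sample_masks(rng, n_rows, n_hidden, p1, p):
    """Bernoulli keep-masks, one row per mask set.

    n_rows = number of test points -> a new mask set for each point.
    n_rows = 1                     -> one mask set shared by all points.
    p1 is the drop probability for the (scalar) input, p for hidden units.
    """
    return (
        rng.random((n_rows, 1)) >= p1,
        rng.random((n_rows, n_hidden)) >= p,
        rng.random((n_rows, n_hidden)) >= p,
    )

def forward(params, x, masks):
    """Forward pass with the given masks applied to each layer's input."""
    m0, m1, m2 = masks
    h = np.maximum(0.0, (x * m0) @ params["W1"] + params["b1"])
    h = np.maximum(0.0, (h * m1) @ params["W2"] + params["b2"])
    return ((h * m2) @ params["W3"] + params["b3"]).ravel()

rng = np.random.default_rng(0)
params = init_network(rng)
x = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)

# New masks for every test point: each output comes from a different
# function, so the plotted curve is erratic (as in figure A).
per_point = forward(params, x, sample_masks(rng, len(x), 50, p1=0.1, p=0.1))

# One mask set for the whole test set, with p1 = 0 so the input is never
# dropped: a single function draw evaluated at every test point.
shared_masks = sample_masks(rng, 1, 50, p1=0.0, p=0.1)
function_draw = forward(params, x, shared_masks)
```

Plotting `per_point` and `function_draw` against `x` should show the contrast between the two depictions. Passing an all-`False` first mask makes the output independent of `x`: a constant here because this sketch keeps bias terms, and identically zero in a network without them.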
In the demo below we use $p_1 = 0$ and depict function draws: draws from the approximate predictive distribution evaluated on the entire test set, $q_{\theta_i}(\Y | \X, \boh_i)$ with $\boh_i \sim q_{\theta_i}(\bo)$. The draws are shown as the variational parameters $\theta_i$ change and adapt to minimise the divergence to the true posterior (with old samples disappearing after 20 optimisation steps). You can change the function the data is drawn from (with two functions, one from the last blog post and one from the appendix in this paper), and the model used (a homoscedastic model or a heteroscedastic model, see section $\S$4.6 in the thesis for example or this blog post).
Acknowledgements
To finish this blog post I would like to thank the people who helped through comments and discussions during the writing of the various papers composing the thesis above. I would like to thank (in alphabetical order) Christof Angermueller, Yoshua Bengio, Phil Blunsom, Yutian Chen, Roger Frigola, Shane Gu, Alex Kendall, Yingzhen Li, Rowan McAllister, Carl Rasmussen, Ilya Sutskever, Gabriel Synnaeve, Nilesh Tripuraneni, Richard Turner, Oriol Vinyals, Adrian Weller, Mark van der Wilk, Yan Wu, and many other reviewers for their helpful comments and discussions. I would further like to thank my collaborators Rowan McAllister, Carl Rasmussen, Richard Turner, Mark van der Wilk, and my supervisor Zoubin Ghahramani.
Lastly, I would like to thank Google for supporting three years of my PhD with the Google European Doctoral Fellowship in Machine Learning, and Qualcomm for supporting my fourth year with the Qualcomm Innovation Fellowship.
PS. there might be some easter eggs hidden in the introduction :)