On the Quantitative Analysis of Deep Belief Networks (2008)

Authors

Abstract

Deep Belief Networks (DBN's) are generative models that contain many layers of hidden variables. Efficient greedy algorithms for learning and approximate inference have allowed these models to be applied successfully in many application domains. The main building block of a DBN is a bipartite undirected graphical model called a restricted Boltzmann machine (RBM). Due to the presence of the partition function, model selection, complexity control, and exact maximum likelihood learning in RBM's are intractable. Annealed Importance Sampling (AIS), can be used to efficiently estimate the partition function of an RBM. We present a novel AIS scheme for comparing RBM's with different architectures. We further show how an AIS estimator, along with approximate inference, can be used to estimate a lower bound on the log-probability that a DBN model with multiple hidden layers assigns to the test data. This is, to our knowledge, the first step towards obtaining quantitative results that would allow us to directly assess the performance of Deep Belief Networks as generative models of data.

Discussion

Olivier Delalleau, 2008/07/16 15:52

Since it is argued that simple Monte-Carlo estimation would not work to estimate a (ratio of) partition(s) function(s), what's the justification for using such estimation in eq.23 in order to estimate the bound for a DBN?

I can see how it could work for a 2-layer DBN with a small hidden layer h1 as what is used in the experiments, since Q(h1|v) is likely to be close enough to p(h|v). However, as you add more (and bigger) layers, isn't Q going to be driven too far away from P for (23) to be a reliable estimator? If not, why?

Iain Murray, 2008/07/17 13:36

First consider a generic problem that is naturally expressed as an average under a distribution:
Graph or Graph
Example 1: x could be hidden variables, where f(x) is a quantity of interest for a given set of hiddens. Example 2: x could be parameters in a Bayesian inference problem, p(x) the posterior and f(x) a predictive distribution for those parameters.

Simple Monte Carlo often works well, even on complex problems. The assumption is that f(x) does not contain such extreme values that the sum is dominated by a region with tiny probability mass. Notice that in Bayesian inference problems and statistical physics the usual problem is finding a Markov chain to sample from p(x), not that the underlying Monte Carlo estimator doesn't work. Predictions are usually dominated by hiddens/parameters that are reasonably probable, not those hidden in regions of tiny tiny probability mass.

Why does simple Monte Carlo estimation not work for a partition function? The original problem is not naturally an expectation:
Graph
One can squint at it and assert that it is an expectation of exp(-E(x;W)) under the uniform distribution over all x. However, we have intentionally learned parameters W that put the bulk of probability mass on a tiny tiny subset of all values x, the ones that look like our training data. The sum isn't naturally described by typical values from the uniform distribution.

Importance Sampling is a way of coercing a problem to look like an expectation when it didn't look like one to begin with.
GraphLatex
One problem is that occasionally a value of x with low probability under q will be drawn, and 1/q(x) will be very large. The whole construction is set up to create high variance estimators. The situation is saved if p(x) Graph exp(-E(x;W)) is similar to q(x), because then whenever 1/q(x) is huge the “integrand” exp(-E(x;W))/q(x) will be controlled enough to make the sum well described by typical values of q(x).

So, to answer the question: computing Z by Monte Carlo involves an unnatural importance sampling estimator that must be very carefully constructed if typical samples are to describe the sum being approximated. On the other hand, equation (23) is naturally an expectation under a distribution that we can sample from directly. No importance sampling is required. Also the function being averaged, ln p^*(v,h^1), is not pathologically extreme. It might vary over 100's of nats, but not the exp(100's) of the corresponding probabilities. It is the numbers of size exp(100's) that importance sampling cavalierly multiplies and divides by.

You are quite right that as you add more layers Q is driven further from P. The real problem here is that the variational bound may become quite loose, in which case the method will be overly pessimistic about the DBN's test performance.

Olivier Delalleau, 2008/07/18 11:00

Thanks, it does make sense :) What I had mostly failed to notice is how eq. 23 was computing an expectation of log-probabilities, which behaves much more nicely than probabilities themselves.

Enter your comment (wiki syntax is allowed):
DUMPF
 
paper/2008/573.txt · Last modified: 2009/05/24 18:48 (external edit)
 
Driven by DokuWiki