A Product of Experts model (PoE) (Hinton 2002) combines a number of individual component models (the experts) by taking their product and normalizing the result. Each expert is defined as a possibly unnormalized probabilistic model \(f(x)\) over its input space.
\[\tag{1} P(x|\{\theta_j\})=\frac{1}{Z}\prod_{j=1}^M f_j(x|\theta_j) \]
with \( Z=\int\mbox{d}x~\prod_{j=1}^M f_j(x|\theta_j)\ .\)
PoEs stand in contrast to Mixture Models which combine expert models additively,
\[\tag{2} P(x|\{\theta_j\})=\sum_{j=1}^M \alpha_j p_j(x|\theta_j) \]
where each component model \(p_j(x)\) is normalized over \(x\) and \( \sum_{j=1}^M ~\alpha_j = 1 \)
Note that Mixture of Expert Models are usually associated with conditional models where the experts are of the form \(p(y|x)\) and the mixture coefficients \(\alpha(x)\) (known as gating functions) may depend on \(x\) as well. Conditional PoEs may be defined as well.
One can qualitatively understand the difference between mixtures and products by observing that a mixture distribution can have high probability for event \(x\) when only a single expert assigns high probability to that event. In contrast, a product can only have high probability for an event \(x\) when no expert assigns an especially low probability to that event. Hence, metaphorically speaking, a single expert in a mixture has the power to pass a bill while a single expert in a product has the power to veto it.
Put another way, each component in a product represents a soft constraint, while each expert in a mixture represents a soft template or prototype. For an event to be likely under a product model, all constraints must be (approximately) satisfied, while an event is likely under a mixture model if it (approximately) matches with any single template. Hence, to sculpt the overall probability distribution of a mixture, each expert adds a lump of probability mass which is usually localised to one region, while to sculpt the overall probability of a product, each expert scales the probability at each point by a different factor. A product can also be viewed as adding together lumps in the log probability domain. This essential difference can result in much sharper boundaries, especially for high-dimensional input spaces (Hinton 2002).
Contents |
Given data \(\{x_n\},~n=1..N\) with \(x\in {R}^d\ ,\) one can use the log-likelihood as an objective function to train a PoE,
\[\tag{3} L(\{\theta_j\}|\{x_n\})=\sum_{n=1}^N\sum_{j=1}^M \log f_j(x_n|\theta_j)-N\log Z \]
Denoting the gradient of the
objective w.r.t. \(\theta_j\) with \(\nabla_j
L\) one can compute the following gradient,
\[\tag{4} \nabla_jL = \sum_{n=1}^N \nabla_j \log f_j(x_n) - N \langle \nabla_j \log f_j(x) \rangle_{P(x)} \]
where
\(\langle\cdot\rangle_{P(x)}\) denotes taking the average
w.r.t. \(P(x)\ .\) Learning is achieved by changing the
parameters incrementally according to the following update rule,
\[\tag{5} \theta_j \rightarrow \theta_j + \eta \nabla_j L \]
where \(\eta\) represents the learning rate. Learning
efficiency can usually be improved by using a stochastic
approximation of the full gradient based on a single data-case or
a few data-cases (a mini-batch). Another effective heuristic to speed up learning
is to add a "momentum term" to the gradient (Plaut et al, 1986).
The first term of the gradient (Eqn.(4) can be interpreted as increasing the probability of expert \(i\) on the dataset. The second term on the other hand can be interpreted as decreasing the probability of expert \(i\) in regions of input space where the model assigns high probability. When these terms balance, learning has converged to a local maximum of the log-likelihood.
The simplicity of the gradient in eqn.(4) is deceptive: it requires the evaluation of an intractable average over \(P(x)\ .\) For most interesting models this average requires methods like MCMC sampling to approximate it. But MCMC sampling is computationally expensive and results in high-variance estimates of the required averages. A cheaper, lower-variance alternative was proposed by (Hinton 2002) under the name contrastive divergence (CD). The idea is to run \(N\) samplers in parallel, one for each data-case in the (mini-)batch. These samplers must be initialized at the respective data-cases and will move towards equilibrium using MCMC sampling. After only a few steps of sampling, long before the MCMC converges, there is usually sufficient signal in the population of samples to change the parameters. A surrogate learning rule can be derived by replacing the log-likelihood with a new objective: the contrastive divergence \(KL(P_{data}||P_{model})-KL(P_k||P_{model})\) where \(P_{data}\) is the empirical distribution \(P_{data}=\frac{1}{N}\sum_n \delta(x-x_n)\ ,\) \(P_{model}\) is the current estimate of the model distribution and \(P_k\) is the distribution based on \(k\) steps of sampling. Taking gradients and ignoring a term which is usually very small, the new learning rule is almost identical to the one based on the log-likelihood, but using the approximation \[ \langle \log f_j(x)\rangle_{P(x)} \approx \frac{1}{N} \sum_n f_j(x_n^k) \] where \(x_n^k\) is the sample obtained from MCMC sampler \(n\) after \(k\) steps of sampling.
One can view this approximation as trading variance for bias. Thus, at convergence, we expect that the estimates of the parameters will not be equal to those of maximum likelihood learning, but will be slightly biased. To correct this, one can increase \(k\) close to convergence (Carreira-Perpinan and Hinton 2005).
Perhaps the simplest PoE is given by a restricted Boltzmann machine (see Figure 1). In this model there are two layers of binary (0/1) variables where the bottom layer is observed while the top layer remains unobserved or hidden. The joint probability distribution over hidden and observed variables is given as,
\[\tag{6} P(x,h)=\frac{1}{Z} \exp\left(\sum_i \alpha_ix_i + \sum_j \beta_j h_j + \sum_{ij} W_{ij}x_ih_j\right) \]
where the undirected edges in the graphical model in Figure 1 are representing
\(\{W_{ij}\}\ .\) The bias terms are parameterized by
\(\{\alpha_i,\beta_j\}\ .\) Marginalizing over
\(\{h_j\}\) the PoE structure becomes evident,
\[\tag{7} P(x)=\frac{1}{\tilde{Z}} \prod_i\exp(\alpha_ix_i)\prod_j\left(1+\exp(\beta_j+\sum_iW_{ij}x_i)\right) \]
where elements in the first product represent single-variable experts and elements in the second product represent constraints between the input variables.
The conditional Bernoulli expert distributions can be generalized to distributions in the exponential family . The resulting joint model is called an exponential family harmonium (EFH) (Welling et. al. 2004). The joint distribution can be obtained by replacing \(x_i\rightarrow f_i(x_i)\) and \(h_j\rightarrow g_j(h_j)\ .\) where \(f(\cdot)\) and \(g(\cdot)\) are the features for the corresponding exponential family distribution.
The special bipartite structure of the RBM and EFH results in a very efficient Gibbs sampler that alternates between sampling all hidden variables independently given values for the observed variables and vice versa sampling all visible variables independently given values for the hidden variables. The efficient Gibbs sampler directly translates into an efficient contrastive divergence learning algorithm (see previous section Figure 1).
Noiseless Independent Components Analysis (ICA) (Comon 1994) with an equal number of input dimensions and source distributions can be written as a PoE model as follows,
\[\tag{8} P(x|\{w_j\}) = |\det(W)|~\prod_{j=1}^M p_j\left(\sum_i w_{ji} x_i\right) \] where \(W\) is the matrix with elements \(w_{ji}\ .\)
Note that each expert, \(p_j\ ,\) is defined as a distribution on a one-dimensional projection of the input space. One can think of each projection as a "source" and a linear combination of these sources generated the signal. Unless \(W\) is rank deficient, the product is a well defined distribution over the entire input space.
Choosing heavy tailed Student-T distributions as the experts one obtains the general form of the "Products of Student-T" distribution (PoT) (Welling et. al. 2002). The PoT can be represented with the help of auxiliary variables (taking the role of hidden variables) as follows,
\[\tag{9} P(x,h)=\frac{1}{Z} \prod_{j=1}^M\exp\left(-h_j[1+\frac{1}{2} (\sum_iw_{ji}x_i)^2] + (1-\alpha_j)\log h_j \right) \]
where \(P(x|h)\) is
a full covariance Gaussian distribution and \(P(h|x)\) a
product of Gamma distributions.
The PoT becomes different from ICA if one chooses the number of experts to be larger than the number of input dimensions (a.k.a. an over-complete representation). In this case marginal independence between the hidden variables is lost, but conditional independence between the hidden variables is retained. Over-complete variants of ICA that retain marginal independence have also been proposed (Lewicki and Sejnowski 1998). Over-complete ICA models have conditional dependencies between the hidden variables known as explaining away which makes inference difficult. In contrast, for the over-complete PoT model inference over the hidden variables given observations is trivial due to the absence of such conditional dependencies (Teh et. al. 2003).
Instead of the non-Gaussian experts used for ICA, one can also choose an under-complete (\(M<d\)) set of one-dimensional Gaussian experts, i.e. \(p_j\left(\sum_i w_{ji} x_i\right)\) with \(p_j(\cdot)\) Gaussian. Using the fact that the inverse-covariance of the product is equal to the sum of the inverse-covariances of the individual Gaussian experts one can formulate probabilistic principal component analysis (Roweis, 1997; Tipping and Bishop, 1999) or probabilistic minor component analysis (Williams and Agakov, 2002). However, it is also possible to formulate a model that extracts the optimal combination of principal and minor components in the spectrum of the sample covariance matrix. The probabilistic model, known as ``eXtreme Components Analysis (XCA), is described in (Welling et. al. 2003).
Variants of PoEs have been applied under different names to various data-domains: for example the rate-coded RBM to face recognition (Teh and Hinton 2001), the dual wing harmonium to video-track data (Xing et. al. 2005), the rate adapting Poisson model (see Figure 1) to text and image data (Gehler et. al 2006), the product of HMMs model to language data (Brown and Hinton 2001), hierarchical versions of PoE to digits (Hinton et. al 2006), text and collaborative filtering data (Salakhutdinov et. al. 2007).
Internal references
Independent Component Analysis, Mixture of Experts, Pattern Recognition