{"title": "Bayesian Model Comparison and Backprop Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 839, "page_last": 846, "abstract": null, "full_text": "Bayesian Model Comparison and Backprop Nets \n\nDavid J.C. MacKay\u00b7 \n\nComputation and Neural Systems \n\nCalifornia Institute of Technology 139-14 \n\nPasadena CA 91125 \n\nmackayGras.phy.cam.ac.uk \n\nAbstract \n\nThe Bayesian model comparison framework is reviewed, and the Bayesian \nOccam's razor is explained. This framework can be applied to feedforward \nnetworks, making possible (1) objective comparisons between solutions \nusing alternative network architectures; (2) objective choice of magnitude \nand type of weight decay terms; (3) quantified estimates of the error bars \non network parameters and on network output. The framework also gen(cid:173)\nerates a measure of the effective number of parameters determined by the \ndata. \nThe relationship of Bayesian model comparison to recent work on pre(cid:173)\ndiction of generalisation ability (Guyon et al., 1992, Moody, 1992) is dis(cid:173)\ncussed. \n\n1 BAYESIAN INFERENCE AND OCCAM'S RAZOR \n\nIn science, a central task is to develop and compare models to account for the data \nthat are gathered. Typically, two levels of inference are involved in the task of \ndata modelling. At the first level of inference, we assume that one of the models \nthat we invented is true, and we fit that model to the data. Typically a model \nincludes some free parameters; fitting the model to the data involves inferring what \nvalues those parameters should probably take, given the data. This is repeated for \neach model. The second level of inference is the task of model comparison. Here, \n\n\u00b7Current address: Darwin College, Cambridge CB3 9EU, U.K. \n\n839 \n\n\f840 \n\nMacKay \n\nwe wish to compare the models in the light of the data, and assign some sort of \npreference or ranking to the alternatives.1 \n\nFor example, consider the task of interpolating a noisy data set. The data set \ncould be interpolated using a splines model, polynomials, or feedforward neural \nnetworks. At the first level of inference, we find for each individual model the best \nfit interpolant (a process sometimes known as 'learning'). At the second level of \ninference we want to rank the alternative models and state for our particular data \nset that, for example, 'splines are probably the best interpolation model', or 'if the \ninterpolant is modelled as a polynomial, it should probably be a cubic', or 'the best \nneural network for this data set has eight hidden units'. \n\nModel comparison is a difficult task because it is not possible simply to choose the \nmodel that fits the data best: more complex models can always fit the data better, \nso the maximum likelihood model choice leads us inevitably to implausible over(cid:173)\nparameterised models which generalise poorly. 'Occam's razor' is the principle that \nstates that unnecessarily complex models should not be preferred to simpler ones. \nBayesian methods automatically and quantitatively embody Occam's razor (Gull, \n1988, Jeffreys, 1939), without the introduction of ad hoc penalty terms. Complex \nmodels are automatically self-penalising under Bayes' rule. \n\nLet us write down Bayes' rule for the two levels of inference described above . As(cid:173)\nsume each model 1ii has a vector of parameters w. A model is defined by its \nfunctional form and two probability distributions: a 'prior' distribution P(WI1ii) \nwhich states what values the model's parameters might plausibly take; and the pre(cid:173)\ndictions P(Dlw, 1ii) that the model makes about the data D when its parameters \nhave a particular value w. Note that models with the same parameterisation but \ndifferent priors over the parameters are therefore defined to be different models. \n1. Model fitting. At the first level of inference, we assume that one model 1ii \nis true, and we infer what the model's parameters w might be given the data D. \nUsing Bayes' rule, the posterior probability of the parameters w is: \n\n(1) \n\nIn words: \n\n. \n\nPostenor = \n\nLikelihood X Prior \n\n. d \n\nEVI ence \n\nIt is common to use gradient-based methods to find the maximum of the posterior, \nwhich defines the most probable value for the parameters, W MP ; it is then common \nto summarise the posterior distribution by the value of W MP , and error bars on \nthese best fit parameters. The error bars are obtained from the curvature of the \nposterior; writing the Hessian A = -\\7\\7 log P(wID, 1ii) and Taylor-expanding the \nlog posterior with ~w = w - W MP , \n\n(2) \n\n1 Note that both levels of inference are distinct from decision theory. The goal of infer(cid:173)\n\nence is, given a defined hypothesis space and a particular data set, to assign probabilities \nto hypotheses. Decision theory chooses between alternative actions on the basis of these \nprobabilities so as to minimise the expectation of a 'loss function'. \n\n\fBayesian Model Comparison and Backprop Nets \n\n841 \n\nw \n\nFigure 1: The Occam factor \n\nThis figure shows the quantities that determine the Occam factor for a hypothesis 1ii hav(cid:173)\nin\u00a7 a single parameter w. The prior distribution (dotted line) for the parameter has width \nIl. w. The posterior distribution (solid line) has a single peak at WMP with characteristic \nwidth Il.w. The Occam factor is :b~. \n\nwe see that the posterior can be locally approximated as a gaussian with covariance \nmatrix (error bars) A -1. \n\n2. Model comparison. At the second level of inference, we wish to infer which \nmodel is most plausible given the data. The posterior probability of each model is: \n(3) \nNotice that the objective data-dependent term P(DI1id is the evidence for 1ii, \nwhich appeared as the normalising constant in (1). The second term, P(1ii ), is a \n'subjective' prior over our hypothesis space. Assuming that we have no reason to \nassign strongly differing priors P(1ii) to the alternative models, models 1ii are \nranked by evaluating the evidence. \n\nP(1ii ID) ex P(DI1ii )P(1ii ) \n\nThis concept is very general: the evidence can be evaluated for parametric and \n'non-parametric' models alike; whether our data modelling task is a regression \nproblem, a classification problem, or a density estimation problem, the evidence \nis the Bayesian's transportable quantity for comparing alternative models. In all \nthese cases the evidence naturally embodies Occam's razor, as we will now see. The \nevidence is the normalising constant for equation (1): \n\nP(D l1ii) = J P(Dlw, 1ii)P(wl1id dw \n\n(4) \n\nFor many problems, including interpolation, it is common for the posterior \nP(wID,1ii) ex P(Dlw,1ij)P(wl1ii) to have a strong peak at the most probable \nparameters W MP (figure 1). Then the evidence can be approximated by the height \nof the peak of the integrand P(Dlw, 1ii)P(wl1ii) times its width, ~w: \n\nEvidence \n\n-\n-\n\nP(D IwMP ' 1ii) \n, \nBest fit likelihood \n\ny \n\n~ \n\nP(wMPI1ii) ~w \n, \nOccam factor \n\ny \n\n~ \n\n(5) \n\nThus the evidence is found by taking the best fit likelihood that the model can \nachieve and multiplying it by an 'Occam factor' (Gull, 1988), which is a term with \nmagnitude less than one that penalises 1ii for having the parameter w. \n\n\f842 \n\nMacKay \n\nInterpretation of the Occam factor \n\nThe quantity ~ w is the posterior uncertainty in w. Imagine for simplicity that \nthe prior P(wlllj) is uniform on some large interval ~ow (figure 1), so that \nP(wMPllli) = AJW; then \n\n~w \nOccam factor = ~ ow' \n\ni.e. the ratio of the posterior accessible volume oflli's parameter space to \nthe prior accessible volume (Gull, 1988, Jeffreys, 1939). The log of the Occam \nfactor can be interpreted as the amount of information we gain abou t the model lli \nwhen the data arrive. \n\nTypically, a complex or flexible model with many parameters, each of which is free \nto vary over a large range ~ ow, will be penalised with a larger Occam factor than \na simpler model. The Occam factor also penalises models which have to be finely \ntuned to fit the data. Which model achieves the greatest evidence is determined \nby a trade-off between minimising this natural complexity measure and minimising \nthe data misfit. \n\nOccam factor for several parameters \n\nIf w is k-dimensional, and if the posterior is well approximated by a gaussian, the \nOccam factor is given by the determinant of the gaussian's covariance matrix: \n\nwhere A = - VV log P(wID, lli), the Hessian which we already evaluated when we \ncalculated the error bars on W MP . As the amount of data collected, N, increases, \nthis gaussian approximation is expected to become increasingly accurate on account \nof the central limit theorem. \n\nThus Bayesian model selection is a simple extension of maximum likelihood model \nselection: the evidence is obtained by multiplying the best fit likelihood \nby the Occam factor. To evaluate the Occam factor all we need is the Hessian \nA, if the gaussian approximation is good. Thus the Bayesian method of model \ncomparison by evaluating the evidence is computationally no more demanding than \nthe task of finding for each model the best fit parameters and their error bars. \n\n2 THE EVIDENCE FOR NEURAL NETWORKS \n\nNeural network learning procedures include a host of control parameters such as \nthe number of hidden units and weight decay rates. These parameters are difficult \nto set because there is an Occam's razor problem: if we just set the parameters \nso as to minimise the error on the training set, we would be led to over-complex \nand under-regularised models which over-fit the data. Figure 2a illustrates this \nproblem by showing the test error versus the training error of a hundred networks \nof varying complexity all trained on the same interpolation problem. \n\n\fBayesian Model Comparison and Backprop Nets \n\n843 \n\nOf course if we had unlimited resources, we could compare these networks by mea(cid:173)\nsuring the error on an unseen test set or by similar cross-validation techniques. \nHowever these techniques may require us to devote a large amount of data to the \ntest set, and may be computationally demanding. If there are several parameters \nlike weight decay rates, it is preferable if they can be optimised on line. \n\nUsing the Bayesian framework, it is possible for all our data to have a say in both the \nmodel fitting and the model comparison levels of inference. We can rank alternative \nneural network solutions by evaluating the 'evidence'. Weight decay rates can also \nbe optimised by finding the 'most probable' weight decay rate. Alternative weight \ndecay schemes can be compared using the evidence. The evidence also makes it \npossible to compare neural network solutions with other interpolation models, for \nexample, splines or radial basis functions, and to choose control parameters such \nas spline order or RBF kernel. The framework can be applied to classification \nnetworks as well as the interpolation networks discussed here. For details of the \ntheoretical framework (which is due to Gull and Skilling (1989\u00bb and for more \ncomplete discussion and bibliography, MacKay (1991) should be consulted. \n\n2.1 THE PROBABILISTIC INTERPRETATION \nFitting a backprop network to a data set D = {x, t} often involves minimising an \nobjective function M(w) = {3ED(W; D) + aEw(w). The first term is the data er(cid:173)\nror, for example ED = L !(Y -\nt)2, and the second term is a regulariser (weight \ndecay term), for example Ew = L !w~ . (There may be several regularisers with \nindependent constants {a c}. The Bayesian framework also covers that case.) A \nmodel 1\u00a3 has three components {A,N, 'R}: The architecture A specifies the func(cid:173)\ntional dependence of the input-output mapping on the network's parameters w. \nThe noise model N specifies the functional form of the data error. Within the \nprobabilistic interpretation (Tishby et ai., 1989), the data error is viewed as relat(cid:173)\ning to a likelihood, P(Dlw,{3,A,N) = exp(-{3ED )/ZD. For example, a quadratic \nED corresponds to the assumption that the distribution of errors between the data \nand the true interpolant is Gaussian, with variance u~ = 1/ {3. Lastly, the regu(cid:173)\nlariser 'R, with associated regularisation constant a, is interpreted as specifying a \nprior on the parameters w, P(wla,A, 'R) = exp( -aEw). For example, the use of \na plain quadratic regulariser corresponds to a Gaussian prior distribution for the \nparameters. \n\nGiven this probabilistic interpretation, interpolation with neural networks can then \nbe decomposed into three levels of inference: \n{3 1\u00a3.) = P(Dlw, {3, 1\u00a3j)P(wla, 1\u00a3i) \nP( ID \n\nFitting a regularised model \n\n1 \n\n2a Setting regularisation con(cid:173)\nstants and estimating nOIse \nlevel \n\n2 Model comparison \n\n, \n\n, a \n\na \n\n,a, , \n\nw \nP(a {3ID 1\u00a3-) = P(Dla,{3,1\u00a3i)P(a, {311\u00a3i) \n\nP(Dla,{3,1id \n\nP(DI1ij) \n\nBoth levels 2a and 2 require Occam's razor. For both levels the key step is to \nevaluate the evidence P(Dla,{3,1\u00a3), which, within the quadratic approximation \n\n\f844 \n\nMacKay \n\neo \n\n.. \nI ... \nI \n... \n\nuo \n\n... \n.. . 'i'~ \u2022 \n... \nJ ... \ni \nJ \n... \n.. \n\ntao \n\no \n\n. . \n\n. . .. ,~. ' .. \n. . '\\ .. :, \n\n, \n\u2022 I \n\nb) JOe \n\n... \n\n... \n\n\u2022 \n\n\u2022 \n\n_.-\n... \n\n.. \n\nIII \n\nFigure 2: The evidence solves the neural networks' Occam problem \n\na) Test error vs. data error. Each point represents the performance of a single trained \nneural network on the training set and on the test set. This graph illustrates the fact that \nthe best generalisation is not achieved by the models which fit the training data best. \nb) Log Evidence vs. test error. \n\nX!. = 2aEw = 'Y \n\nX~ = 2,BED = N - 'Y, \n\naround w MP , is given by: \nlog P(Dla,,B, 1-\u00a3) = -aEW -,BE~P -210g det A-log Zw(a)-log ZD (,B) + '2 log 27r. \n(1) \nAt level 2a we can find the most probable value for the regularisation constant a \nand noise level 1/,B by differentiating (1) with respect to a and ,B. The result is \n\nk \n\n1 \n\n(8) \nwhere'Y is 'the effective number of parameters determined by the data' (Gull, 1989), \n\nand \n\n'Y = k - aTraceA: = ~ A \n\n-1 \n\nAa \n\n\" \na=1 a + a \n\nk \n\n' \n\n(9) \n\nwhere Aa are the eigenvalues of 'V''V' ,BED in the natural basis of Ew. Each term \nin the sum is a number between 0 and 1 which measures how well one parame(cid:173)\nter is determined by the data rather than by the prior. The expressions (8), or \napproximations to them, can be used to re-estimate weight decay rates on line. \nThe central quantity in the evidence and in 'Y is the inverse hessian kl, which \ndescribes the error bars on the parameters w. From this we can also obtain error \nbars on the outputs of a network (Denker and Le Cun, 1991, MacKay, 1991). These \nerror bars are closely related to the predicted generalisation error calculated by \nLevin et al.(1989). In (MacKay, 1991) the practical utility of these error bars is \ndemonstrated for both regression and classification networks. \n\nFigure 2b shows the Bayesian 'evidence' for each of the solutions in figure 2a against \nthe test error. It can be seen that the correlation between the evidence and the \ntest error is extremely good. This good correlation depends on the model being \nwell-matched to the problem; when an inconsistent weight decay scheme was used \n(forcing all weights to decay at the same rate), it was found that the correlation be(cid:173)\ntween the evidence and the test error was much poorer. Such comparisons between \nBayesian and traditional methods are powerful tools for human learning. \n\n\fBayesian Model Comparison and Backprop Nets \n\n845 \n\n3 RELATION TO THEORIES OF GENERALISATION \n\nThe Bayesian 'evidence' framework assesses within a well-defined hypothesis space \nhow probable a set of alternative models are. However, what we really want to \nknow is how well each model is expected to generalise. Empirically, the correlation \nbetween the evidence and generalisation error is surprisingly good. But a theoretical \nconnection linking the two is not yet established. Here, a brief discussion is given \nof similarities and differences between the evidence and quantities arising in recent \nwork on prediction of generalisation error. \n\n3.1 RELATION TO MOODY'S 'G.P.E.' \n\nMoody's (1992) 'Generalised Prediction Error' is a generalisation of Akaike's \n'F .P.E.' to non-linear regularised models. The F .P.E. is an estimator of generalisa(cid:173)\ntion error which can be derived without making assumptions about the distribution \nof errors between the data and true interpolant, and without assuming a known \nclass to which the true interpolant belongs. The difference between F .P.E. and \nG.P.E. is that the total number of parameters k in F.P.E. is replaced by an effective \nnumber of parameters, which is in fact identical to the quantity -y arising in the \nBayesian analysis (9). If ED is as defined earlier, \n\nG.P.E. = (ED + u~-y) IN. \n\n(10) \n\nLike the log evidence, the G .P.E. has the form of the data error plus a term that \npenalises complexity. However, although the same quantity -y arises in the Bayesian \nanalysis, the Bayesian Occam factor does not have the same scaling behaviour as the \nG.P.E. term (see discussion below). And empirically, the G.P.E. is not always a good \npredictor of generalisation. The reason for this is that in the derivation of the G.P.E., \nit is assumed that the distribution over x values is well approximated by a sum of \ndelta functions at the samples in the training set. This is equivalent to assuming test \nsamples will be drawn only at the x locations at which we have already received data. \nThis assumption breaks down for over-parameterised and over-flexible models. An \nadditional distinction that between the G.P.E. and the evidence framework is that \nthe G.P.E. is defined for regression problems only; the evidence can be evaluated \nfor regression, classification and density estimation models. \n\n3.2 RELATION TO THE EFFECTIVE V-C DIMENSION \n\nRecent work on 'structural risk minimisation' (Guyon et al., 1992) utilises empirical \nexpressions of the form: \n\nE \n\nlog(NI-y) + C2 \n\nN l-y \n\n(11) \n\n+ Cl \n\n,...., E IN \n\ngen -\n\nD \n\nwhere -y is the 'effective V-C dimension' of the model, and is identical to the quan(cid:173)\ntity arising in (9). The constants Cl and C2 are determined by experiment. The \nstructural risk theory is currently intended to be applied only to nested families of \nclassification models (hence the abscence of (3: ED is dimensionless) with monotonic \neffective V-C dimension, whereas the evidence can be evaluated for any models. \nHowever, it is very interesting that the scaling behaviour of this expression (11) is \n\n\f846 \n\nMacKay \n\nidentical to the scaling behaviour of the log evidence (1), subject to the following \nassumptions. Assume that the value of the regularisation constant satisfies (8). \nAssume furthermore that the significant eigenvalues (Aa > a) scale as Aa -- Na/'Y \n(It can be confirmed that this scaling is obtained for example in the interpolation \nmodels consisting of a sequence of steps of independent heights, as we vary the \nnumber of steps.) Then it can be shown that the scaling of the log evidence is: \n\n1 \n\n-log P(Dla,,B, 1i) '\" ,BE~P + 2 ('Y log(N /'Y) + 'Y) \n\n(12) \n(Readers familiar with MDL will recognise the dominant 1 log N term; MDL and \nBayes are identical.) Thus the scaling behaviour of the lo~ evidence is identical to \nthe structural risk minimisation expression (11), if Cl = - and C2 = 1. I. Guyon \n(personal communication) has confirmed that the empirically determined values for \nCl and C2 are indeed close to these Bayesian values. It will be interesting to try and \nunderstand and develop this relationship. \n\nAcknowledgements \n\nThis work was supported by studentships from Caltech and SERC, UK. \n\nReferences \n\nJ.S. Denker and Y. Le Cun (1991). 'Transforming neural-net output levels to prob(cid:173)\nability distributions', in Advances in neural information processing systems 3, ed. \nR.P. Lippmann et al., 853-859, Morgan Kaufmann. \n'Bayesian inductive inference and maximum entropy', in Maxi(cid:173)\nS.F. Gull (1988). \nmum Entropy and Bayesian Methods in science and engineering, vol. 1: Founda(cid:173)\ntions, G.J. Erickson and C.R. Smith, eds., Kluwer. \nS.F. Gull (1989). 'Developments in Maximum entropy data analysis', in J. Skilling, \ned., 53-11. \nI. Guyon, V.N. Vapnik, B.E. Boser, L.Y. Bottou and S.A. Solla (1992). 'Structural \nrisk minimization for character recognition', this volume. \nH. Jeffreys (1939). Theory of Probability, Oxford Univ. Press. \nE. Levin, N. Tishby and S. Solla (1989). \ngeneralization in layered neural networks', in COLT '89: 2nd workshop on compu(cid:173)\ntational learning theory, 245-260. \nD.J.C. MacKay (1991) 'Bayesian Methods for Adaptive Models', Ph.D. Thesis, Cal(cid:173)\ntech. Also 'Bayesian interpolation', 'A practical Bayesian framework for backprop \nnetworks', 'Information-based objective functions for active data selection', to ap(cid:173)\npear in Neural computation. And 'The evidence framework applied to classification \nnetworks', submitted to Neural computation. \nJ .E. Moody (1992). \nnonlinear learning systems', this volume. \nN. Tishby, E. Levin and S.A. Solla (1989). 'Consistent inference of probabilities in \nlayered networks: predictions and generalization', in Proc. [lCNN, Washington. \n\n'Generalization, regularization and architecture selection in \n\n'A statistical approach to learning and \n\n\f", "award": [], "sourceid": 488, "authors": [{"given_name": "David", "family_name": "MacKay", "institution": null}]}