7.6 Bayesian Linear Regression


2019-07-23

Back to the chapter table of contents

Although ridge regression is a useful way to compute a point estimate, sometimes we want to compute the full posterior over \(\boldsymbol{w}\) and \(\sigma^2\). For simplicity, we will initially assume the noise variance \(\sigma^2\) is known, so we focus on computing \(p(\boldsymbol{w}| \mathcal{D},\sigma^2)\). Then in Section 7.6.3 we consider the general case, where we compute \(p(\boldsymbol{w},\sigma^2|\mathcal{D})\). Throughout, we assume a Gaussian likelihood model. Performing Bayesian inference with a robust likelihood is also possible, but requires more advanced techniques (see Exercise 24.5).

7.6.1 Computing the posterior

In linear regression, the likelihood is given by

\[ \begin{aligned} p(\boldsymbol{y}|\boldsymbol{X},\boldsymbol{w},\mu, \sigma^2) = & \mathcal{N}(\boldsymbol{y}|\mu \boldsymbol{1}_N+\boldsymbol{X}\boldsymbol{w},\sigma^2 \boldsymbol{I}_N) \\ \propto & \exp \left(-\dfrac{1}{2\sigma^2}(\boldsymbol{y}-\mu \boldsymbol{1}_N-\boldsymbol{X}\boldsymbol{w})^T(\boldsymbol{y}-\mu \boldsymbol{1}_N-\boldsymbol{X}\boldsymbol{w})\right) \end{aligned} \tag{7.52-53} \]

where \(\mu\) is an offset term. If the inputs have been centered, so that \(\sum_i{x_{ij}}=0\) for every \(j\), the mean of the output is equally likely to be positive or negative. Let us therefore place an improper prior of the form \(p(\mu) \propto 1\) on \(\mu\), and then integrate it out to get

\[ p(\boldsymbol{y}|\boldsymbol{X},\boldsymbol{w},\sigma^2) \propto \exp \left(-\dfrac{1}{2\sigma^2}\|\boldsymbol{y}-\bar{y} \boldsymbol{1}_N-\boldsymbol{X}\boldsymbol{w}\|_2^2\right) \tag{7.54} \]

where \(\bar{y}=\frac{1}{N}\sum_{i=1}^N{y_i}\) is the empirical mean of the outputs. To simplify notation, we will assume the output has also been centered, and write \(\boldsymbol{y}\) for \(\boldsymbol{y}-\bar{y}\boldsymbol{1}_N\).
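
As a quick illustration of this preprocessing step, here is a minimal NumPy sketch (my own, not from the original text; the data and coefficients are made up) that centers the inputs and outputs so the offset \(\mu\) can be dropped; at prediction time one simply adds \(\bar{y}\) back:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # raw design matrix, N x D
y = 1.7 + X @ np.array([0.5, -0.3, 2.0]) + 0.1 * rng.normal(size=100)

# Center each input column so that sum_i x_ij = 0 for every j,
# and center the outputs so the offset term mu can be dropped.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
```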

The conjugate prior to the above Gaussian likelihood is also a Gaussian, which we will denote by \(p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w}|\boldsymbol{w}_0,\boldsymbol{V}_0)\). Using Bayes' rule for Gaussians, Equation 4.125, the posterior is given by

\[ \begin{aligned} p(\boldsymbol{w}|\boldsymbol{X},\boldsymbol{y},\sigma^2) & \propto \mathcal{N}(\boldsymbol{w}|\boldsymbol{w}_0,\boldsymbol{V}_0)\,\mathcal{N}(\boldsymbol{y}|\boldsymbol{X}\boldsymbol{w},\sigma^2\boldsymbol{I}_N) = \mathcal{N}(\boldsymbol{w}|\boldsymbol{w}_N,\boldsymbol{V}_N) \\ \boldsymbol{w}_N & = \boldsymbol{V}_N\boldsymbol{V}_0^{-1}\boldsymbol{w}_0 + \dfrac{1}{\sigma^2}\boldsymbol{V}_N\boldsymbol{X}^T\boldsymbol{y} \\ \boldsymbol{V}_N^{-1} & = \boldsymbol{V}_0^{-1} + \dfrac{1}{\sigma^2}\boldsymbol{X}^T\boldsymbol{X} \\ \boldsymbol{V}_N & = \sigma^2\left(\sigma^2\boldsymbol{V}_0^{-1} + \boldsymbol{X}^T\boldsymbol{X}\right)^{-1} \end{aligned} \tag{7.55-58} \]

If \(\boldsymbol{w}_0 = \boldsymbol{0}\) and \(\boldsymbol{V}_0 = \tau^2\boldsymbol{I}\), then the posterior mean reduces to the ridge estimate if we define \(\lambda = \sigma^2/\tau^2\). This is because the mean and mode of a Gaussian are the same.
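
To make the connection concrete, here is a hedged NumPy sketch (my own illustration, not code from the book) that computes the Gaussian posterior of Equations 7.55-7.58 and checks that, with \(\boldsymbol{w}_0=\boldsymbol{0}\) and \(\boldsymbol{V}_0=\tau^2\boldsymbol{I}\), the posterior mean coincides with the ridge estimate for \(\lambda=\sigma^2/\tau^2\):

```python
import numpy as np

def posterior_w(X, y, sigma2, w0, V0):
    """Gaussian posterior N(w | w_N, V_N) for known noise variance sigma2 (Eqs. 7.55-7.58)."""
    V0_inv = np.linalg.inv(V0)
    VN = np.linalg.inv(V0_inv + X.T @ X / sigma2)
    wN = VN @ (V0_inv @ w0 + X.T @ y / sigma2)
    return wN, VN

rng = np.random.default_rng(1)
N, D = 50, 4
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.3 * rng.normal(size=N)

sigma2, tau2 = 0.3**2, 1.0
wN, VN = posterior_w(X, y, sigma2, w0=np.zeros(D), V0=tau2 * np.eye(D))

lam = sigma2 / tau2                                  # ridge penalty implied by the prior
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
assert np.allclose(wN, w_ridge)                      # posterior mean == ridge estimate
```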

To gain insight into the posterior distribution (and not just its mode), let us consider a 1D example:

\[ y(x,\boldsymbol{w}) = w_0 + w_1 x + \epsilon \tag{7.59} \]

where the “true” parameters are \(w_0 = -0.3\) and \(w_1 = 0.5\). In Figure 7.11 we plot the prior, the likelihood, the posterior, and some samples from the posterior predictive. In particular, the right hand column plots the function \(y(x, \boldsymbol{w}^{(s)})\), where \(x\) ranges over \([-1, 1]\) and \(\boldsymbol{w}^{(s)} \sim \mathcal{N}(\boldsymbol{w}|\boldsymbol{w}_N, \boldsymbol{V}_N)\) is a sample from the parameter posterior. Initially, when we sample from the prior (first row), our predictions are “all over the place”, since our prior is uniform. After we see one data point (second row), our posterior becomes constrained by the corresponding likelihood, and our predictions pass close to the observed data. However, we see that the posterior has a ridge-like shape, reflecting the fact that there are many possible solutions, with different slopes/intercepts. This makes sense, since we cannot uniquely infer two parameters from one observation. After we see two data points (third row), the posterior becomes much narrower, and our predictions all have similar slopes and intercepts. After we observe 20 data points (last row), the posterior is essentially a delta function centered on the true value, indicated by a white cross. (The estimate converges to the truth since the data was generated from this model, and because Bayes is a consistent estimator; see Section 6.4.1 for discussion of this point.)
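
To get a feel for this sequential updating, here is a rough Python sketch (mine, not the book's plotting code; the noise level \(\sigma=0.2\) and the prior variance are assumptions made for illustration) that updates the posterior one observation at a time and then samples regression lines \(y(x,\boldsymbol{w}^{(s)})\) from it:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([-0.3, 0.5])                # [intercept, slope] from the text
sigma = 0.2                                   # assumed observation noise std
tau2 = 1.0                                    # assumed prior variance on each weight

def update(wN, VN, x, y_obs, sigma2):
    """One sequential Bayesian update with a single observation (x, y_obs)."""
    phi = np.array([1.0, x])                  # features [1, x] for intercept + slope
    VN_inv = np.linalg.inv(VN)
    V_new = np.linalg.inv(VN_inv + np.outer(phi, phi) / sigma2)
    w_new = V_new @ (VN_inv @ wN + phi * y_obs / sigma2)
    return w_new, V_new

wN, VN = np.zeros(2), tau2 * np.eye(2)        # start from the prior
for n in range(20):
    x = rng.uniform(-1, 1)
    y_obs = w_true @ np.array([1.0, x]) + sigma * rng.normal()
    wN, VN = update(wN, VN, x, y_obs, sigma**2)

# sample a few lines y(x, w^(s)) with w^(s) ~ N(w_N, V_N)
xs = np.linspace(-1, 1, 50)
lines = [w_s[0] + w_s[1] * xs for w_s in rng.multivariate_normal(wN, VN, size=5)]
print(wN)                                     # should end up close to [-0.3, 0.5]
```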

7.6.2 Computing the posterior predictive

It’s tough to make predictions, especially about the future. — Yogi Berra

In machine learning, we often care more about predictions than about interpreting the parameters. Using Equation 4.126, we can easily show that the posterior predictive distribution at a test point x is also Gaussian:

\[ \begin{aligned} p(y|\boldsymbol{x},\mathcal{D},\sigma^2) & = \int \mathcal{N}(y|\boldsymbol{x}^T\boldsymbol{w},\sigma^2)\,\mathcal{N}(\boldsymbol{w}|\boldsymbol{w}_N,\boldsymbol{V}_N)\,d\boldsymbol{w} \\ & = \mathcal{N}(y|\boldsymbol{w}_N^T\boldsymbol{x},\sigma_N^2(\boldsymbol{x})) \\ \sigma_N^2(\boldsymbol{x}) & = \sigma^2 + \boldsymbol{x}^T\boldsymbol{V}_N\boldsymbol{x} \end{aligned} \tag{7.60-62} \]

The variance in this prediction, \(\sigma_N^2(\boldsymbol{x})\), depends on two terms: the variance of the observation noise, \(\sigma^2\), and the variance in the parameters, \(\boldsymbol{V}_N\). The latter translates into variance about observations in a way which depends on how close \(\boldsymbol{x}\) is to the training data \(\mathcal{D}\). This is illustrated in Figure 7.12, where we see that the error bars get larger as we move away from the training points, representing increased uncertainty. This is important for applications such as active learning, where we want to model what we don't know as well as what we do. By contrast, the plugin approximation has constant sized error bars, since

\[ p(y|\boldsymbol{x},\mathcal{D},\sigma^2) \approx \int \mathcal{N}(y|\boldsymbol{x}^T\boldsymbol{w},\sigma^2)\,\delta_{\hat{\boldsymbol{w}}}(\boldsymbol{w})\,d\boldsymbol{w} = p(y|\boldsymbol{x},\hat{\boldsymbol{w}},\sigma^2) \tag{7.63} \]

See Figure 7.12(a).
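
A small sketch (again my own, with an assumed prior precision) of how the two kinds of error bars differ: the Bayesian predictive variance of Equation 7.62 grows as the test point moves away from the training inputs, while the plugin variance of Equation 7.63 stays at \(\sigma^2\):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 0.2**2
X = np.column_stack([np.ones(10), rng.uniform(-1, 1, size=10)])   # training inputs in [-1, 1]
y = X @ np.array([-0.3, 0.5]) + np.sqrt(sigma2) * rng.normal(size=10)

V0_inv = np.eye(2)                                   # assumed prior precision, w0 = 0
VN = np.linalg.inv(V0_inv + X.T @ X / sigma2)
wN = VN @ (X.T @ y / sigma2)

for x_star in [0.0, 2.0, 5.0]:                       # move progressively away from the data
    phi = np.array([1.0, x_star])
    var_bayes = sigma2 + phi @ VN @ phi              # Eq. 7.62: grows with distance from D
    var_plugin = sigma2                              # Eq. 7.63: constant error bars
    print(x_star, var_bayes, var_plugin)
```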

7.6.3 Bayesian inference when \(\sigma^2\) is unknown *

In this section, we apply the results in Section 4.6.3 to the problem of computing \(p(\boldsymbol{w},\sigma^2|\mathcal{D})\) for a linear regression model. This generalizes the results from Section 7.6.1, where we assumed \(\sigma^2\) was known. In the case where we use an uninformative prior, we will see some interesting connections to frequentist statistics.

7.6.3.1 Conjugate prior

As usual, the likelihood has the form

\[ p(\boldsymbol{y}|\boldsymbol{X},\boldsymbol{w},\sigma^2) = \mathcal{N}(\boldsymbol{y}|\boldsymbol{X}\boldsymbol{w},\sigma^2\boldsymbol{I}_N) \tag{7.64} \]

By analogy to Section 4.6.3, one can show that the natural conjugate prior has the following form:

\[ \begin{aligned} p(\boldsymbol{w},\sigma^2) & = \mathrm{NIG}(\boldsymbol{w},\sigma^2|\boldsymbol{w}_0,\boldsymbol{V}_0,a_0,b_0) \\ & \triangleq \mathcal{N}(\boldsymbol{w}|\boldsymbol{w}_0,\sigma^2\boldsymbol{V}_0)\,\mathrm{IG}(\sigma^2|a_0,b_0) \\ & = \dfrac{b_0^{a_0}}{(2\pi)^{D/2}|\boldsymbol{V}_0|^{\frac{1}{2}}\Gamma(a_0)}\,(\sigma^2)^{-(a_0+D/2+1)} \\ & \quad \times \exp\left[-\dfrac{(\boldsymbol{w}-\boldsymbol{w}_0)^T\boldsymbol{V}_0^{-1}(\boldsymbol{w}-\boldsymbol{w}_0)+2b_0}{2\sigma^2}\right] \end{aligned} \tag{7.65-68} \]

With this prior and likelihood, one can show that the posterior has the following form:

\[ \begin{aligned} p(\boldsymbol{w},\sigma^2|\mathcal{D}) & = \mathrm{NIG}(\boldsymbol{w},\sigma^2|\boldsymbol{w}_N,\boldsymbol{V}_N,a_N,b_N) \\ \boldsymbol{w}_N & = \boldsymbol{V}_N\left(\boldsymbol{V}_0^{-1}\boldsymbol{w}_0+\boldsymbol{X}^T\boldsymbol{y}\right) \\ \boldsymbol{V}_N & = \left(\boldsymbol{V}_0^{-1}+\boldsymbol{X}^T\boldsymbol{X}\right)^{-1} \\ a_N & = a_0 + N/2 \\ b_N & = b_0 + \dfrac{1}{2}\left(\boldsymbol{w}_0^T\boldsymbol{V}_0^{-1}\boldsymbol{w}_0+\boldsymbol{y}^T\boldsymbol{y}-\boldsymbol{w}_N^T\boldsymbol{V}_N^{-1}\boldsymbol{w}_N\right) \end{aligned} \tag{7.69-73} \]

The expressions for \(\boldsymbol{w}_N\) and \(\boldsymbol{V}_N\) are similar to the case where \(\sigma^2\) is known. The expression for \(a_N\) is also intuitive, since it just updates the counts. The expression for \(b_N\) can be interpreted as follows: it is the prior sum of squares, \(b_0\), plus the empirical sum of squares, \(\boldsymbol{y}^T\boldsymbol{y}\), plus a term due to the error in the prior on \(\boldsymbol{w}\).
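
For concreteness, here is a hedged NumPy transcription (mine) of the update equations 7.69-7.73; the helper name `nig_posterior` and the synthetic data are my own invention:

```python
import numpy as np

def nig_posterior(X, y, w0, V0, a0, b0):
    """NIG posterior parameters (w_N, V_N, a_N, b_N) for the conjugate NIG prior (Eqs. 7.69-7.73)."""
    V0_inv = np.linalg.inv(V0)
    prec_N = V0_inv + X.T @ X                # posterior precision of w (scaled by 1/sigma^2)
    VN = np.linalg.inv(prec_N)
    wN = VN @ (V0_inv @ w0 + X.T @ y)
    aN = a0 + len(y) / 2.0
    bN = b0 + 0.5 * (w0 @ V0_inv @ w0 + y @ y - wN @ prec_N @ wN)
    return wN, VN, aN, bN

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=40)
wN, VN, aN, bN = nig_posterior(X, y, np.zeros(3), 10.0 * np.eye(3), a0=0.0, b0=0.0)
print(wN, bN / (aN - 1.0))                   # posterior mean of w, and E[sigma^2] = b_N / (a_N - 1)
```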

The posterior marginals are as follows:

\[ \begin{aligned} p(\sigma^2|\mathcal{D}) & = \mathrm{IG}(\sigma^2|a_N,b_N) \\ p(\boldsymbol{w}|\mathcal{D}) & = \mathcal{T}\left(\boldsymbol{w}\,\Big|\,\boldsymbol{w}_N,\dfrac{b_N}{a_N}\boldsymbol{V}_N,2a_N\right) \end{aligned} \tag{7.74-75} \]

We give a worked example of using these equations in Section 7.6.3.3.

By analogy to Section 4.6.3.6, the posterior predictive distribution is a Student T distribution. In particular, given \(m\) new test inputs \(\tilde{\boldsymbol{X}}\), we have

\[ p(\tilde{\boldsymbol{y}}|\tilde{\boldsymbol{X}},\mathcal{D}) = \mathcal{T}\left(\tilde{\boldsymbol{y}}\,\Big|\,\tilde{\boldsymbol{X}}\boldsymbol{w}_N,\dfrac{b_N}{a_N}\left(\boldsymbol{I}_m+\tilde{\boldsymbol{X}}\boldsymbol{V}_N\tilde{\boldsymbol{X}}^T\right),2a_N\right) \tag{7.76} \]

The predictive variance has two components: \((b_N/a_N)\boldsymbol{I}_m\), due to the measurement noise, and \((b_N/a_N)\tilde{\boldsymbol{X}}\boldsymbol{V}_N\tilde{\boldsymbol{X}}^T\), due to the uncertainty in \(\boldsymbol{w}\). This latter term varies depending on how close the test inputs are to the training data.
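
Continuing that sketch, the predictive mean, scale matrix and degrees of freedom of Equation 7.76 can be assembled as follows (my own illustration; it consumes the outputs of the hypothetical `nig_posterior` helper above):

```python
import numpy as np

def nig_predictive(X_new, wN, VN, aN, bN):
    """Mean, scale matrix and degrees of freedom of the Student-t predictive in Eq. 7.76."""
    mean = X_new @ wN
    scale = (bN / aN) * (np.eye(X_new.shape[0]) + X_new @ VN @ X_new.T)
    dof = 2.0 * aN
    return mean, scale, dof
```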

It is common to set \(a_0 = b_0 = 0\), corresponding to an uninformative prior for \(\sigma^2\), and to set \(\boldsymbol{w}_0 = \boldsymbol{0}\) and \(\boldsymbol{V}_0 = g(\boldsymbol{X}^T\boldsymbol{X})^{-1}\) for any positive value \(g\). This is called Zellner's g-prior (Zellner 1986). Here \(g\) plays a role analogous to \(1/\lambda\) in ridge regression. However, the prior covariance is proportional to \((\boldsymbol{X}^T\boldsymbol{X})^{-1}\) rather than \(\boldsymbol{I}\). This ensures that the posterior is invariant to scaling of the inputs (Minka 2000b). See also Exercise 7.10.

We will see below that if we use an uninformative prior, the posterior precision given \(N\) measurements is \(\boldsymbol{V}_N^{-1} = \boldsymbol{X}^T\boldsymbol{X}\). The unit information prior is defined to contain as much information as one sample (Kass and Wasserman 1995). To create a unit information prior for linear regression, we need to use \(\boldsymbol{V}_0^{-1} = \frac{1}{N}\boldsymbol{X}^T\boldsymbol{X}\), which is equivalent to the g-prior with \(g = N\).
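
For illustration only (the helper names are mine), the g-prior and unit information prior covariances could be constructed like this:

```python
import numpy as np

def g_prior_V0(X, g):
    """Zellner's g-prior covariance: V0 = g * (X^T X)^{-1}."""
    return g * np.linalg.inv(X.T @ X)

def unit_information_V0(X):
    """Unit information prior: one sample's worth of information, i.e. g = N."""
    return g_prior_V0(X, g=X.shape[0])
```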

7.6.3.2 Uninformative prior

An uninformative prior can be obtained by considering the uninformative limit of the conjugate g-prior, which corresponds to setting \(g = \infty\). This is equivalent to an improper NIG prior with \(\boldsymbol{w}_0 = \boldsymbol{0}\), \(\boldsymbol{V}_0 = \infty\boldsymbol{I}\), \(a_0 = 0\) and \(b_0 = 0\), which gives \(p(\boldsymbol{w},\sigma^2) \propto \sigma^{-(D+2)}\).

Alternatively, we can start with the semi-conjugate prior \(p(\boldsymbol{w},\sigma^2) = p(\boldsymbol{w})p(\sigma^2)\), and take each term to its uninformative limit individually, which gives \(p(\boldsymbol{w},\sigma^2) \propto \sigma^{-2}\). This is equivalent to an improper NIG prior with \(\boldsymbol{w}_0 = \boldsymbol{0}\), \(\boldsymbol{V}_0 = \infty\boldsymbol{I}\), \(a_0 = -D/2\) and \(b_0 = 0\). The corresponding posterior is given by

\[ \begin{aligned} p(\boldsymbol{w},\sigma^2|\mathcal{D}) & = \mathrm{NIG}(\boldsymbol{w},\sigma^2|\boldsymbol{w}_N,\boldsymbol{V}_N,a_N,b_N) \\ \boldsymbol{w}_N & = \hat{\boldsymbol{w}}_{mle} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} \\ \boldsymbol{V}_N & = (\boldsymbol{X}^T\boldsymbol{X})^{-1} \\ a_N & = \dfrac{N-D}{2} \\ b_N & = \dfrac{s^2}{2} \\ s^2 & \triangleq (\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{w}}_{mle})^T(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{w}}_{mle}) \end{aligned} \tag{7.77-82} \]

The marginal distribution of the weights is given by

\[ p(\boldsymbol{w}|\mathcal{D}) = \mathcal{T}\left(\boldsymbol{w}\,\Big|\,\hat{\boldsymbol{w}},\dfrac{s^2}{N-D}\boldsymbol{C},N-D\right) \tag{7.83} \]

where \(\boldsymbol{C} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\) and \(\hat{\boldsymbol{w}}\) is the MLE. We discuss the implications of these equations below.

7.6.3.3 An example where Bayesian and frequentist inference coincide *

The use of a (semi-conjugate) uninformative prior is interesting because the resulting posterior turns out to be equivalent to the results from frequentist statistics (see also Section 4.6.3.9). In particular, from Equation 7.83 we have

\[ p(w_j|\mathcal{D}) = \mathcal{T}\left(w_j\,\Big|\,\hat{w}_j,\dfrac{C_{jj}s^2}{N-D},N-D\right) \tag{7.84} \]

This is equivalent to the sampling distribution of the MLE which is given by the following (see e.g., (Rice 1995, p542), (Casella and Berger 2002, p554)):

\[ \dfrac{\hat{w}_j-w_j}{se_j} \sim t_{N-D} \tag{7.85} \]

where

\[ se_j = \sqrt{\dfrac{C_{jj}s^2}{N-D}} \tag{7.86} \]

is the standard error of the estimated parameter. (See Section 6.2 for a discussion of sampling distributions.) Consequently, the frequentist confidence interval and the Bayesian marginal credible interval for the parameters are the same in this case.
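
The following sketch (my own, run on synthetic data rather than the caterpillar dataset) computes the Bayesian 95% marginal credible intervals via Equation 7.84 and the frequentist 95% confidence intervals via Equations 7.85-7.86; since both take the form \(\hat{w}_j \pm t_{0.975,N-D}\,se_j\), they coincide exactly:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
N, D = 30, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + 0.3 * rng.normal(size=N)

C = np.linalg.inv(X.T @ X)
w_hat = C @ X.T @ y                          # MLE / OLS estimate
s2 = (y - X @ w_hat) @ (y - X @ w_hat)       # residual sum of squares
se = np.sqrt(np.diag(C) * s2 / (N - D))      # standard errors, Eq. 7.86

# The 95% credible interval (Eq. 7.84) and the 95% confidence interval (Eq. 7.85)
# are both w_hat_j +/- t_{0.975, N-D} * se_j.
half = t.ppf(0.975, df=N - D) * se
for j in range(D):
    print(j, w_hat[j] - half[j], w_hat[j] + half[j])
```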

As a worked example of this, consider the caterpillar dataset from (Marin and Robert 2007). (The details of what the data mean don’t matter for our present purposes.) We can compute the posterior mean and standard deviation, and the 95% credible intervals (CI) for the regression coefficients using Equation 7.84. The results are shown in Table 7.2. It is easy to check that these 95% credible intervals are identical to the 95% confidence intervals computed using standard frequentist methods (see linregBayesCaterpillar for the code).

We can also use these marginal posteriors to compute if the coefficients are “significantly” different from 0. An informal way to do this (without using decision theory) is to check if its 95% CI excludes 0. From Table 7.2, we see that the CIs for coefficients 0, 1, 2, 4, 5 are all significant by this measure, so we put a little star by them. It is easy to check that these results are the same as those produced by standard frequentist software packages which compute p-values at the 5% level.

Although the correspondence between the Bayesian and frequentist results might seem appealing to some readers, recall from Section 6.6 that frequentist inference is riddled with pathologies. Also, note that the MLE does not even exist when \(N < D\), whereas the Bayesian posterior remains well defined, provided we use a proper prior.

7.6.4 Empirical Bayes for linear regression (the evidence procedure)

So far, we have assumed the prior is known. In this section, we describe an empirical Bayes procedure for picking the hyper-parameters. More precisely, we choose \(\boldsymbol{\eta} = (\alpha, \lambda)\) to maximize the marginal likelihood, where \(\lambda = 1/\sigma^2\) is the precision of the observation noise and \(\alpha\) is the precision of the prior, \(p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w}|\boldsymbol{0}, \alpha^{-1}\boldsymbol{I})\). This is known as the evidence procedure (MacKay 1995b). See Section 13.7.4 for the algorithmic details.

The evidence procedure provides an alternative to using cross validation. For example, in Figure 7.13(b), we plot the log marginal likelihood for different values of \(\alpha\), as well as the maximum value found by the optimizer. We see that, in this example, we get the same result as 5-fold CV, shown in Figure 7.13(a). (We kept \(\lambda = 1/\sigma^2\) fixed in both methods, to make them comparable.)
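
As a rough sketch of what the optimizer is doing (my own code, using the standard closed-form log evidence for a Gaussian prior \(p(\boldsymbol{w})=\mathcal{N}(\boldsymbol{w}|\boldsymbol{0},\alpha^{-1}\boldsymbol{I})\) and fixed noise precision \(\lambda\), not the pmtk implementation):

```python
import numpy as np

def log_evidence(X, y, alpha, lam):
    """log p(y|X, alpha, lam) for prior N(0, alpha^{-1} I) and noise precision lam."""
    N, D = X.shape
    A = alpha * np.eye(D) + lam * X.T @ X            # posterior precision of w
    wN = lam * np.linalg.solve(A, X.T @ y)           # posterior mean of w
    E = 0.5 * lam * np.sum((y - X @ wN) ** 2) + 0.5 * alpha * wN @ wN
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(lam) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=50)

lam = 1.0 / 0.5**2                                   # keep the noise precision fixed
alphas = np.logspace(-3, 3, 61)
scores = [log_evidence(X, y, a, lam) for a in alphas]
alpha_hat = alphas[int(np.argmax(scores))]           # empirical Bayes choice of alpha
print(alpha_hat)
```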

The principal practical advantage of the evidence procedure over CV will become apparent in Section 13.7, where we generalize the prior by allowing a different \(\alpha_j\) for every feature. This can be used to perform feature selection, using a technique known as automatic relevancy determination or ARD. By contrast, it would not be possible to use CV to tune \(D\) different hyper-parameters.

The evidence procedure is also useful when comparing different kinds of models, since it provides a good approximation to the evidence:

\[ \begin{aligned} p(\mathcal{D}|m) & = \int\!\!\int p(\mathcal{D}|\boldsymbol{w},m)\,p(\boldsymbol{w}|\boldsymbol{\eta},m)\,p(\boldsymbol{\eta}|m)\,d\boldsymbol{w}\,d\boldsymbol{\eta} \\ & \approx \int p(\mathcal{D}|\boldsymbol{w},m)\,p(\boldsymbol{w}|\hat{\boldsymbol{\eta}},m)\,d\boldsymbol{w} \end{aligned} \tag{7.87-88} \]

It is important to (at least approximately) integrate over η rather than setting it arbitrarily, for reasons discussed in Section 5.3.2.5. Indeed, this is the method we used to evaluate the marginal likelihood for the polynomial regression models in Figures 5.7 and 5.8. For a “more Bayesian” approach, in which we model our uncertainty about η rather than computing point estimates, see Section 21.5.2.

Back to the chapter table of contents