Regression Analysis 03: Estimation of Regression Parameters (1)

  • Chapter 3: Estimation of Regression Parameters (1)
  • 3.1 Least Squares Estimation
  • 3.2 Properties of the Least Squares Estimator

Chapter 3: Estimation of Regression Parameters (1)

3.1 Least Squares Estimation

Let \(y\) denote the dependent variable and \(x_1,x_2,\cdots,x_p\) denote the \(p\) explanatory variables that affect \(y\).

  • Population regression model: assume that \(y\) and \(x_1,x_2,\cdots,x_p\) satisfy the linear relationship
    \[y=\beta_0+\beta_1 x_1+\beta_2x_2+\cdots+\beta_px_p+e \ ,
    \]where \(e\) is a random error, \(\beta_0\) is called the regression constant, and \(\beta_1,\beta_2,\cdots,\beta_p\) are called the regression coefficients.
  • Population regression function: it quantifies how the conditional mean of the dependent variable depends on the explanatory variables, i.e.
    \[{\rm E}(y|x)=\beta_0+\beta_1 x_1+\beta_2x_2+\cdots+\beta_px_p \ ,
    \]and the primary goal of regression analysis is to estimate this regression function.

Suppose we have \(n\) observations of the dependent variable \(y\) and the explanatory variables \(x_1,x_2,\cdots,x_p\), namely \(\left(x_{i1},x_{i2},\cdots,x_{ip};\,y_i\right),\,i=1,2,\cdots,n\).

  • Sample regression model: the sample observations satisfy the linear equations
    \[y_i=\beta_0+\beta_1 x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip}+e_i \ , \quad i=1,2,\cdots,n \ .
    \]
  • Gauss-Markov assumptions: the random error terms \(e_i,\,i=1,2,\cdots,n\) satisfy the following assumptions:

    zero mean: \({\rm E}(e_i)=0\) ;
    homoscedasticity: \({\rm Var}(e_i)=\sigma^2\) ;
    uncorrelatedness: \({\rm Cov}(e_i,e_j)=0 \ , \ \ i\neq j\) .

If we write the linear equations of the sample regression model in matrix form as
\[Y=X\beta+e \ , \quad
Y=\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} , \quad
X=\begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix} , \quad
\beta=\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} , \quad
e=\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} ,
\]where \(X\) is called the design matrix, and also write the Gauss-Markov assumptions in matrix form as
\[{\rm E}(e)=0 \ , \quad {\rm Cov}(e)=\sigma^2I_n \ ,
\]then combining the matrix equation with the Gauss-Markov assumptions yields the most basic linear regression model
\[Y=X\beta+e \ , \quad {\rm E}(e)=0 \ , \quad {\rm Cov}(e)=\sigma^2I_n \ .
\]
Least squares estimation: find an estimate of \(\beta\) that minimizes the squared length of the error vector \(e=Y-X\beta\). Set
\[Q(\beta)=\|Y-X\beta\|^2=\left(Y-X\beta\right)'\left(Y-X\beta\right) \ .
\]Differentiating \(Q(\beta)\) with respect to \(\beta\) and setting the derivative equal to zero gives the normal equations
\[X'X\beta=X'Y \ .
\]The normal equations have a unique solution if and only if \({\rm rank}\left(X'X\right)=p+1\), which is equivalent to \({\rm rank}(X)=p+1\), i.e. \(X\) has full column rank. The unique solution is
\[\hat\beta=\left(X'X\right)^{-1}X'Y \ .
\]
The discussion above shows that \(\hat\beta\) is a stationary point of \(Q(\beta)\); we now show that \(\hat\beta\) is in fact the minimizer of \(Q(\beta)\).
For any \(\beta\in\mathbb{R}^{p+1}\) we have
\[\begin{aligned}
\|Y-X\beta\|^2&=\left\|Y-X\hat\beta+X\left(\hat\beta-\beta\right)\right\|^2 \\ \\
&=\left\|Y-X\hat\beta\right\|^2+\left\|X\left(\hat\beta-\beta\right)\right\|^2+2\left(\hat\beta-\beta\right)'X'\left(Y-X\hat\beta\right) \ .
\end{aligned}
\]Since \(\hat\beta\) satisfies the normal equations \(X'X\hat\beta=X'Y\), we have \(X'\left(Y-X\hat\beta\right)=0\), so for any \(\beta\in\mathbb{R}^{p+1}\),
\[\begin{aligned}
\|Y-X\beta\|^2&=\left\|Y-X\hat\beta\right\|^2+\left\|X\left(\hat\beta-\beta\right)\right\|^2 \ .
\end{aligned}
\]Therefore
\[Q(\beta)=\|Y-X\beta\|^2\geq \left\|Y-X\hat\beta\right\|^2=Q\left(\hat\beta\right) \ ,
\]with equality if and only if \(\beta=\hat\beta\).

We call \(\hat{Y}=X\hat\beta\) the vector of fitted values, or the projection of \(Y\). Note that
\[\hat{Y}=X\hat\beta=X\left(X'X\right)^{-1}X'Y=HY \ .
\]We call \(H=X\left(X'X\right)^{-1}X'\) the hat matrix; it is the projection matrix onto the space of the explanatory variables, by which we mean the column space of the matrix \(X\). We also call \(\hat{e}=Y-\hat{Y}=(I-H)Y\) the residual vector.
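The short numpy sketch below (not part of the original notes; the simulated data, sample sizes, and variable names are made up for illustration) mirrors the formulas above: it solves the normal equations for \(\hat\beta\), forms the hat matrix \(H\), and computes fitted values and residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                              # n observations, p explanatory variables
X0 = rng.normal(size=(n, p))              # raw explanatory variables
X = np.column_stack([np.ones(n), X0])     # design matrix with an intercept column, rank p+1
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)   # Y = X beta + e

# Least squares estimate: solve the normal equations X'X beta = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix, fitted values and residuals
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_fit = H @ y                             # fitted values: Y_hat = H Y
resid = y - y_fit                         # residuals: e_hat = (I - H) Y

print(beta_hat)                           # close to beta_true
print(np.allclose(X.T @ resid, 0))        # residuals are orthogonal to the columns of X
```

Forming \(\left(X'X\right)^{-1}\) and \(H\) explicitly is done here only to match the notation of the notes; in practice a numerically safer routine such as `np.linalg.lstsq` would normally be preferred.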

Centered model: center the original data by setting
\[x_{ij}^{c}=x_{ij}-\bar{x}_j \ , \quad \bar{x}_j=\frac{1}{n}\sum_{i=1}^n x_{ij} \ , \quad j=1,2,\cdots,p \ .
\]The sample regression model can then be rewritten as
\[y_i=\alpha+\beta_1\left(x_{i1}-\bar{x}_1\right)+\beta_2\left(x_{i2}-\bar{x}_2\right)+\cdots+\beta_p\left(x_{ip}-\bar{x}_p\right)+e_i \ , \quad i=1,2,\cdots,n \ ,
\]where \(\alpha=\beta_0+\beta_1\bar{x}_1+\beta_2\bar{x}_2+\cdots+\beta_p\bar{x}_p\). Define the design matrix as
\[X_c=\begin{pmatrix} x_{11}-\bar{x}_1 & x_{12}-\bar{x}_2 & \cdots & x_{1p}-\bar{x}_p \\ x_{21}-\bar{x}_1 & x_{22}-\bar{x}_2 & \cdots & x_{2p}-\bar{x}_p \\ \vdots & \vdots & & \vdots \\ x_{n1}-\bar{x}_1 & x_{n2}-\bar{x}_2 & \cdots & x_{np}-\bar{x}_p \end{pmatrix} \ .
\]The centered model in matrix form is
\[Y=\boldsymbol{1}_n\alpha+X_c\beta+e \ ,
\]where \(\beta=\left(\beta_1,\beta_2,\cdots,\beta_p\right)'\). Note that
\[\boldsymbol{1}_n'X_c=0 \ ,
\]so the normal equations can be written as
\[\begin{pmatrix} n & 0 \\ 0 & X_c'X_c \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix}=\begin{pmatrix} \boldsymbol{1}_n'Y \\ X_c'Y \end{pmatrix} \ .
\]Solving them gives the least squares estimates of the regression parameters:
\[\hat\alpha=\bar{y} \ , \quad \hat\beta=\left(X_c'X_c\right)^{-1}X_c'Y \ .
\]
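As a quick numerical check of the centered form (again an illustrative sketch with simulated data and made-up coefficients), the snippet below verifies that \(\hat\alpha=\bar{y}\), that the centered slope estimates agree with those of the uncentered fit, and that \(\hat\alpha=\hat\beta_0+\sum_j\hat\beta_j\bar{x}_j\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X0 = rng.normal(size=(n, p))
y = 0.5 + X0 @ np.array([1.5, -2.0]) + rng.normal(scale=0.2, size=n)

Xc = X0 - X0.mean(axis=0)                        # centered design matrix X_c (columns sum to zero)
beta_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)  # slopes: (X_c'X_c)^{-1} X_c'Y
alpha_hat = y.mean()                             # intercept of the centered model: alpha_hat = y_bar

# Compare with the uncentered fit (intercept column + raw X)
X = np.column_stack([np.ones(n), X0])
full = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_hat, full[1:]))           # same slope estimates
print(np.isclose(alpha_hat, full[0] + X0.mean(axis=0) @ full[1:]))  # alpha = beta_0 + sum_j beta_j * xbar_j
```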

Standardized model: standardize the original data by setting
\[z_{ij}=\frac{x_{ij}-\bar{x}_j}{\sqrt{L_{jj}}} \ , \quad L_{jj}=\sum_{i=1}^n\left(x_{ij}-\bar{x}_j\right)^2 \ , \quad j=1,2,\cdots,p \ ,
\]and rewrite the sample regression model in terms of these standardized variables.

Let \(Z=(z_{ij})_{n\times p}\) and write the standardized model in matrix form; solving the corresponding normal equations gives the least squares estimates of the regression parameters.

Here the matrix \(Z\) has the property
\[Z'Z=R=\left(r_{ij}\right)_{p\times p} \ ,
\]where \(r_{ij}\) is the sample correlation coefficient between the explanatory variables \(x_i\) and \(x_j\), and \(R\) is the sample correlation matrix of the explanatory variables.
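Whether \(Z'Z\) equals \(R\) exactly depends on the scaling used in the standardization. The sketch below (illustrative, with simulated data) assumes each column is centered and divided by \(\sqrt{L_{jj}}\) as above, and cross-checks \(Z'Z\) against numpy's sample correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
X0 = rng.normal(size=(30, 3))

Xc = X0 - X0.mean(axis=0)
L = (Xc ** 2).sum(axis=0)          # L_jj = sum_i (x_ij - xbar_j)^2
Z = Xc / np.sqrt(L)                # assumed standardization: z_ij = (x_ij - xbar_j) / sqrt(L_jj)

R = Z.T @ Z                        # Z'Z
print(np.allclose(R, np.corrcoef(X0, rowvar=False)))  # matches the sample correlation matrix
```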

3.2 Properties of the Least Squares Estimator

Suppose the linear regression model satisfies the Gauss-Markov assumptions, i.e.
\[Y=X\beta+e \ , \quad {\rm E}(e)=0 \ , \quad {\rm Cov}(e)=\sigma^2I_n \ .
\]We now discuss some desirable properties of the least squares estimator \(\hat\beta=\left(X'X\right)^{-1}X'Y\).

Theorem 3.2.1: For the linear regression model, the least squares estimator \(\hat\beta=\left(X'X\right)^{-1}X'Y\) has the following properties:

(1) \({\rm E}\left(\hat\beta\right)=\beta\) .

(2) \({\rm Cov}\left(\hat\beta\right)=\sigma^2\left(X'X\right)^{-1}\) .

(1) Since \({\rm E}(Y)=X\beta\), we have
\[{\rm E}\left(\hat\beta\right)=\left(X'X\right)^{-1}X'{\rm E}(Y)=\left(X'X\right)^{-1}X'X\beta=\beta \ .
\](2) Since \({\rm Cov}(Y)={\rm Cov}(e)=\sigma^2I_n\), we have
\[\begin{aligned}
{\rm Cov}\left(\hat\beta\right)&={\rm Cov}\left(\left(X'X\right)^{-1}X'Y\right) \\ \\
&=\left(X'X\right)^{-1}X'{\rm Cov}(Y)X\left(X'X\right)^{-1} \\ \\
&=\left(X'X\right)^{-1}X'\sigma^2I_nX\left(X'X\right)^{-1} \\ \\
&=\sigma^2\left(X'X\right)^{-1} \ .
\end{aligned}
\]
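Theorem 3.2.1 can be checked by a small Monte Carlo experiment (illustrative only; the design, \(\beta\), and \(\sigma\) below are made up): over repeated samples with the design \(X\) held fixed, the average of \(\hat\beta\) should be close to \(\beta\), and the empirical covariance of \(\hat\beta\) should be close to \(\sigma^2\left(X'X\right)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # fixed design matrix
beta = np.array([1.0, -0.5, 2.0])
sigma = 0.7

# Repeatedly draw new errors, keeping X fixed, and re-estimate beta each time
betas = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(scale=sigma, size=n)))
    for _ in range(20000)
])

print(betas.mean(axis=0))                     # approximately beta (unbiasedness)
print(np.cov(betas, rowvar=False))            # approximately sigma^2 (X'X)^{-1}
print(sigma ** 2 * np.linalg.inv(X.T @ X))
```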

Corollary 3.2.1: Let \(c\) be a \((p+1)\)-dimensional constant vector. We call \(c'\hat\beta\) the least squares estimator of \(c'\beta\); it has the following properties:

(1) \({\rm E}\left(c'\hat\beta\right)=c'\beta\) .

(2) \({\rm Var}\left(c'\hat\beta\right)=\sigma^2c'\left(X'X\right)^{-1}c\) .

This corollary shows that for any linear function \(c'\beta\), the estimator \(c'\hat\beta\) is an unbiased estimator of \(c'\beta\).

Theorem 3.2.2 (Gauss-Markov): For the linear regression model, among all linear unbiased estimators of \(c'\beta\), the least squares estimator \(c'\hat\beta\) is the unique minimum-variance one, i.e. the best linear unbiased estimator (BLUE).

Suppose \(a'Y\) is a linear unbiased estimator of \(c'\beta\). Then for all \(\beta\in\mathbb{R}^{p+1}\),
\[{\rm E}\left(a'Y\right)=a'X\beta=c'\beta \ ,
\]so \(a'X=c'\). Moreover,
\[\begin{aligned}
&{\rm Var}(a'Y)=\sigma^2a'a=\sigma^2\|a\|^2 \ , \\ \\
&{\rm Var}\left(c'\hat\beta\right)=\sigma^2c'\left(X'X\right)^{-1}c \ .
\end{aligned}
\]Decomposing \(\|a\|^2\) gives
\[\begin{aligned}
\|a\|^2&=\left\|a-X\left(X'X\right)^{-1}c+X\left(X'X\right)^{-1}c\right\|^2 \\ \\
&=\left\|a-X\left(X'X\right)^{-1}c\right\|^2+\left\|X\left(X'X\right)^{-1}c\right\|^2 +2c'\left(X'X\right)^{-1}X'\left(a-X\left(X'X\right)^{-1}c\right) \\ \\
&=\left\|a-X\left(X'X\right)^{-1}c\right\|^2+\left\|X\left(X'X\right)^{-1}c\right\|^2 \ .
\end{aligned}
\]The last equality holds because \(X'a=\left(a'X\right)'=c\), so that
\[\begin{aligned}
2c'\left(X'X\right)^{-1}X'\left(a-X\left(X'X\right)^{-1}c\right)&=2c'\left(X'X\right)^{-1}\left(X'a-c\right)=0 \ .
\end{aligned}
\]Substituting into the variance of \(a'Y\) gives
\[\begin{aligned}
{\rm Var}\left(a'Y\right)&=\sigma^2\|a\|^2 \\ \\
&=\sigma^2\left\|a-X\left(X'X\right)^{-1}c\right\|^2+\sigma^2\left\|X\left(X'X\right)^{-1}c\right\|^2 \\ \\
&=\sigma^2\left\|a-X\left(X'X\right)^{-1}c\right\|^2+\sigma^2c'\left(X'X\right)^{-1}X'X\left(X'X\right)^{-1}c \\ \\
&=\sigma^2\left\|a-X\left(X'X\right)^{-1}c\right\|^2+{\rm Var}\left(c'\hat\beta\right) \\ \\
&\geq{\rm Var}\left(c'\hat\beta\right) \ .
\end{aligned}
\]Equality holds if and only if \(\left\|a-X\left(X'X\right)^{-1}c\right\|=0\), i.e. \(a=X\left(X'X\right)^{-1}c\), in which case \(a'Y=c'\hat\beta\). This completes the proof.
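To make the comparison in the proof concrete, the sketch below (an illustrative example with made-up numbers) constructs another linear unbiased estimator \(a'Y\) by adding to \(X\left(X'X\right)^{-1}c\) a vector orthogonal to the columns of \(X\), so that \(a'X=c'\) still holds, and checks that its variance is at least \({\rm Var}\left(c'\hat\beta\right)\).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
c = np.array([0.0, 1.0, 0.0])                      # target linear function: c'beta = beta_1
sigma2 = 1.0

a_star = X @ np.linalg.solve(X.T @ X, c)           # a* = X (X'X)^{-1} c, which gives c'beta_hat
H = X @ np.linalg.inv(X.T @ X) @ X.T
v = (np.eye(n) - H) @ rng.normal(size=n)           # any vector with X'v = 0
a = a_star + v                                     # still satisfies a'X = c', hence still unbiased

print(np.allclose(a @ X, c))                       # unbiasedness condition a'X = c'
var_a = sigma2 * a @ a                             # Var(a'Y)  = sigma^2 ||a||^2
var_ls = sigma2 * c @ np.linalg.solve(X.T @ X, c)  # Var(c'beta_hat) = sigma^2 c'(X'X)^{-1} c
print(var_a >= var_ls, var_a, var_ls)              # the least squares estimator has the smaller variance
```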

The error variance \(\sigma^2\) reflects how strongly the model error affects the dependent variable; we now estimate \(\sigma^2\).

Note that the error vector \(e=Y-X\beta\) is not observable. Replacing \(\beta\) by \(\hat\beta\), we call
\[\hat{e}=Y-X\hat\beta=Y-\hat{Y}
\]the residual vector. Let \(x_i'\) denote the \(i\)-th row of the design matrix \(X\); then the residual of the \(i\)-th observation can be written as
\[\hat{e}_i=y_i-x_i'\hat\beta=y_i-\hat{y}_i \ , \quad i=1,2,\cdots,n \ .
\]We call \(\hat{y}_i\) the fitted value of the \(i\)-th observation and \(\hat{Y}\) the vector of fitted values.

Viewing \(\hat{e}\) as an estimate of \(e\), we define the residual sum of squares as
\[{\rm RSS}=\sum_{i=1}^n\hat{e}_i^2=\hat{e}'\hat{e}=\left\|Y-X\hat\beta\right\|^2 \ .
\]It measures, as a whole, how far the observed data deviate from the fitted regression equation.

Theorem 3.2.3: We use \({\rm RSS}\) to construct an unbiased estimator of \(\sigma^2\).

(a) \({\rm RSS}=Y'\left(I_n-X\left(X'X\right)^{-1}X'\right)Y=Y'\left(I_n-H\right)Y\) ;

(b) if we define the estimator of \(\sigma^2\) as
\[\hat\sigma^2=\frac{\rm RSS}{n-{\rm rank}(X)} \ ,
\]then \(\hat\sigma^2\) is an unbiased estimator of \(\sigma^2\).

(a) Using the hat matrix we have \(\hat{Y}=HY\), hence \(\hat{e}=\left(I_n-H\right)Y\), and therefore
\[{\rm RSS}=\hat{e}'\hat{e}=Y'(I_n-H)'(I_n-H)Y=Y'(I_n-H)Y \ .
\](b) Substituting \(Y=X\beta+e\) into the expression for \({\rm RSS}\), and noting that the cross terms vanish because \((I_n-H)X=X-X\left(X'X\right)^{-1}X'X=0\), we get
\[\begin{aligned}
{\rm RSS}&=(X\beta+e)'(I_n-H)(X\beta+e) \\ \\
&=\beta'X'(I_n-H)X\beta+e'(I_n-H)e \\ \\
&=\beta'X'X\beta-\beta'X'X(X'X)^{-1}X'X\beta+e'(I_n-H)e \\ \\
&=e'(I_n-H)e \ .
\end{aligned}
\]By Theorem 2.2.1,
\[\begin{aligned}
{\rm E}\left({\rm RSS}\right)&={\rm E}\left[e'(I_n-H)e\right] \\ \\
&=0+{\rm tr}\left[(I_n-H)\sigma^2I_n\right] \\ \\
&=\sigma^2(n-{\rm tr}(H)) \ .
\end{aligned}
\]Since the rank of a symmetric idempotent matrix equals its trace,
\[{\rm tr}(H)={\rm rank}(H)={\rm rank}(X) \ .
\]Therefore
\[{\rm E}\left({\rm RSS}\right)=\sigma^2(n-{\rm rank}(X)) \ ,
\]and consequently
\[\hat\sigma^2=\frac{\rm RSS}{n-{\rm rank}(X)}
\]is an unbiased estimator of \(\sigma^2\).
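The unbiasedness of \(\hat\sigma^2\) can be checked in the same spirit as Theorem 3.2.1 (again an illustrative simulation with made-up parameters): averaging \({\rm RSS}/(n-{\rm rank}(X))\) over many simulated error vectors should give a value close to \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # rank(X) = 4
beta = np.array([2.0, 1.0, 0.0, -1.0])
sigma = 1.5
H = X @ np.linalg.inv(X.T @ X) @ X.T
dof = n - np.linalg.matrix_rank(X)                 # n - rank(X)

sigma2_hats = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    rss = y @ (np.eye(n) - H) @ y                  # RSS = Y'(I - H)Y
    sigma2_hats.append(rss / dof)                  # sigma^2_hat = RSS / (n - rank(X))

print(np.mean(sigma2_hats), sigma ** 2)            # the two values should be close
```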

If the error vector \(e\) is normally distributed, i.e. \(e\sim N_n\left(0,\sigma^2I_n\right)\), then further properties of \(\hat\beta\) and \(\hat\sigma^2\) can be obtained.

Theorem 3.2.4: For the linear regression model, if the error vector \(e\sim N_n\left(0,\sigma^2I_n\right)\), then

(a) \(\hat\beta\sim N\left(\beta,\sigma^2\left(X'X\right)^{-1}\right)\) ;

(b) \({\rm RSS}/\sigma^2\sim\chi^2(n-{\rm rank}(X))\) ;

(c) \(\hat\beta\) and \({\rm RSS}\) are independent.

(a) Note that
\[\hat\beta=\left(X'X\right)^{-1}X'Y=\left(X'X\right)^{-1}X'(X\beta+e)=\beta+\left(X'X\right)^{-1}X'e \ .
\]By Theorem 2.3.4 and Theorem 3.2.1,
\[\hat\beta\sim N\left(\beta,\sigma^2\left(X'X\right)^{-1}\right) \ .
\](b) Note that
\[\begin{aligned}
&\frac{e}{\sigma}\sim N(0,I_n) \ , \\ \\
&\frac{\rm RSS}{\sigma^2}=\frac{e'(I_n-H)e}{\sigma^2}=\left(\frac{e}{\sigma}\right)'(I_n-H)\left(\frac{e}{\sigma}\right) \ .
\end{aligned}
\]Since the rank of a symmetric idempotent matrix equals its trace,
\[{\rm rank}(I_n-H)={\rm tr}(I_n-H)=n-{\rm tr}(H)=n-{\rm rank}(H)=n-{\rm rank}(X) \ .
\]By Theorem 2.4.3,
\[\frac{\rm RSS}{\sigma^2}\sim\chi^2\left(n-{\rm rank}(X)\right) \ .
\](c) Since \(\hat\beta=\beta+\left(X'X\right)^{-1}X'e\) and \({\rm RSS}=e'\left(I_n-H\right)e\), and since
\[\left(X'X\right)^{-1}X'\cdot\sigma^2I_n\cdot\left(I_n-H\right)=0 \ ,
\]Corollary 2.4.10 implies that \(\left(X'X\right)^{-1}X'e\) and \({\rm RSS}\) are independent, and hence \(\hat\beta\) and \({\rm RSS}\) are independent.
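Under normal errors, these distributional claims can also be examined by simulation (illustrative setup, made-up parameters): \({\rm RSS}/\sigma^2\) should have mean approximately \(n-{\rm rank}(X)\) and variance approximately \(2\left(n-{\rm rank}(X)\right)\), as expected of a \(\chi^2\) variable with that many degrees of freedom, and \(\hat\beta\) should be uncorrelated with \({\rm RSS}\) (a consequence of, though weaker than, their independence).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 0.5, -0.5])
sigma = 2.0
H = X @ np.linalg.inv(X.T @ X) @ X.T
dof = n - np.linalg.matrix_rank(X)

stats = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = y @ (np.eye(n) - H) @ y
    stats.append([beta_hat[1], rss / sigma ** 2])

stats = np.array(stats)
print(stats[:, 1].mean(), dof)                        # chi-square mean = degrees of freedom
print(stats[:, 1].var(), 2 * dof)                     # chi-square variance = 2 * degrees of freedom
print(np.corrcoef(stats[:, 0], stats[:, 1])[0, 1])    # approximately 0: beta_hat and RSS uncorrelated
```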

When the first component of \(\beta\) is \(\beta_0\), take \(c=(0,\cdots,0,1,0,\cdots,0)'\) with the \(1\) in the \((i+1)\)-th position of \(c\), so that \(c'\beta=\beta_i\) and \(c'\hat\beta=\hat\beta_i\); we then obtain the following corollary.

Corollary 3.2.2: For the linear regression model, if \(e\sim N\left(0,\sigma^2I_n\right)\), then

(a) the least squares estimator \(\hat\beta_i\) of \(\beta_i\) is distributed as
\[\hat\beta_i\sim N\left(\beta_i,\ \sigma^2c'\left(X'X\right)^{-1}c\right) \ ,
\]where \(c\) is the vector defined above;

(b) among all linear unbiased estimators of \(\beta_i\), \(\hat\beta_i\) is the unique one with the smallest variance, \(i=1,2,\cdots,p\).

Corollary 3.2.3: For the centered model, where \(\beta=\left(\beta_1,\beta_2,\cdots,\beta_p\right)'\), we have

(a) \({\rm E}\left(\hat\alpha\right)=\alpha,\,{\rm E}\left(\hat\beta\right)=\beta\) , where \(\hat\alpha=\bar{y},\,\hat\beta=\left(X_c'X_c\right)^{-1}X_c'Y\) ;

(b) \({\rm Var}\left(\hat\alpha\right)=\sigma^2/n \ , \quad {\rm Cov}\left(\hat\beta\right)=\sigma^2\left(X_c'X_c\right)^{-1} \ , \quad {\rm Cov}\left(\hat\alpha,\hat\beta\right)=0\) ;

(c) if we further assume \(e\sim N\left(0,\sigma^2I_n\right)\), then
\[\hat\alpha\sim N\left(\alpha,\frac{\sigma^2}{n}\right) \ , \quad \hat\beta\sim N\left(\beta,\sigma^2\left(X_c'X_c\right)^{-1}\right) \ ,
\]and \(\hat\alpha\) and \(\hat\beta\) are independent.

Decomposition of the total sum of squares: To measure how well the model fits the data, in addition to the residual sum of squares \({\rm RSS}\) defined above we introduce the regression sum of squares \({\rm ESS}\) and the total sum of squares \({\rm TSS}\).

  • Regression sum of squares:
    \[{\rm ESS}=\sum_{i=1}^n\left(\hat{y}_i-\bar{y}\right)^2=\left(\hat{Y}-\boldsymbol{1}_n\bar{y}\right)'\left(\hat{Y}-\boldsymbol{1}_n\bar{y}\right) \ .
    \]
  • Total sum of squares:
    \[{\rm TSS}=\sum_{i=1}^n\left(y_i-\bar{y}\right)^2=\left(Y-\boldsymbol{1}_n\bar{y}\right)'\left(Y-\boldsymbol{1}_n\bar{y}\right) \ .
    \]
  • Coefficient of determination:
    \[R^2=\frac{\rm ESS}{\rm TSS} \ .
    \]We call \(R=\sqrt{R^2}\) the multiple correlation coefficient.

To explore the relationship between \({\rm TSS},\,{\rm ESS},\,{\rm RSS}\), we need another, equivalent way of writing the normal equations. Write out the objective function
\[Q\left(\beta_0,\beta_1,\cdots,\beta_p\right)=\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_{i1}-\cdots-\beta_px_{ip}\right)^2 \ .
\]Taking partial derivatives with respect to \(\beta_0,\beta_1,\cdots,\beta_p\) and setting them equal to \(0\) gives
\[\begin{aligned}
&\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_{i1}-\cdots-\beta_px_{ip}\right)=0 \ , \\ \\
&\sum_{i=1}^nx_{ij}\left(y_i-\beta_0-\beta_1x_{i1}-\cdots-\beta_px_{ip}\right)=0 \ , \quad j=1,2,\cdots,p \ .
\end{aligned}
\]This system of equations is equivalent to \(X'X\beta=X'Y\). Since the least squares estimates \(\hat\beta_0,\hat\beta_1,\cdots,\hat\beta_p\) solve the normal equations,
\[\sum_{i=1}^n\hat{e}_i=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)=0 \ , \quad \sum_{i=1}^nx_{ij}\hat{e}_i=0 \ , \quad j=1,2,\cdots,p \ ,
\]and therefore the cross term vanishes:
\[\sum_{i=1}^n\left(y_i-\hat{y}_i\right)\left(\hat{y}_i-\bar{y}\right)=\sum_{i=1}^n\hat{e}_i\hat{y}_i-\bar{y}\sum_{i=1}^n\hat{e}_i=\hat\beta_0\sum_{i=1}^n\hat{e}_i+\sum_{j=1}^p\hat\beta_j\sum_{i=1}^nx_{ij}\hat{e}_i-\bar{y}\sum_{i=1}^n\hat{e}_i=0 \ .
\]It follows that
\[\begin{aligned}
{\rm TSS}=\sum_{i=1}^n\left(y_i-\bar{y}\right)^2&=\sum_{i=1}^n\left(y_i-\hat{y}_i+\hat{y}_i-\bar{y}\right)^2 \\ \\
&=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2+\sum_{i=1}^n\left(\hat{y}_i-\bar{y}\right)^2+2\sum_{i=1}^n\left(y_i-\hat{y}_i\right)\left(\hat{y}_i-\bar{y}\right) \\ \\
&={\rm RSS}+{\rm ESS} \ .
\end{aligned}
\]This proves the decomposition of the total sum of squares, \({\rm TSS}={\rm RSS}+{\rm ESS}\). Based on this identity we can interpret the relationship between the three sums of squares as well as the meaning of the coefficient of determination \(R^2\).

  • If the model contains no explanatory variables, i.e. \(y_i=\beta_0+e_i,\,i=1,2,\cdots,n\), one can show that \(\bar{y}\) is the least squares estimate of \(\beta_0\), so \({\rm TSS}\) is exactly the residual sum of squares of this model.
  • If the explanatory variables \(x_1,x_2,\cdots,x_p\) are included in the model, the residual sum of squares becomes the \({\rm RSS}\) term in \({\rm TSS}={\rm RSS}+{\rm ESS}\), so \({\rm ESS}\) can be interpreted as the reduction in the residual sum of squares achieved by introducing the \(p\) explanatory variables into the model.
  • Consequently, \(R^2\) measures the proportion by which the residual sum of squares is reduced after the \(p\) explanatory variables are introduced. Equivalently, \(R^2\) measures how much of the variation in the dependent variable \(y\) is explained by \(x_1,x_2,\cdots,x_p\), and \(0\leq R^2\leq1\). A numerical check of the decomposition is sketched below.
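The decomposition \({\rm TSS}={\rm RSS}+{\rm ESS}\) and the resulting \(R^2\) are easy to verify numerically; the snippet below (simulated data, made-up coefficients) computes the three sums of squares from a fitted model with an intercept and checks the identity.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ beta_hat

tss = ((y - y.mean()) ** 2).sum()      # total sum of squares
ess = ((y_fit - y.mean()) ** 2).sum()  # regression (explained) sum of squares
rss = ((y - y_fit) ** 2).sum()         # residual sum of squares

print(np.isclose(tss, ess + rss))      # TSS = ESS + RSS (the model contains an intercept)
print("R^2 =", ess / tss)              # coefficient of determination, between 0 and 1
```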

Theorem 3.2.5: For the centered model, the regression sum of squares \({\rm ESS}\) can be computed as
\[{\rm ESS}=\hat\beta'X_c'X_c\hat\beta=\hat\beta'X_c'Y \ .
\]
From the centered model we have \(\hat{Y}=\boldsymbol 1_n\hat\alpha+X_c\hat\beta\), where \(\hat\beta=\left(\hat\beta_1,\hat\beta_2,\cdots,\hat\beta_p\right)'\), and hence
\[\hat{Y}-\boldsymbol1_n\bar{y}=\hat{Y}-\boldsymbol1_n\hat\alpha=X_c\hat\beta \ .
\]Substituting into the formula for \({\rm ESS}\) gives
\[{\rm ESS}=\left(\hat{Y}-\boldsymbol{1}_n\bar{y}\right)'\left(\hat{Y}-\boldsymbol{1}_n\bar{y}\right)=\hat\beta'X_c'X_c\hat\beta=\hat\beta'X_c'Y \ .
\]
