# Regression Analysis 03: Estimation of Regression Parameters (1)


• Chapter 3: Estimation of Regression Parameters (1)
• 3.1 Least squares estimation
• 3.2 Properties of least squares estimation

### 3.1 Least Squares Estimation

Let $$y$$ denote the dependent variable and $$x_1,x_2,\cdots,x_p$$ denote the $$p$$ independent variables that affect $$y$$ .

• Population regression model: assume that $$y$$ and $$x_1,x_2,\cdots,x_p$$ satisfy the linear relationship
$y=\beta_0+\beta_1 x_1+\beta_2x_2+\cdots+\beta_px_p+e \ ,$where $$e$$ is a random error, $$\beta_0$$ is called the regression constant, and $$\beta_1,\beta_2,\cdots,\beta_p$$ are called the regression coefficients.
• Population regression function: quantitatively describes the dependence of the conditional mean of the dependent variable on the independent variables, i.e.
${\rm E}(y|x)=\beta_0+\beta_1 x_1+\beta_2x_2+\cdots+\beta_px_p \ .$The primary goal of regression analysis is to estimate the regression function.

Suppose we have $$n$$ groups of sample observations $$\left(x_{i1},x_{i2},\cdots,x_{ip};y_i\right),\,i=1,2,\cdots,n$$ on the dependent variable $$y$$ and the independent variables $$x_1,x_2,\cdots,x_p$$ .

• Sample regression model: the sample observations satisfy the linear equations
$y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip}+e_i \ , \ \ i=1,2,\cdots,n \ .$
• Gauss-Markov assumptions: the random error terms $$e_i,\,i=1,2,\cdots,n$$ satisfy:

• Zero mean: $${\rm E}(e_i)=0$$ ;
• Homoscedasticity: $${\rm Var}(e_i)=\sigma^2$$ ;
• Uncorrelatedness: $${\rm Cov}(e_i,e_j)=0 \ , \ \ i\neq j$$ .

If the linear equations of the sample regression model are written in matrix form, they become
$Y=X\beta+e \ ,$where $$X$$ is called the design matrix. The Gauss-Markov assumptions can likewise be written in matrix form as
${\rm E}(e)=0 \ , \ \ {\rm Cov}(e)=\sigma^2I_n \ .$
Writing the matrix equation and the Gauss-Markov assumptions together gives the most basic **linear regression model**:
$Y=X\beta+e \ , \ \ {\rm E}(e)=0 \ , \ \ {\rm Cov}(e)=\sigma^2I_n \ .$
**Least squares estimation**: find an estimate of $$\beta$$ that minimizes the squared length of the error vector $$e=Y-X\beta$$ . Set
$Q(\beta)=\|Y-X\beta\|^2 \ .$
Taking the derivative with respect to $$\beta$$ and setting it equal to zero yields the **normal equations**
$X'X\beta=X'Y \ .$
A necessary and sufficient condition for the normal equations to have a unique solution is $${\rm rank}\left(X'X\right)=p+1$$ , which is equivalent to $${\rm rank}(X)=p+1$$ , i.e. $$X$$ has full column rank. The unique solution of the normal equations is
$\hat\beta=\left(X'X\right)^{-1}X'Y \ .$
The discussion above only shows that $$\hat\beta$$ is a stationary point of $$Q(\beta)$$ ; we now prove that $$\hat\beta$$ is in fact the minimum point of $$Q(\beta)$$ . For any $$\beta\in\mathbb{R}^{p+1}$$ ,
\begin{aligned} \|Y-X\beta\|^2&=\left\|Y-X\hat\beta+X\left(\hat\beta-\beta\right)\right\|^2 \\ \\ &=\left\|Y-X\hat\beta\right\|^2+\left\|X\left(\hat\beta-\beta\right)\right\|^2+2\left(\hat\beta-\beta\right)'X'\left(Y-X\hat\beta\right) \ . \end{aligned}Because $$\hat\beta$$ satisfies the normal equations $$X'X\hat\beta=X'Y$$ , we have $$X'\left(Y-X\hat\beta\right)=0$$ , so for any $$\beta\in\mathbb{R}^{p+1}$$ ,
\begin{aligned} \|Y-X\beta\|^2&=\left\|Y-X\hat\beta\right\|^2+\left\|X\left(\hat\beta-\beta\right)\right\|^2 \ . \end{aligned}Therefore
$Q(\beta)=\|Y-X\beta\|^2\geq \left\|Y-X\hat\beta\right\|^2=Q\left(\hat\beta\right) \ ,$with equality if and only if $$\beta=\hat\beta$$ .

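As a quick numerical sketch (not part of the original notes; the data, dimensions, and coefficients below are made up for illustration), the normal equations $$X'X\hat\beta=X'Y$$ can be solved directly and checked against a library least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
# Design matrix with an intercept column, as in the model Y = X beta + e
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta + rng.normal(scale=0.1, size=n)

# Solve the normal equations X'X beta = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# The residual vector is orthogonal to the columns of X: X'(Y - X beta_hat) = 0
assert np.allclose(X.T @ (Y - X @ beta_hat), 0, atol=1e-8)
```

When $$X$$ has full column rank, minimizing $$Q(\beta)$$ and solving the normal equations give the same $$\hat\beta$$ .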

We call $$\hat{Y}=X\hat\beta$$ the fitted value vector, or the projection vector of $$Y$$ .

We call $$H=X\left(X'X\right)^{-1}X'$$ the hat matrix; it is the projection matrix onto the independent-variable space, i.e. the column space of the matrix $$X$$ . In addition, we call $$\hat{e}=Y-\hat{Y}=(I-H)Y$$ the residual vector.
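A small numpy check (illustrative, with arbitrary made-up data) of the stated properties: $$H$$ is symmetric and idempotent, $${\rm tr}(H)={\rm rank}(X)$$ , and the residual vector $$\hat{e}=(I-H)Y$$ is orthogonal to the column space of $$X$$ :

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = rng.normal(size=n)

# Hat matrix: projection onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(H, H.T)             # symmetric
assert np.allclose(H @ H, H)           # idempotent: projecting twice changes nothing
assert np.isclose(np.trace(H), p + 1)  # tr(H) = rank(X) = p + 1 for full-rank X

Y_hat = H @ Y                          # fitted value (projection) vector
e_hat = Y - Y_hat                      # residual vector (I - H) Y
assert np.allclose(X.T @ e_hat, 0, atol=1e-10)  # residuals orthogonal to col(X)
```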

**Centered model**: center the original data and rewrite the sample regression model as
$y_i=\alpha+\beta_1\left(x_{i1}-\bar{x}_1\right)+\cdots+\beta_p\left(x_{ip}-\bar{x}_p\right)+e_i \ , \ \ i=1,2,\cdots,n \ ,$where $$\alpha=\beta_0+\beta_1\bar{x}_1+\beta_2\bar{x}_2+\cdots+\beta_p\bar{x}_p$$ . Define the centered design matrix as
$X_c=\left(x_{ij}-\bar{x}_j\right)_{n\times p} \ ,$and write the centered model in matrix form:
$Y=\boldsymbol{1}_n\alpha+X_c\beta+e \ ,$where $$\beta=\left(\beta_1,\beta_2,\cdots,\beta_p\right)'$$ . Note that $$\boldsymbol{1}_n'X_c=0$$ , so the normal equations can be written as
$\boldsymbol{1}_n'\boldsymbol{1}_n\alpha=\boldsymbol{1}_n'Y \ , \ \ X_c'X_c\beta=X_c'Y \ ,$and the least squares estimates of the regression parameters are
$\hat\alpha=\bar{y} \ , \ \ \hat\beta=\left(X_c'X_c\right)^{-1}X_c'Y \ .$

**Standardized model**: standardize the original data by setting
$z_{ij}=\frac{x_{ij}-\bar{x}_j}{\sqrt{S_{jj}}} \ , \ \ S_{jj}=\sum_{i=1}^n\left(x_{ij}-\bar{x}_j\right)^2 \ ,$and rewrite the sample regression model in terms of the $$z_{ij}$$ . Let $$Z=\left(z_{ij}\right)_{n\times p}$$ and write the standardized model in matrix form. The least squares estimate of the regression parameters is then $$\left(Z'Z\right)^{-1}Z'Y$$ .

Here the matrix $$Z$$ has the property
$Z'Z=R=\left(r_{ij}\right)_{p\times p} \ ,$where $$r_{ij}$$ is the sample correlation coefficient of the independent variables $$x_i$$ and $$x_j$$ , and the matrix $$R$$ is the sample correlation matrix of the independent variables.
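A numpy sketch (with made-up data; scaling by the centered sums of squares $$S_{jj}$$ is the convention consistent with $$Z'Z=R$$ ) confirming that the standardized design matrix reproduces the sample correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X0 = rng.normal(size=(n, p))   # raw independent variables (no intercept column)

Xc = X0 - X0.mean(axis=0)      # centered columns
S = (Xc ** 2).sum(axis=0)      # centered sums of squares S_jj
Z = Xc / np.sqrt(S)            # standardized columns z_ij

# Z'Z equals the sample correlation matrix R of the independent variables
R = Z.T @ Z
assert np.allclose(R, np.corrcoef(X0, rowvar=False))
assert np.allclose(np.diag(R), 1.0)  # each variable has correlation 1 with itself
```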

### 3.2 Properties of Least Squares Estimation

Let the linear regression model satisfy the Gauss-Markov assumptions, i.e.
$Y=X\beta+e \ , \ \ {\rm E}(e)=0 \ , \ \ {\rm Cov}(e)=\sigma^2I_n \ .$
We now discuss some good properties of the least squares estimate $$\hat\beta=\left(X'X\right)^{-1}X'Y$$ .

**Theorem 3.2.1**: for the linear regression model, the least squares estimate $$\hat\beta=\left(X'X\right)^{-1}X'Y$$ has the following properties:

(1) $${\rm E}\left(\hat\beta\right)=\beta$$ .

(2) $${\rm Cov}\left(\hat\beta\right)=\sigma^2\left(X'X\right)^{-1}$$ .

(1) Because $${\rm E}(Y)=X\beta$$ , we have
${\rm E}\left(\hat\beta\right)=\left(X'X\right)^{-1}X'{\rm E}(Y)=\left(X'X\right)^{-1}X'X\beta=\beta \ .$(2) Because $${\rm Cov}(Y)={\rm Cov}(e)=\sigma^2I_n$$ , we have
\begin{aligned} {\rm Cov}\left(\hat\beta\right)&={\rm Cov}\left(\left(X'X\right)^{-1}X'Y\right) \\ \\ &=\left(X'X\right)^{-1}X'{\rm Cov}(Y)X\left(X'X\right)^{-1} \\ \\ &=\left(X'X\right)^{-1}X'\sigma^2I_nX\left(X'X\right)^{-1} \\ \\ &=\sigma^2\left(X'X\right)^{-1} \ . \end{aligned}

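Theorem 3.2.1 can be illustrated by simulation (a sketch, not from the notes; dimensions, coefficients, and the replication count are arbitrary): averaging $$\hat\beta$$ over many replications recovers $$\beta$$ , and the empirical covariance approaches $$\sigma^2\left(X'X\right)^{-1}$$ :

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 30, 2, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, -2.0, 3.0])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20000
# Each row of Ys is one replication of Y = X beta + e with fresh errors
Ys = X @ beta + rng.normal(scale=sigma, size=(reps, n))
# Row-wise least squares estimates: beta_hat' = Y' X (X'X)^{-1}
betas = Ys @ X @ XtX_inv

assert np.allclose(betas.mean(axis=0), beta, atol=0.02)              # E(beta_hat) = beta
assert np.allclose(np.cov(betas.T), sigma**2 * XtX_inv, atol=0.005)  # Cov(beta_hat)
```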

**Corollary 3.2.1**: let $$c$$ be a $$(p+1)$$-dimensional constant vector. We call $$c'\hat\beta$$ the least squares estimate of $$c'\beta$$ ; it has the following properties:

(1) $${\rm E}\left(c'\hat\beta\right)=c'\beta$$ .

(2) $${\rm Cov}\left(c'\hat\beta\right)=\sigma^2c'\left(X'X\right)^{-1}c$$ .

This corollary shows that for any linear function $$c'\beta$$ , $$c'\hat\beta$$ is an unbiased estimate of $$c'\beta$$ .

**Theorem 3.2.2 (Gauss-Markov)**: for the linear regression model, among all linear unbiased estimates of $$c'\beta$$ , the least squares estimate $$c'\hat\beta$$ is the unique best linear unbiased estimator (BLUE).

Suppose $$a'Y$$ is a linear unbiased estimate of $$c'\beta$$ ; then for all $$\beta\in\mathbb{R}^{p+1}$$ we must have
${\rm E}\left(a'Y\right)=a'X\beta=c'\beta \ ,$so $$a'X=c'$$ . Moreover,
\begin{aligned} &{\rm Var}(a'Y)=\sigma^2a'a=\sigma^2\|a\|^2 \ , \\ \\ &{\rm Var}\left(c'\hat\beta\right)=\sigma^2c'\left(X'X\right)^{-1}c \ . \end{aligned}Decomposing $$\|a\|^2$$ gives
\begin{aligned} \|a\|^2&=\left\|a-X\left(X'X\right)^{-1}c+X\left(X'X\right)^{-1}c\right\|^2 \\ \\ &=\left\|a-X\left(X'X\right)^{-1}c\right\|^2+\left\|X\left(X'X\right)^{-1}c\right\|^2 +2c'\left(X'X\right)^{-1}X'\left(a-X\left(X'X\right)^{-1}c\right) \\ \\ &=\left\|a-X\left(X'X\right)^{-1}c\right\|^2+\left\|X\left(X'X\right)^{-1}c\right\|^2 \ . \end{aligned}The last equality holds because
\begin{aligned} 2c'\left(X'X\right)^{-1}X'\left(a-X\left(X'X\right)^{-1}c\right)&=2c'\left(X'X\right)^{-1}\left(X'a-c\right)=0 \ . \end{aligned}Substituting into the variance of $$a'Y$$ gives
\begin{aligned} {\rm Var}\left(a'Y\right)&=\sigma^2\|a\|^2 \\ \\ &=\sigma^2\left\|a-X\left(X'X\right)^{-1}c\right\|^2+\sigma^2\left\|X\left(X'X\right)^{-1}c\right\|^2 \\ \\ &=\sigma^2\left\|a-X\left(X'X\right)^{-1}c\right\|^2+\sigma^2c'\left(X'X\right)^{-1}X'X\left(X'X\right)^{-1}c \\ \\ &=\sigma^2\left\|a-X\left(X'X\right)^{-1}c\right\|^2+{\rm Var}\left(c'\hat\beta\right) \\ \\ &\geq{\rm Var}\left(c'\hat\beta\right) \ . \end{aligned}Equality holds if and only if $$\left\|a-X\left(X'X\right)^{-1}c\right\|=0$$ , i.e. $$a=X\left(X'X\right)^{-1}c$$ , in which case $$a'Y=c'\hat\beta$$ . This completes the proof.

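The key step of the proof can be replayed numerically (a sketch with arbitrary choices of $$X$$ , $$c$$ , and the perturbation $$w$$ ): any $$a$$ with $$a'X=c'$$ yields $${\rm Var}(a'Y)=\sigma^2\|a\|^2\geq{\rm Var}\left(c'\hat\beta\right)$$ , with equality at $$a=X\left(X'X\right)^{-1}c$$ :

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 25, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
c = np.array([0.0, 1.0, -1.0])

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T

a0 = X @ XtX_inv @ c          # the minimizing choice a = X (X'X)^{-1} c
w = rng.normal(size=n)
a = a0 + (np.eye(n) - H) @ w  # adding a component orthogonal to col(X) keeps X'a = c

assert np.allclose(X.T @ a, c)  # a'Y is still unbiased for c'beta

var_aY = sigma2 * a @ a                        # Var(a'Y) = sigma^2 ||a||^2
var_blue = sigma2 * c @ XtX_inv @ c            # Var(c'beta_hat)
assert var_aY >= var_blue                      # Gauss-Markov inequality
assert np.isclose(sigma2 * a0 @ a0, var_blue)  # equality at the least squares choice
```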

The error variance $$\sigma^2$$ reflects the influence of the model error on the dependent variable. We now estimate $$\sigma^2$$ .

Note that the error vector $$e=Y-X\beta$$ is unobservable. Replacing $$\beta$$ with $$\hat\beta$$ , we call
$\hat{e}=Y-X\hat\beta$the residual vector. Let $$x_i'$$ be the $$i$$-th row of the design matrix $$X$$ ; then the residual of the $$i$$-th observation can be written as
$\hat{e}_i=y_i-x_i'\hat\beta=y_i-\hat{y}_i \ .$We call $$\hat{y}_i$$ the fitted value of the $$i$$-th observation and $$\hat{Y}$$ the fitted value vector.

Taking $$\hat{e}$$ as an estimate of $$e$$ , define the residual sum of squares as
${\rm RSS}=\sum_{i=1}^n\hat{e}_i^2=\hat{e}'\hat{e} \ .$It reflects the overall deviation of the observed data from the regression line.

**Theorem 3.2.3**: we use $${\rm RSS}$$ to construct an unbiased estimator of $$\sigma^2$$ .

(a) $${\rm RSS}=Y'\left(I_n-X\left(X'X\right)^{-1}X'\right)Y=Y'\left(I_n-H\right)Y$$ ;

(b) if the estimator of $$\sigma^2$$ is defined as
$\hat\sigma^2=\frac{\rm RSS}{n-{\rm rank}(X)} \ ,$then $$\hat\sigma^2$$ is an unbiased estimator of $$\sigma^2$$ .

(a) Using the hat matrix, $$\hat{Y}=HY$$ , so $$\hat{e}=\left(I_n-H\right)Y$$ , and therefore
${\rm RSS}=\hat{e}'\hat{e}=Y'(I_n-H)'(I_n-H)Y=Y'(I_n-H)Y \ .$(b) Substituting $$Y=X\beta+e$$ into the expression for $${\rm RSS}$$ gives
\begin{aligned} {\rm RSS}&=(X\beta+e)'(I_n-H)(X\beta+e) \\ \\ &=\beta'X'(I_n-H)X\beta+e'(I_n-H)e \\ \\ &=\beta'X'X\beta-\beta'X'X(X'X)^{-1}X'X\beta+e'(I_n-H)e \\ \\ &=e'(I_n-H)e \ . \end{aligned}By Theorem 2.2.1,
\begin{aligned} {\rm E}\left({\rm RSS}\right)&={\rm E}\left[e'(I_n-H)e\right] \\ \\ &=0+{\rm tr}\left[(I_n-H)\sigma^2I_n\right] \\ \\ &=\sigma^2(n-{\rm tr}(H)) \ . \end{aligned}Since the rank and trace of a symmetric idempotent matrix are equal,
${\rm tr}(H)={\rm rank}(H)={\rm rank}(X) \ .$Therefore
${\rm E}\left({\rm RSS}\right)=\sigma^2(n-{\rm rank}(X)) \ ,$and hence
$\hat\sigma^2=\frac{\rm RSS}{n-{\rm rank}(X)}$is an unbiased estimator of $$\sigma^2$$ .

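Both parts of the theorem can be checked numerically (a sketch; the data and dimensions are invented): the quadratic form $$Y'\left(I_n-H\right)Y$$ matches $$\hat{e}'\hat{e}$$ exactly, and averaging $$\hat\sigma^2$$ over many simulated samples approaches $$\sigma^2$$ :

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 40, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T
df = n - np.linalg.matrix_rank(X)      # n - rank(X)

# (a) RSS computed from residuals equals the quadratic form Y'(I - H)Y
Y = X @ beta + rng.normal(scale=sigma, size=n)
e_hat = Y - H @ Y
assert np.isclose(e_hat @ e_hat, Y @ (np.eye(n) - H) @ Y)

# (b) Monte-Carlo check that sigma_hat^2 = RSS / (n - rank(X)) is unbiased
Ys = X @ beta + rng.normal(scale=sigma, size=(20000, n))
RSS = ((Ys - Ys @ H) ** 2).sum(axis=1)  # row-wise residual sums of squares
assert np.isclose(RSS.mean() / df, sigma**2, rtol=0.02)
```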

If the error vector $$e$$ follows a normal distribution, i.e. $$e\sim N_n\left(0,\sigma^2I_n\right)$$ , more properties of $$\hat\beta$$ and $$\hat\sigma^2$$ can be obtained.

**Theorem 3.2.4**: for the linear regression model, if the error vector $$e\sim N_n\left(0,\sigma^2I_n\right)$$ , then

(a) $$\hat\beta\sim N\left(\beta,\sigma^2\left(X'X\right)^{-1}\right)$$ ;

(b) $${\rm RSS}/\sigma^2\sim\chi^2(n-{\rm rank}(X))$$ ;

(c) $$\hat\beta$$ and $${\rm RSS}$$ are independent of each other.

(a) Note that
$\hat\beta=\left(X'X\right)^{-1}X'Y=\left(X'X\right)^{-1}X'(X\beta+e)=\beta+\left(X'X\right)^{-1}X'e \ .$By Theorem 2.3.4 and Theorem 3.2.1,
$\hat\beta\sim N\left(\beta,\sigma^2\left(X'X\right)^{-1}\right) \ .$(b) Note that
\begin{aligned} &\frac{e}{\sigma}\sim N(0,I_n) \ , \\ \\ &\frac{\rm RSS}{\sigma^2}=\frac{e'(I_n-H)e}{\sigma^2}=\left(\frac{e}{\sigma}\right)'(I_n-H)\left(\frac{e}{\sigma}\right) \ . \end{aligned}Since the rank and trace of a symmetric idempotent matrix are equal,
${\rm rank}(I_n-H)={\rm tr}(I_n-H)=n-{\rm tr}(H)=n-{\rm rank}(H)=n-{\rm rank}(X) \ .$By Theorem 2.4.3,
$\frac{\rm RSS}{\sigma^2}\sim\chi^2\left(n-{\rm rank}(X)\right) \ .$(c) Because $$\hat\beta=\beta+\left(X'X\right)^{-1}X'e$$ and $${\rm RSS}=e'\left(I_n-H\right)e$$ , note that
$\left(X'X\right)^{-1}X'\cdot\sigma^2I_n\cdot\left(I_n-H\right)=0 \ .$By Corollary 2.4.10, $$\left(X'X\right)^{-1}X'e$$ and $${\rm RSS}$$ are independent of each other, and hence $$\hat\beta$$ and $${\rm RSS}$$ are independent of each other.

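A simulation sketch of the theorem (not in the notes; all numbers are arbitrary): $${\rm RSS}/\sigma^2$$ should have the mean and variance of a $$\chi^2\left(n-{\rm rank}(X)\right)$$ variable, and should be uncorrelated with the components of $$\hat\beta$$ :

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 15, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([0.5, 1.0, -1.5])
XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
df = n - np.linalg.matrix_rank(X)   # chi-square degrees of freedom

reps = 50000
Ys = X @ beta + rng.normal(scale=sigma, size=(reps, n))
betas = Ys @ X @ XtX_inv                             # row-wise least squares estimates
stat = ((Ys - Ys @ H) ** 2).sum(axis=1) / sigma**2   # RSS / sigma^2 per replication

assert np.isclose(stat.mean(), df, rtol=0.02)        # E[chi2(df)] = df
assert np.isclose(stat.var(), 2 * df, rtol=0.05)     # Var[chi2(df)] = 2 df
# Independence implies zero correlation between beta_hat and RSS
assert abs(np.corrcoef(betas[:, 1], stat)[0, 1]) < 0.02
```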

Since the first component of $$\beta$$ is $$\beta_0$$ , take $$c=(0,\cdots,0,1,0,\cdots,0)'$$ , where the $$1$$ is in the $$(i+1)$$-th position of $$c$$ ; then $$c'\beta=\beta_i$$ and $$c'\hat\beta=\hat\beta_i$$ .

**Corollary 3.2.2**: for the linear regression model, if $$e\sim N\left(0,\sigma^2I_n\right)$$ , then

(a) the least squares estimate $$\hat\beta_i$$ of $$\beta_i$$ is normally distributed, with mean $$\beta_i$$ and variance $$\sigma^2$$ times the $$(i+1)$$-th diagonal element of $$\left(X'X\right)^{-1}$$ ;

(b) among all linear unbiased estimates of $$\beta_i$$ , $$\hat\beta_i$$ is the unique one with the smallest variance, $$i=1,2,\cdots,p$$ .

**Corollary 3.2.3**: for the centered model, where $$\beta=\left(\beta_1,\beta_2,\cdots,\beta_p\right)'$$ , we have

(a) $${\rm E}\left(\hat\alpha\right)=\alpha,\,{\rm E}\left(\hat\beta\right)=\beta$$ , where $$\hat\alpha=\bar{y},\,\hat\beta=\left(X_c'X_c\right)^{-1}X_c'Y$$ ;

(b)

(c) if we further assume $$e\sim N\left(0,\sigma^2I_n\right)$$ , then $$\hat\alpha$$ and $$\hat\beta$$ are independent of each other.

**Decomposition of the total sum of squares**: to measure how well the model fits the data, we define the regression sum of squares $${\rm ESS}$$ and the total sum of squares $${\rm TSS}$$ , alongside the residual sum of squares $${\rm RSS}$$ already defined.

• Regression sum of squares:
${\rm ESS}=\sum_{i=1}^n\left(\hat{y}_i-\bar{y}\right)^2=\left(\hat{Y}-\boldsymbol{1}_n\bar{y}\right)'\left(\hat{Y}-\boldsymbol{1}_n\bar{y}\right) \ .$
• Total sum of squares:
${\rm TSS}=\sum_{i=1}^n\left(y_i-\bar{y}\right)^2=\left(Y-\boldsymbol{1}_n\bar{y}\right)'\left(Y-\boldsymbol{1}_n\bar{y}\right) \ .$
• Coefficient of determination:
$R^2=\frac{\rm ESS}{\rm TSS} \ .$We call $$R=\sqrt{R^2}$$ the multiple correlation coefficient.

To explore the relationship among $${\rm TSS},\,{\rm ESS},\,{\rm RSS}$$ , we need another equivalent formulation of the normal equations. Write out the objective function:
$Q\left(\beta_0,\beta_1,\cdots,\beta_p\right)=\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_{i1}-\cdots-\beta_px_{ip}\right)^2 \ .$
Taking the partial derivatives with respect to $$\beta_0,\beta_1,\cdots,\beta_p$$ and setting them equal to $$0$$ gives
\begin{aligned} &\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_{i1}-\cdots-\beta_px_{ip}\right)=0 \ , \\ \\ &\sum_{i=1}^nx_{ij}\left(y_i-\beta_0-\beta_1x_{i1}-\cdots-\beta_px_{ip}\right)=0 \ , \ \ j=1,2,\cdots,p \ . \end{aligned}
This system of equations is equivalent to $$X'X\beta=X'Y$$ . Since the least squares estimates $$\hat\beta_0,\hat\beta_1,\cdots,\hat\beta_p$$ solve the normal equations, we have
$\sum_{i=1}^n\hat{e}_i=0 \ , \ \ \sum_{i=1}^nx_{ij}\hat{e}_i=0 \ , \ \ j=1,2,\cdots,p \ .$
From the first equation, the residuals sum to zero, and together the equations imply $$\sum_{i=1}^n\hat{e}_i\hat{y}_i=0$$ . Therefore
\begin{aligned} {\rm TSS}&=\sum_{i=1}^n\left(y_i-\bar{y}\right)^2=\sum_{i=1}^n\left(y_i-\hat{y}_i+\hat{y}_i-\bar{y}\right)^2 \\ \\ &=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2+\sum_{i=1}^n\left(\hat{y}_i-\bar{y}\right)^2+2\sum_{i=1}^n\hat{e}_i\left(\hat{y}_i-\bar{y}\right) \\ \\ &={\rm RSS}+{\rm ESS} \ . \end{aligned}
This proves the decomposition of the total sum of squares, i.e. $${\rm TSS}={\rm RSS}+{\rm ESS}$$ . Based on this formula, we can explain the relationship among the three sums of squares and the meaning of the coefficient of determination $$R^2$$ .

• If the model contains no independent variables, i.e. $$y_i=\beta_0+e_i,\,i=1,2,\cdots,n$$ , it can be shown that $$\bar{y}$$ is the least squares estimate of $$\beta_0$$ , and $${\rm TSS}$$ is then the residual sum of squares of that model.
• If the independent variables $$x_1,x_2,\cdots,x_p$$ are introduced into the model, the residual sum of squares becomes the $${\rm RSS}$$ in $${\rm TSS}={\rm RSS}+{\rm ESS}$$ , so $${\rm ESS}$$ can be regarded as measuring the reduction in the residual sum of squares after introducing the $$p$$ independent variables into the model.
• Therefore $$R^2$$ measures the proportion by which the residual sum of squares is reduced after introducing the $$p$$ independent variables. Equivalently, $$R^2$$ measures the explanatory power of the independent variables $$x_1,x_2,\cdots,x_p$$ for the dependent variable $$y$$ , and $$0\leq R^2\leq1$$ .
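These statements are easy to verify numerically (a sketch with fabricated data): for a model containing an intercept, $${\rm TSS}={\rm RSS}+{\rm ESS}$$ holds exactly and $$R^2$$ lands in $$[0,1]$$ :

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
ybar = Y.mean()

TSS = ((Y - ybar) ** 2).sum()      # total sum of squares
ESS = ((Y_hat - ybar) ** 2).sum()  # regression sum of squares
RSS = ((Y - Y_hat) ** 2).sum()     # residual sum of squares

assert np.isclose(TSS, RSS + ESS)  # decomposition of the total sum of squares
R2 = ESS / TSS
assert 0 <= R2 <= 1                # R^2 measures explanatory power
```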

**Theorem 3.2.5**: for the centered model, the regression sum of squares can be computed as $${\rm ESS}=\hat\beta'X_c'Y$$ .

From the centered model, $$\hat{Y}=\boldsymbol{1}_n\hat\alpha+X_c\hat\beta$$ , where $$\hat\beta=\left(\hat\beta_1,\hat\beta_2,\cdots,\hat\beta_p\right)'$$ , so
$\hat{Y}-\boldsymbol1_n\bar{y}=\hat{Y}-\boldsymbol1_n\hat\alpha=X_c\hat\beta \ .$Substituting into the formula for $${\rm ESS}$$ gives
${\rm ESS}=\left(\hat{Y}-\boldsymbol{1}_n\bar{y}\right)'\left(\hat{Y}-\boldsymbol{1}_n\bar{y}\right)=\hat\beta'X_c'X_c\hat\beta=\hat\beta'X_c'Y \ .$
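A numerical check of the theorem for the centered model (a sketch; the data are invented, and the estimates $$\hat\alpha=\bar{y}$$ , $$\hat\beta=\left(X_c'X_c\right)^{-1}X_c'Y$$ follow Corollary 3.2.3):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 3
X0 = rng.normal(size=(n, p))                     # raw independent variables
Y = X0 @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.3, size=n)

Xc = X0 - X0.mean(axis=0)                        # centered design matrix
alpha_hat = Y.mean()                             # alpha_hat = ybar
beta_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ Y)  # beta_hat = (Xc'Xc)^{-1} Xc'Y

Y_hat = alpha_hat + Xc @ beta_hat                # fitted values
ESS = ((Y_hat - Y.mean()) ** 2).sum()            # regression sum of squares

assert np.isclose(ESS, beta_hat @ Xc.T @ Y)      # ESS = beta_hat' Xc' Y
```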