计算最小二乘线性回归 / Calculating Least Squares Linear Regression
最小二乘法(Least Squares Method)是一种数学优化技术,它通过最小化误差的平方和来寻找数据的最佳函数匹配。在回归分析中,我们的目标是找到一条直线,使得所有数据点到这条直线的垂直距离的平方和最小。
The least squares method is a mathematical optimization technique that finds the function that best fits the data by minimizing the sum of squared errors. In regression analysis, our goal is to find the line for which the sum of the squared vertical distances from the data points to the line is as small as possible.
目标函数:最小化残差平方和
Objective Function: Minimize Sum of Squared Residuals
\[S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
其中,\(y_i\) 是实际观测值,\(\hat{y}_i\) 是回归直线的预测值。
Where \(y_i\) are the actual observed values and \(\hat{y}_i\) are the predicted values from the regression line.
对于简单线性回归,我们假设自变量 \(x\) 和因变量 \(y\) 之间存在线性关系,可以用以下方程表示:
For simple linear regression, we assume a linear relationship between the independent variable \(x\) and dependent variable \(y\), which can be expressed by the following equation:
\[\hat{y} = a + bx\]
其中,\(a\) 是截距(intercept),\(b\) 是斜率(slope)。
Where \(a\) is the intercept and \(b\) is the slope.
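To make the objective concrete, the short Python sketch below evaluates \(S\) for a candidate line \(\hat{y} = a + bx\). The function name `sum_squared_residuals` and the four data points are made up for illustration and are not part of the original text.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Return S = sum((y_i - (a + b*x_i))**2) for the candidate line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

s_exact = sum_squared_residuals(xs, ys, a=1, b=2)  # the true line: every residual is 0
s_worse = sum_squared_residuals(xs, ys, a=0, b=2)  # shifted line: every residual is 1
print(s_exact, s_worse)  # 0 4
```

Least squares picks the pair \((a, b)\) that makes this quantity as small as possible; here the line through all four points gives \(S = 0\), while any other line gives a larger value.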
注意 / Note:
回归直线总是通过点 \((\bar{x}, \bar{y})\),即自变量和因变量的平均值点。
The regression line always passes through the point \((\bar{x}, \bar{y})\), the point of means of the independent and dependent variables.
要计算回归直线的斜率 \(b\) 和截距 \(a\),我们需要使用以下公式:
To calculate the slope \(b\) and intercept \(a\) of the regression line, we need to use the following formulas:
1. 斜率计算公式
1. Slope Calculation Formula
\[b = \frac{S_{xy}}{S_{xx}}\]
2. 截距计算公式
2. Intercept Calculation Formula
\[a = \bar{y} - b\bar{x}\]
公式中的 \(S_{xx}\) 和 \(S_{xy}\) 分别表示自变量的离均差平方和和自变量与因变量的离均差乘积和,它们的计算公式如下:
\(S_{xx}\) and \(S_{xy}\) in the formula represent the sum of squared deviations of the independent variable from its mean and the sum of products of deviations of the independent and dependent variables from their means, respectively. Their calculation formulas are as follows:
1. \(S_{xx}\) 计算公式
1. \(S_{xx}\) Calculation Formula
\[S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2\]
2. \(S_{xy}\) 计算公式
2. \(S_{xy}\) Calculation Formula
\[S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\]
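For readers who want to see where the slope and intercept formulas come from, the following sketch derives them by setting the partial derivatives of \(S = \sum_{i=1}^{n} (y_i - a - bx_i)^2\) to zero. It uses only the symbols defined above and assumes \(S_{xx} \neq 0\), that is, that the \(x_i\) are not all equal.

\[\frac{\partial S}{\partial a} = -2\sum_{i=1}^{n} (y_i - a - bx_i) = 0 \quad\Rightarrow\quad a = \bar{y} - b\bar{x}\]
\[\frac{\partial S}{\partial b} = -2\sum_{i=1}^{n} x_i (y_i - a - bx_i) = 0\]

Substituting \(a = \bar{y} - b\bar{x}\) into the second equation gives \(\sum_{i=1}^{n} x_i \left[(y_i - \bar{y}) - b(x_i - \bar{x})\right] = 0\). Because \(\sum_{i=1}^{n} \left[(y_i - \bar{y}) - b(x_i - \bar{x})\right] = 0\), the factor \(x_i\) may be replaced by \(x_i - \bar{x}\) without changing the sum:

\[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) - b\sum_{i=1}^{n} (x_i - \bar{x})^2 = S_{xy} - bS_{xx} = 0 \quad\Rightarrow\quad b = \frac{S_{xy}}{S_{xx}}\]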
注意 / Note:
\(S_{xx}\) 和 \(S_{xy}\) 也可以用以下等价公式计算:
\(S_{xx}\) and \(S_{xy}\) can also be calculated using the following equivalent formulas:
\[S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}\]
\[S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}\]
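The equivalence of the two forms can be checked numerically. The sketch below computes \(S_{xx}\) and \(S_{xy}\) both ways; the function names and the four data points are hypothetical and chosen only for illustration.

```python
import math


def s_xx_definition(xs):
    """S_xx as the sum of squared deviations from the mean."""
    mean_x = sum(xs) / len(xs)
    return sum((x - mean_x) ** 2 for x in xs)


def s_xx_shortcut(xs):
    """S_xx from raw sums: sum(x^2) - (sum x)^2 / n."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)


def s_xy_definition(xs, ys):
    """S_xy as the sum of products of deviations from the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def s_xy_shortcut(xs, ys):
    """S_xy from raw sums: sum(x*y) - (sum x)(sum y) / n."""
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / len(xs)


# Hypothetical data, not taken from the examples in this section
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 7.0, 8.0]

print(math.isclose(s_xx_definition(xs), s_xx_shortcut(xs)))          # True
print(math.isclose(s_xy_definition(xs, ys), s_xy_shortcut(xs, ys)))  # True
```

The raw-sum forms are convenient for hand calculation from a table of totals. In floating-point arithmetic they can lose precision when the values are large relative to their spread, so the deviation forms are often the safer choice in code.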
下面按步骤说明如何计算最小二乘线性回归方程。按顺序完成各步,可以使计算过程条理清晰、便于核对。
The following steps show how to calculate the least squares linear regression equation. Working through them in order keeps the arithmetic organized and easy to check.
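计算最小二乘线性回归方程的步骤:
Steps for Calculating the Least Squares Linear Regression Equation:
1. 确定样本量 \(n\)。
1. Determine the sample size \(n\).
2. 计算必要的总和:\(\sum x\)、\(\sum y\)、\(\sum x^2\)、\(\sum xy\)。
2. Calculate the necessary sums: \(\sum x\), \(\sum y\), \(\sum x^2\), \(\sum xy\).
3. 计算平均值 \(\bar{x}\) 和 \(\bar{y}\)。
3. Calculate the means \(\bar{x}\) and \(\bar{y}\).
4. 计算 \(S_{xx}\)。
4. Calculate \(S_{xx}\).
5. 计算 \(S_{xy}\)。
5. Calculate \(S_{xy}\).
6. 计算斜率 \(b = S_{xy} / S_{xx}\)。
6. Calculate the slope \(b = S_{xy} / S_{xx}\).
7. 计算截距 \(a = \bar{y} - b\bar{x}\)。
7. Calculate the intercept \(a = \bar{y} - b\bar{x}\).
8. 写出回归方程 \(\hat{y} = a + bx\)。
8. Write the regression equation \(\hat{y} = a + bx\).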
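The procedure can also be sketched as a small Python function. This is a minimal illustration using the raw-sum formulas from this section, not a library routine; the name `fit_line` and the test data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # Necessary sums
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Means
    mean_x = sum_x / n
    mean_y = sum_y / n

    # S_xx and S_xy from the raw-sum formulas
    s_xx = sum_x2 - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")

    # Slope, then intercept
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

The explicit check for \(S_{xx} = 0\) reflects the fact that the slope formula divides by \(S_{xx}\), so a data set in which every \(x\) is the same has no least squares line of this form.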
让我们通过一个具体的例子来演示如何计算最小二乘线性回归方程。
Let's demonstrate how to calculate the least squares linear regression equation through a specific example.
例 5.3.1 / Example 5.3.1:
研究某公司的广告支出(万元)与销售额(万元)之间的关系,收集了以下数据:
A study was conducted on the relationship between advertising expenditure (in ten thousand yuan) and sales revenue (in ten thousand yuan) for a company, and the following data were collected:
| 广告支出 / Advertising expenditure \(x\) (万元 / 10,000 yuan) | 销售额 / Sales revenue \(y\) (万元 / 10,000 yuan) |
|---|---|
| 2 | 14 |
| 3 | 18 |
| 5 | 22 |
| 7 | 30 |
| 8 | 32 |
要求计算销售额对广告支出的回归线方程。
Calculate the regression line equation of sales revenue on advertising expenditure.
解答 / Solution:
步骤 1:确定样本量 \(n = 5\)。
Step 1: Determine the sample size \(n = 5\).
步骤 2:计算必要的总和:
Step 2: Calculate necessary sums:
| x | y | x² | y² | xy |
|---|---|---|---|---|
| 2 | 14 | 4 | 196 | 28 |
| 3 | 18 | 9 | 324 | 54 |
| 5 | 22 | 25 | 484 | 110 |
| 7 | 30 | 49 | 900 | 210 |
| 8 | 32 | 64 | 1024 | 256 |
| ∑x=25 | ∑y=116 | ∑x²=151 | ∑y²=2928 | ∑xy=658 |
步骤 3:计算平均值:
Step 3: Calculate means:
\[\bar{x} = \frac{\sum x_i}{n} = \frac{25}{5} = 5\]
\[\bar{y} = \frac{\sum y_i}{n} = \frac{116}{5} = 23.2\]
步骤 4:计算 \(S_{xx}\):
Step 4: Calculate \(S_{xx}\):
\[S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 151 - \frac{25^2}{5} = 151 - 125 = 26\]
步骤 5:计算 \(S_{xy}\):
Step 5: Calculate \(S_{xy}\):
\[S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 658 - \frac{25 \times 116}{5} = 658 - 580 = 78\]
步骤 6:计算斜率 \(b\):
Step 6: Calculate slope \(b\):
\[b = \frac{S_{xy}}{S_{xx}} = \frac{78}{26} = 3\]
步骤 7:计算截距 \(a\):
Step 7: Calculate intercept \(a\):
\[a = \bar{y} - b\bar{x} = 23.2 - 3 \times 5 = 23.2 - 15 = 8.2\]
步骤 8:写出回归方程:
Step 8: Write the regression equation:
\[\hat{y} = 8.2 + 3x\]
解释 / Interpretation:
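这个回归方程表明,广告支出每增加1万元,销售额平均增加3万元。当广告支出为0时,销售额预测为8.2万元,这可能代表公司的基础销售额;但 \(x = 0\) 超出了观测数据的范围(2 至 8 万元),因此这一解释应谨慎对待。
This regression equation indicates that for each increase of 10,000 yuan in advertising expenditure, sales revenue increases by an average of 30,000 yuan. When advertising expenditure is 0, the predicted sales revenue is 82,000 yuan, which may represent the company's base sales. Because \(x = 0\) lies outside the observed range (20,000 to 80,000 yuan), this reading of the intercept should be treated with caution.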
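The arithmetic in Example 5.3.1 can be double-checked with a few lines of Python. The sketch below uses `fractions.Fraction` so that every intermediate value is exact; the variable names are chosen for this illustration only. It also confirms that the residuals of the fitted line sum to zero, which holds for any least squares line with an intercept.

```python
from fractions import Fraction

# Data from Example 5.3.1 (units: 10,000 yuan)
xs = [2, 3, 5, 7, 8]
ys = [14, 18, 22, 30, 32]
n = len(xs)

s_xx = sum(x * x for x in xs) - Fraction(sum(xs) ** 2, n)                   # 151 - 125 = 26
s_xy = sum(x * y for x, y in zip(xs, ys)) - Fraction(sum(xs) * sum(ys), n)  # 658 - 580 = 78

b = s_xy / s_xx                                      # slope: 3
a = Fraction(sum(ys), n) - b * Fraction(sum(xs), n)  # intercept: 41/5 = 8.2

# Residuals y_i - (a + b*x_i) at the observed points
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(float(a), float(b))  # 8.2 3.0
print(sum(residuals))      # 0
```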
例 5.3.2 / Example 5.3.2:
研究学生的学习时间(小时)与考试成绩之间的关系,收集了以下数据:
A study was conducted on the relationship between students' study time (in hours) and exam scores, and the following data were collected:
| 学习时间 / Study time \(x\) (小时 / hours) | 考试成绩 / Exam score \(y\) |
|---|---|
| 2 | 60 |
| 3 | 65 |
| 4 | 72 |
| 5 | 78 |
| 6 | 85 |
| 7 | 90 |
要求计算考试成绩对学习时间的回归线方程。
Calculate the regression line equation of exam scores on study time.
解答 / Solution:
步骤 1:确定样本量 \(n = 6\)。
Step 1: Determine the sample size \(n = 6\).
步骤 2:计算必要的总和:
Step 2: Calculate necessary sums:
| x | y | x² | y² | xy |
|---|---|---|---|---|
| 2 | 60 | 4 | 3600 | 120 |
| 3 | 65 | 9 | 4225 | 195 |
| 4 | 72 | 16 | 5184 | 288 |
| 5 | 78 | 25 | 6084 | 390 |
| 6 | 85 | 36 | 7225 | 510 |
| 7 | 90 | 49 | 8100 | 630 |
| ∑x=27 | ∑y=450 | ∑x²=139 | ∑y²=34418 | ∑xy=2133 |
步骤 3:计算平均值:
Step 3: Calculate means:
\[\bar{x} = \frac{\sum x_i}{n} = \frac{27}{6} = 4.5\]
\[\bar{y} = \frac{\sum y_i}{n} = \frac{450}{6} = 75\]
步骤 4:计算 \(S_{xx}\):
Step 4: Calculate \(S_{xx}\):
\[S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 139 - \frac{27^2}{6} = 139 - 121.5 = 17.5\]
步骤 5:计算 \(S_{xy}\):
Step 5: Calculate \(S_{xy}\):
\[S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 2133 - \frac{27 \times 450}{6} = 2133 - 2025 = 108\]
步骤 6:计算斜率 \(b\):
Step 6: Calculate slope \(b\):
\[b = \frac{S_{xy}}{S_{xx}} = \frac{108}{17.5} \approx 6.171\]
步骤 7:计算截距 \(a\):
Step 7: Calculate intercept \(a\):
\[a = \bar{y} - b\bar{x} \approx 75 - 6.171 \times 4.5 \approx 75 - 27.77 = 47.23\]
步骤 8:写出回归方程:
Step 8: Write the regression equation:
\[\hat{y} = 47.23 + 6.17x\]
解释 / Interpretation:
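这个回归方程表明,学习时间每增加1小时,考试成绩平均提高约6.17分。当学习时间为0时,预测的考试成绩为47.23分,这可能代表学生的基础水平;但 \(x = 0\) 超出了观测数据的范围(2 至 7 小时),因此这一解释应谨慎对待。
This regression equation indicates that for each additional hour of study time, the exam score increases by an average of approximately 6.17 points. When study time is 0, the predicted exam score is 47.23 points, which may represent the student's base level. Because \(x = 0\) lies outside the observed range (2 to 7 hours), this reading of the intercept should be treated with caution.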
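Because the slope in Example 5.3.2 is not a round number, it can help to recompute the coefficients in code and round only at the end. The following sketch does that with plain floating-point arithmetic; the helper `predict` is a hypothetical name introduced for this illustration.

```python
# Data from Example 5.3.2
xs = [2, 3, 4, 5, 6, 7]           # study time in hours
ys = [60, 65, 72, 78, 85, 90]     # exam scores
n = len(xs)

s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n                   # 139 - 121.5 = 17.5
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # 2133 - 2025 = 108

b = s_xy / s_xx                       # slope, kept unrounded
a = sum(ys) / n - b * sum(xs) / n     # intercept, kept unrounded

print(round(b, 3), round(a, 2))  # 6.171 47.23


def predict(x):
    """Predicted exam score for x hours of study, using the unrounded coefficients."""
    return a + b * x


# At the mean study time (4.5 hours) the line returns the mean score
print(round(predict(4.5), 2))  # 75.0
```

The last line illustrates the earlier note that the regression line passes through \((\bar{x}, \bar{y})\): here \(\bar{x} = 4.5\) and \(\bar{y} = 75\).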
注意事项 / Important Notes:
实用技巧 / Practical Tips: