5.3 计算最小二乘线性回归 / Calculating Least Squares Linear Regression

5.3.1 最小二乘法的数学原理 / Mathematical Principles of Least Squares

最小二乘法(Least Squares Method)是一种数学优化技术,它通过最小化误差的平方和来寻找数据的最佳函数匹配。在回归分析中,我们的目标是找到一条直线,使得所有数据点到这条直线的垂直距离的平方和最小。

The least squares method is a mathematical optimization technique that finds the best function match for the data by minimizing the sum of the squares of the errors. In regression analysis, our goal is to find a line such that the sum of the squared vertical distances from all data points to this line is minimized.

目标函数:最小化残差平方和

Objective Function: Minimize Sum of Squared Residuals

\[S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

其中,\(y_i\) 是实际观测值,\(\hat{y}_i\) 是回归直线的预测值。

Where \(y_i\) are the actual observed values and \(\hat{y}_i\) are the predicted values from the regression line.
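
为了更直观地理解目标函数,下面给出一个简短的示例(此处假设用 Python 作演示),计算任意候选直线 \(\hat{y} = a + bx\) 的残差平方和 S:

To make the objective function more concrete, here is a minimal sketch (assuming Python for illustration) that computes the sum of squared residuals S for any candidate line \(\hat{y} = a + bx\):

def sum_squared_residuals(x, y, a, b):
    """Compute S = sum of (y_i - (a + b*x_i))^2 for a candidate line y_hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Using the advertising data from Example 5.3.1 later in this section:
x = [2, 3, 5, 7, 8]
y = [14, 18, 22, 30, 32]
print(sum_squared_residuals(x, y, 8.2, 3.0))   # ≈ 2.8, the least squares line of Example 5.3.1
print(sum_squared_residuals(x, y, 8.0, 3.2))   # ≈ 7.04, a different line gives a larger S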

5.3.1.1 回归直线的数学表达式

对于简单线性回归,我们假设自变量 \(x\) 和因变量 \(y\) 之间存在线性关系,可以用以下方程表示:

For simple linear regression, we assume a linear relationship between the independent variable \(x\) and dependent variable \(y\), which can be expressed by the following equation:

\[\hat{y} = a + bx\]

其中,\(a\) 是截距(intercept),\(b\) 是斜率(slope)。

Where \(a\) is the intercept and \(b\) is the slope.

注意 / Note:

回归直线总是通过点 \((\bar{x}, \bar{y})\),即自变量和因变量的平均值点。

The regression line always passes through the point \((\bar{x}, \bar{y})\), the point of means of the independent and dependent variables.

5.3.2 计算回归系数的公式 / Formulas for Calculating Regression Coefficients

要计算回归直线的斜率 \(b\) 和截距 \(a\),我们需要使用以下公式:

To calculate the slope \(b\) and intercept \(a\) of the regression line, we need to use the following formulas:

1. 斜率计算公式

1. Slope Calculation Formula

\[b = \frac{S_{xy}}{S_{xx}}\]

2. 截距计算公式

2. Intercept Calculation Formula

\[a = \bar{y} - b\bar{x}\]

5.3.2.1 Sxx 和 Sxy 的计算

公式中的 \(S_{xx}\) 和 \(S_{xy}\) 分别表示自变量的离均差平方和,以及自变量与因变量的离均差乘积和,它们的计算公式如下:

\(S_{xx}\) and \(S_{xy}\) in these formulas denote, respectively, the sum of squared deviations of the independent variable from its mean, and the sum of products of the deviations of the independent and dependent variables from their means. They are calculated as follows:

1. Sxx 计算公式

1. Sxx Calculation Formula

\[S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2\]

2. Sxy 计算公式

2. Sxy Calculation Formula

\[S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\]

注意 / Note:

\(S_{xx}\) 和 \(S_{xy}\) 也可以用以下等价公式计算:

\(S_{xx}\) and \(S_{xy}\) can also be calculated using the following equivalent formulas:

\[S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}\]

\[S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}\]
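
下面的 Python 片段按定义式和等价计算式分别求 \(S_{xx}\) 和 \(S_{xy}\),可以看到两种写法的结果一致(仍以 Python 作演示):

The following Python snippet computes \(S_{xx}\) and \(S_{xy}\) with both the definitional and the equivalent computational formulas, and the two versions agree (Python again serves only as an illustration):

def s_xx_s_xy_definitional(x, y):
    """S_xx and S_xy as sums of deviations from the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xx, s_xy

def s_xx_s_xy_computational(x, y):
    """S_xx and S_xy via the equivalent computational formulas."""
    n = len(x)
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    return s_xx, s_xy

# Both versions give the same values up to floating point rounding,
# e.g. for the advertising data of Example 5.3.1:
x = [2, 3, 5, 7, 8]
y = [14, 18, 22, 30, 32]
print(s_xx_s_xy_definitional(x, y))    # ≈ (26.0, 78.0)
print(s_xx_s_xy_computational(x, y))   # ≈ (26.0, 78.0)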

5.3.3 计算步骤详解 / Detailed Calculation Steps

现在,我们将详细介绍如何按步骤计算最小二乘线性回归方程。按照以下步骤操作,有助于保证计算的准确性和效率;列表之后还给出了一个按这些步骤实现的代码示例。

Now we will detail how to calculate the least squares linear regression equation step by step. Following these steps helps keep the calculation accurate and efficient; a code sketch implementing the steps is given after the list.

计算最小二乘线性回归方程的步骤:

Steps for Calculating Least Squares Linear Regression Equation:

  1. 计算样本量 \(n\):确定数据点的个数。Calculate sample size \(n\): Determine the number of data points.
  2. 计算必要的总和:计算 \(\sum x_i\)、\(\sum y_i\)、\(\sum x_i^2\)、\(\sum y_i^2\) 和 \(\sum x_i y_i\)。Calculate necessary sums: Calculate \(\sum x_i\), \(\sum y_i\), \(\sum x_i^2\), \(\sum y_i^2\), and \(\sum x_i y_i\).
  3. 计算平均值:计算自变量和因变量的平均值 \(\bar{x} = \frac{\sum x_i}{n}\) 和 \(\bar{y} = \frac{\sum y_i}{n}\)。Calculate means: Calculate the means of the independent and dependent variables \(\bar{x} = \frac{\sum x_i}{n}\) and \(\bar{y} = \frac{\sum y_i}{n}\).
  4. 计算 \(S_{xx}\):使用公式 \(S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}\)。Calculate \(S_{xx}\): Use the formula \(S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}\).
  5. 计算 \(S_{xy}\):使用公式 \(S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}\)。Calculate \(S_{xy}\): Use the formula \(S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}\).
  6. 计算斜率 \(b\):使用公式 \(b = \frac{S_{xy}}{S_{xx}}\)。Calculate slope \(b\): Use the formula \(b = \frac{S_{xy}}{S_{xx}}\).
  7. 计算截距 \(a\):使用公式 \(a = \bar{y} - b\bar{x}\)。Calculate intercept \(a\): Use the formula \(a = \bar{y} - b\bar{x}\).
  8. 写出回归方程:\(\hat{y} = a + bx\)。Write the regression equation: \(\hat{y} = a + bx\).
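
下面是一个按照上述八个步骤实现的 Python 示例函数(仅作演示,未加入异常值检查等数据校验):

Below is a Python sketch that follows the eight steps above (for demonstration only; it does no data validation such as outlier checks):

def least_squares_line(x, y):
    """Fit y_hat = a + b*x by least squares, following steps 1-8 above."""
    n = len(x)                                      # Step 1: sample size
    sum_x, sum_y = sum(x), sum(y)                   # Step 2: necessary sums
    sum_x2 = sum(xi ** 2 for xi in x)               #   (the sum of y_i^2 is only needed
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))   #    for later goodness-of-fit work)
    x_bar, y_bar = sum_x / n, sum_y / n             # Step 3: means
    s_xx = sum_x2 - sum_x ** 2 / n                  # Step 4: S_xx
    s_xy = sum_xy - sum_x * sum_y / n               # Step 5: S_xy
    b = s_xy / s_xx                                 # Step 6: slope
    a = y_bar - b * x_bar                           # Step 7: intercept
    return a, b                                     # Step 8: the line is y_hat = a + b*x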

5.3.4 实例演示 / Example Demonstration

让我们通过一个具体的例子来演示如何计算最小二乘线性回归方程。

Let's demonstrate how to calculate the least squares linear regression equation through a specific example.

例 5.3.1 / Example 5.3.1:

研究某公司的广告支出(万元)与销售额(万元)之间的关系,收集了以下数据:

A study was conducted on the relationship between advertising expenditure (in ten thousand yuan) and sales revenue (in ten thousand yuan) for a company, and the following data were collected:

广告支出 (x) 销售额 (y)
2 14
3 18
5 22
7 30
8 32

要求计算销售额对广告支出的回归线方程。

Calculate the regression line equation of sales revenue on advertising expenditure.

解答 / Solution:

步骤 1:计算样本量 \(n = 5\)。

Step 1: Calculate sample size \(n = 5\).

步骤 2:计算必要的总和:

Step 2: Calculate necessary sums:

x    y    x²    y²     xy
2    14   4     196    28
3    18   9     324    54
5    22   25    484    110
7    30   49    900    210
8    32   64    1024   256
∑x=25  ∑y=116  ∑x²=151  ∑y²=2928  ∑xy=658

步骤 3:计算平均值:

Step 3: Calculate means:

\[\bar{x} = \frac{\sum x_i}{n} = \frac{25}{5} = 5\]

\[\bar{y} = \frac{\sum y_i}{n} = \frac{116}{5} = 23.2\]

步骤 4:计算 \(S_{xx}\):

Step 4: Calculate \(S_{xx}\):

\[S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 151 - \frac{25^2}{5} = 151 - 125 = 26\]

步骤 5:计算 \(S_{xy}\):

Step 5: Calculate \(S_{xy}\):

\[S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 658 - \frac{25 \times 116}{5} = 658 - 580 = 78\]

步骤 6:计算斜率 \(b\):

Step 6: Calculate slope \(b\):

\[b = \frac{S_{xy}}{S_{xx}} = \frac{78}{26} = 3\]

步骤 7:计算截距 \(a\):

Step 7: Calculate intercept \(a\):

\[a = \bar{y} - b\bar{x} = 23.2 - 3 \times 5 = 23.2 - 15 = 8.2\]

步骤 8:写出回归方程:

Step 8: Write the regression equation:

\[\hat{y} = 8.2 + 3x\]

解释 / Interpretation:

这个回归方程表明,广告支出每增加1万元,销售额平均增加3万元。当广告支出为0时,销售额预测为8.2万元,这可能代表公司的基础销售额。

This regression equation indicates that for each increase of 10,000 yuan in advertising expenditure, sales revenue increases by an average of 30,000 yuan. When advertising expenditure is 0, the predicted sales revenue is 82,000 yuan, which may represent the company's base sales.
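
作为核对(假设环境中装有 numpy),可以用 numpy.polyfit 对同一组数据拟合一次多项式,并检验回归直线确实经过 \((\bar{x}, \bar{y})\):

As a check (assuming numpy is available), numpy.polyfit can fit a degree-one polynomial to the same data, and we can also verify that the regression line passes through \((\bar{x}, \bar{y})\):

import numpy as np

x = np.array([2, 3, 5, 7, 8], dtype=float)
y = np.array([14, 18, 22, 30, 32], dtype=float)

# polyfit returns the coefficients from highest degree down: [slope, intercept]
b, a = np.polyfit(x, y, 1)
print(a, b)                                    # ≈ 8.2 and 3.0, matching the hand calculation

# The fitted line should pass through the point of means (x_bar, y_bar)
print(np.isclose(a + b * x.mean(), y.mean()))  # True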

例 5.3.2 / Example 5.3.2:

研究学生的学习时间(小时)与考试成绩之间的关系,收集了以下数据:

A study was conducted on the relationship between students' study time (in hours) and exam scores, and the following data were collected:

学习时间 (x) 考试成绩 (y)
2 60
3 65
4 72
5 78
6 85
7 90

要求计算考试成绩对学习时间的回归线方程。

Calculate the regression line equation of exam scores on study time.

解答 / Solution:

步骤 1:计算样本量 \(n = 6\)。

Step 1: Calculate sample size \(n = 6\).

步骤 2:计算必要的总和:

Step 2: Calculate necessary sums:

x    y    x²    y²     xy
2    60   4     3600   120
3    65   9     4225   195
4    72   16    5184   288
5    78   25    6084   390
6    85   36    7225   510
7    90   49    8100   630
∑x=27  ∑y=450  ∑x²=139  ∑y²=34418  ∑xy=2133

步骤 3:计算平均值:

Step 3: Calculate means:

\[\bar{x} = \frac{\sum x_i}{n} = \frac{27}{6} = 4.5\]

\[\bar{y} = \frac{\sum y_i}{n} = \frac{450}{6} = 75\]

步骤 4:计算 \(S_{xx}\):

Step 4: Calculate \(S_{xx}\):

\[S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 139 - \frac{27^2}{6} = 139 - 121.5 = 17.5\]

步骤 5:计算 \(S_{xy}\):

Step 5: Calculate \(S_{xy}\):

\[S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 2133 - \frac{27 \times 450}{6} = 2133 - 2025 = 108\]

步骤 6:计算斜率 \(b\):

Step 6: Calculate slope \(b\):

\[b = \frac{S_{xy}}{S_{xx}} = \frac{108}{17.5} \approx 6.171\]

步骤 7:计算截距 \(a\):

Step 7: Calculate intercept \(a\):

\[a = \bar{y} - b\bar{x} = 75 - 6.171 \times 4.5 \approx 75 - 27.77 = 47.23\]

步骤 8:写出回归方程:

Step 8: Write the regression equation:

\[\hat{y} = 47.23 + 6.171x\]

解释 / Interpretation:

这个回归方程表明,学习时间每增加1小时,考试成绩平均提高约6.17分。当学习时间为0时,预测的考试成绩为47.23分,这可能代表学生的基础水平。

This regression equation indicates that for each additional hour of study time, the exam score increases by an average of approximately 6.17 points. When study time is 0, the predicted exam score is 47.23 points, which may represent the student's base level.
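
作为核对(假设环境中装有 scipy),scipy.stats.linregress 可以直接给出斜率、截距以及相关系数 r:

As a check (assuming scipy is available), scipy.stats.linregress returns the slope, the intercept, and the correlation coefficient r directly:

from scipy import stats

x = [2, 3, 4, 5, 6, 7]
y = [60, 65, 72, 78, 85, 90]

result = stats.linregress(x, y)
print(result.intercept, result.slope)  # ≈ 47.23 and 6.171, matching the hand calculation
print(result.rvalue)                   # correlation coefficient r, close to 1 for these data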

5.3.5 计算中的注意事项 / Notes on Calculation

注意事项 / Important Notes:

  1. 数据质量检查:在计算回归系数之前,应检查数据中是否存在异常值或错误数据,这些可能会严重影响计算结果。Data Quality Check: Before calculating regression coefficients, check for outliers or erroneous data that may significantly affect calculation results.
  2. 线性关系假设:确保自变量和因变量之间确实存在线性关系,否则回归方程可能不适用。Linear Relationship Assumption: Ensure there is indeed a linear relationship between the independent and dependent variables; otherwise, the regression equation may not be applicable.
  3. 计算精度:在手动计算过程中,保留足够的小数位数以确保结果的准确性。Calculation Precision: During manual calculations, retain sufficient decimal places to ensure result accuracy.
  4. 单位一致性:确保每个变量的单位在整个数据集中保持一致,并在解释斜率和截距时考虑单位的影响。Unit Consistency: Ensure that each variable's units are used consistently throughout the data, and take the units into account when interpreting the slope and intercept.
  5. 避免外推:回归方程通常只在原始数据的范围内有效,避免对超出范围的值进行预测(示例见本列表之后)。Avoid Extrapolation: Regression equations are usually only valid within the range of the original data; avoid making predictions for values outside this range (see the sketch after this list).
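
针对第 5 条注意事项(避免外推),下面给出一个简短的示例(沿用 Python 演示;predict 为此处自拟的辅助函数),在预测点超出原始数据范围时给出提示:

Following up on note 5 (avoid extrapolation), here is a short sketch (again in Python; predict is a hypothetical helper named here for illustration) that flags predictions outside the range of the original data:

def predict(a, b, x_new, x_data):
    """Return y_hat = a + b*x_new, warning when x_new lies outside the observed x range."""
    if not (min(x_data) <= x_new <= max(x_data)):
        print(f"warning: x = {x_new} is outside the observed range "
              f"[{min(x_data)}, {max(x_data)}]; extrapolation may be unreliable")
    return a + b * x_new

# Example 5.3.1: advertising expenditure was observed only between 2 and 8 (ten thousand yuan)
x = [2, 3, 5, 7, 8]
print(predict(8.2, 3.0, 6, x))    # ≈ 26.2, within the observed range
print(predict(8.2, 3.0, 20, x))   # ≈ 68.2, but the warning flags the extrapolation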

实用技巧 / Practical Tips:

  • 使用表格记录计算过程中的各个步骤,有助于减少错误。
    Use tables to record each step in the calculation process to help reduce errors.
  • 计算完成后,可以验证回归直线是否通过点 \((\bar{x}, \bar{y})\),作为检查计算正确性的一种方法。
    After completing the calculation, you can verify whether the regression line passes through the point \((\bar{x}, \bar{y})\) as a way to check the correctness of the calculation.
  • 对于较大的数据集,可以使用计算器或统计软件来辅助计算,提高效率和准确性。
    For larger datasets, calculators or statistical software can be used to assist calculations, improving efficiency and accuracy.