Chapter Review 5

Chapter 5: Correlation and Regression - Review Exercises

Exercise 1: Scatter Diagram and Correlation Analysis
The following data show the weekly study time (hours) and exam scores of 10 students:
Student  Study time x (hours)  Exam score y
1 5 65
2 8 72
3 12 85
4 6 68
5 10 80
6 7 74
7 15 92
8 9 78
9 11 88
10 4 60
  1. Draw a scatter diagram and describe the pattern of the relationship
  2. Calculate the product moment correlation coefficient r
  3. Interpret the meaning and strength of the correlation coefficient
Answer and Explanation

1. Scatter Diagram Analysis

Plotting exam score (y) against study time (x) gives points that rise from lower left to upper right and lie close to a straight line. The two variables therefore show a strong positive linear correlation: students who study longer tend to score higher.

2. Calculate Product Moment Correlation Coefficient r

First, calculate the necessary statistics:

  • n = 10
  • Σx = 5+8+12+6+10+7+15+9+11+4 = 87
  • Σy = 65+72+85+68+80+74+92+78+88+60 = 762
  • Σxy = (5×65)+(8×72)+(12×85)+(6×68)+(10×80)+(7×74)+(15×92)+(9×78)+(11×88)+(4×60) = 6937
  • Σx² = 5²+8²+12²+6²+10²+7²+15²+9²+11²+4² = 861
  • Σy² = 65²+72²+85²+68²+80²+74²+92²+78²+88²+60² = 59026

Then substitute into the formula for r:

\[r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]

\[r = \frac{10\times6937 - 87\times762}{\sqrt{[10\times861 - 87^2][10\times59026 - 762^2]}}\]

\[r = \frac{69370 - 66294}{\sqrt{[8610 - 7569][590260 - 580644]}}\]

\[r = \frac{3076}{\sqrt{1041 \times 9616}}\]

\[r = \frac{3076}{\sqrt{10010256}}\]

\[r = \frac{3076}{3163.90}\]

\[r \approx 0.972\]

3. Interpretation of Correlation Coefficient

The correlation coefficient r ≈ 0.972 indicates a very strong positive linear relationship between study time and exam scores. Because the value is close to 1, the points lie close to an upward-sloping straight line, and longer study time is associated with higher scores. Note that r measures association only: on its own it does not show that extra study time causes the higher scores.
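The hand calculation for Exercise 1 can be cross-checked with a few lines of code. The following is a minimal sketch in plain Python (standard library only); the function name pearson_r is an illustrative choice, not part of the exercise. It rebuilds the five sums from the raw data and applies the same formula for r. Since r must always lie between -1 and 1, a result outside that range signals an arithmetic slip in one of the sums.

```python
from math import sqrt

# Data from Exercise 1: weekly study time (x, hours) and exam score (y)
x = [5, 8, 12, 6, 10, 7, 15, 9, 11, 4]
y = [65, 72, 85, 68, 80, 74, 92, 78, 88, 60]


def pearson_r(xs, ys):
    """Product moment correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(x, y)
assert -1.0 <= r <= 1.0  # sanity check: r can never leave this range
print(f"r = {r:.3f}")
```

Printing the individual sums inside the function is an easy way to locate which intermediate value disagrees with a hand calculation.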

Exercise 2: Linear Regression Analysis
A company recorded its monthly advertising expenses (in thousand yuan) and sales revenue (in ten thousand yuan):
Month  Advertising expenses x (thousand yuan)  Sales revenue y (ten thousand yuan)
1 2 5
2 3 7
3 4 8
4 5 10
5 6 12
6 7 14
  1. Calculate the least squares regression line equation y = a + bx
  2. Interpret the meaning of the regression coefficient b
  3. Estimate the sales revenue if the company plans to spend 9 thousand yuan on advertising next month
Answer and Explanation

1. Calculate Regression Line Equation

First, calculate the necessary statistics:

  • n = 6
  • Σx = 2+3+4+5+6+7 = 27
  • Σy = 5+7+8+10+12+14 = 56
  • Σxy = (2×5)+(3×7)+(4×8)+(5×10)+(6×12)+(7×14) = 283
  • Σx² = 2²+3²+4²+5²+6²+7² = 139

Calculate the slope b:

\[b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\]

\[b = \frac{6\times283 - 27\times56}{6\times139 - 27^2}\]

\[b = \frac{1698 - 1512}{834 - 729}\]

\[b = \frac{186}{105}\]

\[b \approx 1.7714\]

Calculate the intercept a:

\[a = \bar{y} - b\bar{x}\]

\[a = \frac{56}{6} - \frac{186}{105}\times\frac{27}{6}\]

\[a = 9.3333 - 7.9714\]

\[a \approx 1.3619\]

Therefore, the regression line equation is: \[y = 1.3619 + 1.7714x\]

2. Interpretation of Regression Coefficient b

The regression coefficient b ≈ 1.7714 means that each additional 1 thousand yuan of advertising is associated with an average increase in sales revenue of about 1.7714 ten thousand yuan (roughly 17,714 yuan). The positive slope shows that higher advertising spending goes with higher sales in these six months of data; by itself it does not prove that advertising causes the increase.

3. Sales Revenue Prediction

When x = 9 thousand yuan, substituting into the regression equation:

y = 1.3619 + 1.7714 × 9 = 1.3619 + 15.9426 = 17.3045 ten thousand yuan

Therefore, the predicted sales revenue is approximately 17.30 ten thousand yuan (about 173,000 yuan). Because x = 9 lies outside the observed range of 2 to 7 thousand yuan, this is an extrapolation and should be treated with caution.
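The slope, intercept and prediction for Exercise 2 can be recomputed from the raw data as a check on the hand arithmetic. The following is a minimal sketch in plain Python; the names least_squares_line, x_new and is_extrapolation are illustrative assumptions, not part of the exercise. It also flags whether the requested x value falls outside the observed range.

```python
# Data from Exercise 2
x = [2, 3, 4, 5, 6, 7]       # advertising expenses, thousand yuan
y = [5, 7, 8, 10, 12, 14]    # sales revenue, ten thousand yuan


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(p * q for p, q in zip(xs, ys))
    sum_x2 = sum(p * p for p in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n   # a = y_bar - b * x_bar
    return a, b


a, b = least_squares_line(x, y)

x_new = 9                                            # planned advertising spend
y_hat = a + b * x_new                                # predicted sales revenue
is_extrapolation = not (min(x) <= x_new <= max(x))   # outside observed x range?

print(f"y = {a:.4f} + {b:.4f}x")
print(f"prediction at x = {x_new}: {y_hat:.2f} (extrapolation: {is_extrapolation})")
```

Computing a from the unrounded b, as done here, avoids the small rounding drift that appears when a four-decimal slope is reused by hand.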

Exercise 3: Correlation and Causation
Researchers found a positive correlation between ice cream sales and the number of drownings in a city. Answer the following questions:
  1. Does this correlation mean that eating ice cream causes drowning? Explain.
  2. What lurking variable might affect both of these variables?
  3. How can we determine whether there is a causal relationship between two variables?
Answer and Explanation

1. Difference between Correlation and Causation

No. The correlation does not mean that eating ice cream causes drowning. A correlation shows only that two variables tend to vary together; on its own it cannot establish that one causes the other. Here the positive correlation is most plausibly produced by a third factor that affects both variables, not by ice cream consumption itself.

2. Potential Variables

The most likely lurking variable is temperature (or season). In summer, when temperatures are higher, people buy more ice cream and are also more likely to go swimming, which raises the number of drownings. Temperature is therefore a common factor driving both variables, and it creates the positive correlation between them.

3. Methods to Determine Causal Relationships

To establish a causal relationship, the following conditions need to be met:

  • Temporal order: The cause must occur before the effect
  • Association: There must be a statistically significant association between the variables
  • Elimination of other explanations: Potential confounding variables need to be controlled for or ruled out
  • Experimental evidence: Ideally, causation is demonstrated through a randomized controlled trial (RCT)

In the ice cream and drowning example, we can compare drowning figures at different temperatures, or examine the relationship between ice cream sales and drownings while holding temperature fixed. If the association disappears once temperature is controlled for, it was produced by the confounder and not by a causal link.
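The idea of controlling for temperature can be illustrated with a small simulation. The sketch below uses SYNTHETIC data invented purely for illustration (the coefficients, noise levels and sample size are all assumptions, not real observations): temperature drives both ice cream sales and a drowning index, and the two have no direct link. It then compares the raw correlation with the first-order partial correlation, which removes the linear effect of temperature from both variables.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient (deviation form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    return s_xy / sqrt(s_xx * s_yy)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# SYNTHETIC data: every number below is made up for this illustration.
rng = random.Random(0)  # fixed seed so the run is repeatable
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]    # depends on temperature only
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]  # depends on temperature only

r_raw = pearson_r(ice_cream, drownings)
r_controlled = partial_r(
    r_raw,
    pearson_r(ice_cream, temperature),
    pearson_r(drownings, temperature),
)

print(f"raw correlation:                   {r_raw:.3f}")
print(f"correlation given temperature:     {r_controlled:.3f}")
```

In this setup the raw correlation is clearly positive while the partial correlation is close to zero, which is the pattern expected when a confounder, not a causal link, produces the association. Partial correlation only adjusts for linear effects of the variables included, so it supports a causal argument but does not replace an experiment.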

Exercise 4: Comprehensive Application
A study recorded the age (in years) and height (in centimeters) of 8 children:
Child  Age x (years)  Height y (cm)
1 2 85
2 3 90
3 4 98
4 5 105
5 6 112
6 7 118
7 8 125
8 9 132
  1. Calculate the product moment correlation coefficient between age and height
  2. Calculate the least squares regression line equation
  3. Estimate the height of a 10-year-old child
  4. Discuss the reliability of this prediction
Answer and Explanation

1. Calculate Product Moment Correlation Coefficient

First, calculate the necessary statistics:

  • n = 8
  • Σx = 2+3+4+5+6+7+8+9 = 44
  • Σy = 85+90+98+105+112+118+125+132 = 865
  • Σxy = (2×85)+(3×90)+(4×98)+(5×105)+(6×112)+(7×118)+(8×125)+(9×132) = 5043
  • Σx² = 2²+3²+4²+5²+6²+7²+8²+9² = 284
  • Σy² = 85²+90²+98²+105²+112²+118²+125²+132² = 95471

Calculate the correlation coefficient r:

\[r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]

\[r = \frac{8\times5043 - 44\times865}{\sqrt{[8\times284 - 44^2][8\times95471 - 865^2]}}\]

\[r = \frac{40344 - 38060}{\sqrt{[2272 - 1936][763768 - 748225]}}\]

\[r = \frac{2284}{\sqrt{336 \times 15543}}\]

\[r = \frac{2284}{\sqrt{5222448}}\]

\[r = \frac{2284}{2285.27}\]

\[r \approx 0.999\]

2. Calculate Regression Line Equation

Calculate the regression coefficient b:

\[b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\]

\[b = \frac{8\times5043 - 44\times865}{8\times284 - 44^2}\]

\[b = \frac{40344 - 38060}{2272 - 1936}\]

\[b = \frac{2284}{336}\]

\[b \approx 6.7976\]

Calculate the intercept a:

\[a = \bar{y} - b\bar{x}\]

\[a = \frac{865}{8} - \frac{2284}{336}\times\frac{44}{8}\]

\[a = 108.125 - 37.3869\]

\[a \approx 70.7381\]

Therefore, the regression line equation is: \[y = 70.7381 + 6.7976x\]

3. Height Prediction

When x = 10 years, substituting into the regression equation:

y = 70.7381 + 6.7976 × 10 = 70.7381 + 67.976 = 138.7141 cm

Therefore, the predicted height of a 10-year-old child is approximately 138.71 cm.

4. Discussion on Prediction Reliability

The reliability of this prediction is affected by the following factors:

  • Correlation strength: r ≈ 0.999, so the points lie almost exactly on a straight line between ages 2 and 9, and the regression line fits the observed data very well
  • Sample size: The small sample size (only 8 data points) limits how precisely the slope and intercept are estimated
  • Extrapolation issue: The predicted x value (10 years) is just outside the range of the original data (2 to 9 years), so the prediction assumes the same linear growth continues; this is a slight extrapolation, and predictions further beyond the range become increasingly unreliable
  • Individual differences: Height is influenced by many factors, including genetics and nutrition, so the prediction for any single child may be off

Overall, the prediction is a reasonable estimate of the average height of 10-year-olds, but it may not be precise for an individual child.
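All three numerical parts of Exercise 4 can be checked together. The sketch below is a minimal illustration in plain Python and deliberately uses the deviation form (S_xx, S_yy, S_xy), which is algebraically equivalent to the raw-sum formulas used in the worked solution but is not the form written there. The names s_xx, s_yy, s_xy and predict are illustrative assumptions.

```python
from math import sqrt

# Data from Exercise 4
x = [2, 3, 4, 5, 6, 7, 8, 9]                  # age, years
y = [85, 90, 98, 105, 112, 118, 125, 132]     # height, cm

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sums of squared deviations and cross products about the means
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r = s_xy / sqrt(s_xx * s_yy)   # product moment correlation coefficient
b = s_xy / s_xx                # least squares slope
a = mean_y - b * mean_x        # least squares intercept


def predict(age):
    """Predicted height in cm; warns when the age is outside the data range."""
    if not min(x) <= age <= max(x):
        print(f"warning: age {age} is outside the observed range {min(x)}-{max(x)}")
    return a + b * age


print(f"r = {r:.3f}")
print(f"y = {a:.4f} + {b:.4f}x")
print(f"predicted height at age 10: {predict(10):.2f} cm")
```

Note that b and r share the same numerator, so a slope and a correlation computed from the same data must always have the same sign; this is a quick consistency check on any hand calculation.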