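Outlier practice problems: mastering outlier identification and data cleaning
The four practice problems below cover identifying outliers using quartiles and using the standard deviation.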
Some data are collected. \( Q_1 = 46 \) and \( Q_3 = 68 \).
A value greater than \( Q_3 + 1.5 \times (Q_3 - Q_1) \) or less than \( Q_1 - 1.5 \times (Q_3 - Q_1) \) is defined as an outlier.
Using this rule, work out whether or not the following are outliers:
a) 7
b) 88
c) 105
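Solution:
First compute \( \text{IQR} = 68 - 46 = 22 \).
Lower fence: \( 46 - 1.5 \times 22 = 13 \).
Upper fence: \( 68 + 1.5 \times 22 = 101 \).
a) \( 7 < 13 \), so 7 is an outlier.
b) \( 13 \le 88 \le 101 \), so 88 is not an outlier.
c) \( 105 > 101 \), so 105 is an outlier.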
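The fence calculation above can be sketched in Python as a quick check. This is a minimal illustration, not part of the original worksheet; the helper names `iqr_fences` and `is_outlier` are hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences placed k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, lower, upper):
    """A value is an outlier only if it lies strictly outside the fences."""
    return x < lower or x > upper


lower, upper = iqr_fences(46, 68)
results = {x: is_outlier(x, lower, upper) for x in (7, 88, 105)}

print(lower, upper)  # 13.0 101.0
print(results)       # {7: True, 88: False, 105: True}
```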
The masses of male and female turtles are given in grams. For males, the lower quartile was 400 g and the upper quartile was 580 g. For females, the lower quartile was 260 g and the upper quartile was 340 g.
An outlier is an observation that lies more than \( 1 \times \) the interquartile range above the upper quartile or more than \( 1 \times \) the interquartile range below the lower quartile.
a) Which of these male turtle masses would be outliers?
400 g, 260 g, 550 g, 640 g
b) Which of these female turtle masses would be outliers?
170 g, 300 g, 340 g, 440 g
c) What is the largest mass a male turtle can be without being an outlier?
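Solution:
Male turtles: \( \text{IQR} = 580 - 400 = 180 \).
Lower fence: \( 400 - 180 = 220 \).
Upper fence: \( 580 + 180 = 760 \).
a) All four masses lie between 220 g and 760 g (in particular \( 260 > 220 \) and \( 640 < 760 \)), so none of the male masses is an outlier.
Female turtles: \( \text{IQR} = 340 - 260 = 80 \).
Lower fence: \( 260 - 80 = 180 \).
Upper fence: \( 340 + 80 = 420 \).
b) \( 170 < 180 \) and \( 440 > 420 \), so 170 g and 440 g are outliers; 300 g and 340 g are not.
c) The largest mass a male turtle can have without being an outlier is 760 g, taking a value exactly on the fence as not an outlier.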
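The turtle problem applies the same quartile rule with a multiplier of 1 instead of 1.5. A minimal Python sketch follows, assuming a value exactly on a fence is not counted as an outlier; the function name `find_outliers` is hypothetical.

```python
def find_outliers(values, q1, q3, k=1.0):
    """Return the values strictly outside the fences, plus the fences themselves."""
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


male_outliers, male_fences = find_outliers([400, 260, 550, 640], q1=400, q3=580)
female_outliers, female_fences = find_outliers([170, 300, 340, 440], q1=260, q3=340)

print(male_fences, male_outliers)      # (220.0, 760.0) []
print(female_fences, female_outliers)  # (180.0, 420.0) [170, 440]
```

The upper male fence, 760 g, is the answer to part c) under the same boundary assumption.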
The masses of arctic foxes are found and the mean mass was 6.1 kg. The variance was 4.2.
An outlier is an observation which lies more than 2 standard deviations from the mean.
a) Which of these arctic fox masses are outliers?
2.4 kg, 10.1 kg, 3.7 kg, 11.5 kg
b) What are the smallest and largest masses that an arctic fox can be without being an outlier?
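Solution:
Standard deviation: \( \sigma = \sqrt{4.2} \approx 2.05 \).
Lower limit: \( 6.1 - 2 \times 2.05 = 2.0 \).
Upper limit: \( 6.1 + 2 \times 2.05 = 10.2 \).
a) \( 11.5 > 10.2 \), so 11.5 kg is an outlier. The other three masses lie between 2.0 kg and 10.2 kg, so they are not outliers.
b) The smallest mass is 2.0 kg and the largest mass is 10.2 kg.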
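The standard-deviation rule for the arctic foxes can be checked the same way. This sketch takes the mean and variance given in the problem; the helper name `sd_limits` is hypothetical, and the limits are rounded only for display.

```python
import math


def sd_limits(mean, variance, k=2.0):
    """Return (lower, upper) limits k standard deviations either side of the mean."""
    sd = math.sqrt(variance)
    return mean - k * sd, mean + k * sd


low, high = sd_limits(6.1, 4.2)
masses = [2.4, 10.1, 3.7, 11.5]
outliers = [m for m in masses if m < low or m > high]

print(round(low, 1), round(high, 1))  # 2.0 10.2
print(outliers)                       # [11.5]
```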
The ages of nine people at a children's birthday party are recorded. \( \Sigma x = 92 \) and \( \Sigma x^2 = 1428 \).
a) Calculate the mean and standard deviation of the ages. (3 marks)
An outlier is an observation which lies more than 2 standard deviations from the mean. One of the ages is recorded as 30.
b) State, with a reason, whether or not this is an outlier. (2 marks)
c) Suggest a reason why this age could be a legitimate data value. (1 mark)
d) Given that all nine people were children, clean the data and recalculate the mean and standard deviation. (3 marks)
Solution:
a) Mean: \( \bar{x} = \frac{92}{9} \approx 10.22 \).
Variance: \( \sigma^2 = \frac{1428}{9} - \left( \frac{92}{9} \right)^2 \approx 54.17 \).
Standard deviation: \( \sigma = \sqrt{54.17} \approx 7.36 \).
b) Upper limit: \( 10.22 + 2 \times 7.36 = 24.94 \).
\( 30 > 24.94 \), so 30 is an outlier.
c) An adult, such as a parent or supervisor, may have been at the party, so an age of 30 could be a genuine value.
d) Removing 30 gives \( n = 8 \), \( \Sigma x = 92 - 30 = 62 \) and \( \Sigma x^2 = 1428 - 30^2 = 528 \).
New mean: \( \bar{x} = \frac{62}{8} = 7.75 \).
New variance: \( \sigma^2 = \frac{528}{8} - 7.75^2 = 66 - 60.06 = 5.94 \).
New standard deviation: \( \sigma = \sqrt{5.94} \approx 2.44 \).
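The summary-statistic calculations in this problem can be verified with a short Python sketch. It assumes the population formula \( \sigma^2 = \frac{\Sigma x^2}{n} - \bar{x}^2 \) used in the solution; the helper name `mean_sd` is hypothetical.

```python
import math


def mean_sd(n, sum_x, sum_x2):
    """Mean and population standard deviation from n, sum of x and sum of x squared."""
    mean = sum_x / n
    variance = sum_x2 / n - mean ** 2
    return mean, math.sqrt(variance)


# Original data: nine ages, including the recorded value 30.
mean, sd = mean_sd(9, 92, 1428)
upper = mean + 2 * sd

# Cleaned data: remove 30 from the count and from both sums.
clean_mean, clean_sd = mean_sd(8, 92 - 30, 1428 - 30 ** 2)

print(round(mean, 2), round(sd, 2), round(upper, 2))  # 10.22 7.36 24.94
print(clean_mean, round(clean_sd, 2))                 # 7.75 2.44
```

Removing the single value 30 cuts the standard deviation from about 7.36 to about 2.44, which shows how strongly one extreme observation can inflate the spread.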