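Outliers - identifying and handling extreme values in data
Outliers are extreme values that deviate from the overall pattern of the data. They may be legitimate observations, or they may be erroneous data (called "anomalies").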
Based on the quartiles (IQR):
A value is an outlier if \( \text{value} > Q_3 + k(Q_3 - Q_1) \) or \( \text{value} < Q_1 - k(Q_3 - Q_1) \), where \( k \) is given in the question (commonly 1.5).
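The quartile rule can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names `iqr_fences` and `find_outliers` are illustrative, and the quartiles are passed in rather than computed, because different textbooks and libraries use different quartile conventions.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Return the values lying strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical quartiles and data, for illustration only.
print(iqr_fences(10, 20))                              # (-5.0, 35.0)
print(find_outliers([-8, 3, 12, 18, 25, 40], 10, 20))  # [-8, 40]
```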
Based on the mean and standard deviation:
A value is an outlier if \( \text{value} > \bar{x} + k\sigma \) or \( \text{value} < \bar{x} - k\sigma \), where \( k \) is given in the question (commonly 2).
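The mean and standard deviation rule can be sketched in the same way. This sketch assumes the population standard deviation (dividing by \( n \)), which is what `statistics.pstdev` computes; the function name `sd_outliers` and the sample data are illustrative.

```python
import statistics


def sd_outliers(values, k=2):
    """Return values more than k population standard deviations from the mean."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # divides by n, not n - 1
    lower, upper = mean - k * sigma, mean + k * sigma
    return [v for v in values if v < lower or v > upper]


# Hypothetical data, for illustration only.
print(sd_outliers([10, 12, 11, 13, 12, 30]))  # [30]
```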
Remove erroneous outliers (anomalies) so that they do not mislead the data analysis.
The blood glucose levels of 30 females are recorded. The results, in mmol/litre, are shown below:
1.7, 2.2, 2.2, 3.2, 3.2, 3.5, 2.7, 3.1, 3.2, 3.6, 3.7, 3.7, 3.7, 3.8, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0, 4.0, 4.0, 4.0, 4.4, 4.5, 4.6, 4.7, 4.8, 5.0, 5.1
An outlier is an observation that falls more than \( 1.5 \times \) the interquartile range above the upper quartile, or more than \( 1.5 \times \) the interquartile range below the lower quartile.
a) Find the quartiles.
b) Find any outliers.
a) Calculate the quartiles:
b) Identify any outliers:
The lengths, in cm, of 12 giant African land snails are given below:
17, 18, 18, 19, 20, 20, 20, 20, 21, 23, 24, 32
a) Calculate the mean and standard deviation, given that \( \Sigma x = 252 \) and \( \Sigma x^2 = 5468 \).
b) An outlier is an observation which lies more than 2 standard deviations from the mean, that is, outside \( \bar{x} \pm 2\sigma \). Identify any outliers for these data.
a) Calculate the mean and standard deviation:
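Working from the summary statistics given in the question, with \( n = 12 \), and assuming the population form of the standard deviation (dividing by \( n \)):

\[ \bar{x} = \frac{\Sigma x}{n} = \frac{252}{12} = 21 \text{ cm} \]

\[ \sigma = \sqrt{\frac{\Sigma x^2}{n} - \bar{x}^2} = \sqrt{\frac{5468}{12} - 21^2} = \sqrt{14.67} \approx 3.83 \text{ cm} \]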
b) Identify any outliers:
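With a mean of 21 cm and a standard deviation of about 3.83 cm, the limits are roughly \( 21 - 2 \times 3.83 = 13.34 \) and \( 21 + 2 \times 3.83 = 28.66 \), so 32 cm is the only value outside them. The check can be sketched in Python; this is a minimal sketch, assuming the population standard deviation (`statistics.pstdev`), and the variable names are illustrative.

```python
import statistics

# Snail lengths in cm, as listed in the question.
lengths = [17, 18, 18, 19, 20, 20, 20, 20, 21, 23, 24, 32]

mean = statistics.mean(lengths)   # 252 / 12 = 21
sd = statistics.pstdev(lengths)   # population standard deviation, about 3.83

k = 2  # multiplier given in the question
lower = mean - k * sd             # about 13.34
upper = mean + k * sd             # about 28.66

outliers = [x for x in lengths if x < lower or x > upper]
print(outliers)  # [32]
```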