The document discusses various quantitative techniques for summarizing data, including measures of central tendency (mean, median, mode) and dispersion (range, quartiles, standard deviation). It provides formulas for calculating the mean, median, mode, variance and standard deviation. An example is given of calculating the mean from a table of dice roll results.
2. Mean, Median, Mode and Range
The mean, median and mode are types of average.
The range gives a measure of the spread of a set of data.
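As a minimal sketch of these four summaries, the snippet below uses Python's standard `statistics` module. The data set is hypothetical and was chosen only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical data set, used only for illustration.
data = [2, 3, 3, 5, 7, 10]

data_mean = mean(data)              # total of the values divided by how many there are
data_median = median(data)          # even count, so the average of the two middle values
data_modes = multimode(data)        # list of the most frequent value(s)
data_range = max(data) - min(data)  # largest value minus smallest value

print(data_mean, data_median, data_modes, data_range)
```

`multimode` returns a list because a data set can have more than one mode.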
6. Finding the mean from a table of data
Example
A dice was rolled 20 times. On each roll the dice showed a value from 1 to 6.
The results have been recorded in the table below:
Find the mean.
7. Divide the total of all the data values by the number of data values. In this case, divide 68 by 20, giving a mean of 3.4.
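The calculation can be sketched in Python as follows. The original results table is not reproduced here, so the frequencies below are hypothetical: they were chosen only so that the totals match the 20 rolls and the sum of 68 quoted above.

```python
# Hypothetical frequency table: face value -> number of times it was rolled.
# These counts are an assumption; they only reproduce the stated totals.
frequencies = {1: 3, 2: 3, 3: 4, 4: 5, 5: 3, 6: 2}

# Number of data values: add up the frequencies.
number_of_rolls = sum(frequencies.values())

# Total of all data values: each face value times its frequency.
total_score = sum(face * count for face, count in frequencies.items())

# Mean: total of the data values divided by the number of data values.
mean_score = total_score / number_of_rolls

print(number_of_rolls, total_score, mean_score)  # 20 68 3.4
```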
11. Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample)
$\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median:
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
$\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
$\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
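The measures on this slide can be sketched in Python as below. The function and parameter names are hypothetical, chosen for illustration. In `grouped_median`, `lower_bound` is assumed to be the lower boundary of the median interval, `freq_below` the sum of the frequencies of all intervals below it, `freq_median` the frequency of the median interval, and `width` the interval width. The trimmed mean drops a fixed count `k` from each end, which is one common convention among several.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k would remove every value")
    return mean(ordered[k:len(ordered) - k])


def grouped_median(lower_bound, n, freq_below, freq_median, width):
    """Interpolated median for grouped data (all arguments are assumptions
    about how the grouped table is summarised; see the text above)."""
    return lower_bound + ((n / 2 - freq_below) / freq_median) * width


def estimated_mode(mean_value, median_value):
    """Empirical relation mean - mode = 3 * (mean - median), solved for mode."""
    return mean_value - 3 * (mean_value - median_value)


# Hypothetical inputs, for illustration only.
print(weighted_mean([70, 80, 90], [1, 2, 1]))   # 80.0
print(trimmed_mean([1, 4, 5, 6, 100], 1))       # mean of [4, 5, 6]
print(grouped_median(20, 100, 30, 40, 10))      # 25.0
print(estimated_mode(10, 9))                    # 7
```

The empirical relation is only an approximation, so `estimated_mode` gives a rough estimate rather than the actual most frequent value.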
12. Symmetric vs. Skewed Data
October 25, 2022 Data Mining: Concepts and Techniques
Median, mean and mode of symmetric, positively skewed and negatively skewed data
[Figure: three distribution curves, labeled positively skewed, negatively skewed and symmetric]
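One rough way to see this relationship in numbers is to compare the mean with the median: a long right tail tends to pull the mean above the median, and a long left tail tends to pull it below. The sketch below applies that heuristic. The function name and the data sets are hypothetical, and the comparison is only a quick indication, not a formal test of skewness.

```python
from statistics import mean, median


def skew_direction(values):
    """Heuristic label based on comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positively skewed"
    if m < med:
        return "negatively skewed"
    return "symmetric"  # mean equals median; suggestive, not conclusive


# Hypothetical data sets, for illustration only.
print(skew_direction([1, 2, 2, 3, 3, 3, 4, 4, 5]))  # symmetric
print(skew_direction([1, 1, 2, 2, 3, 10]))          # positively skewed
print(skew_direction([1, 8, 9, 9, 10, 10]))         # negatively skewed
```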
13. Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 − Q1
Five-number summary: min, Q1, median, Q3, max
Boxplot: ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
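Variance and standard deviation (sample: s, population: σ)
Variance (algebraic, scalable computation):
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
Standard deviation s (or σ) is the square root of variance s² (or σ²)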
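The dispersion measures above can be sketched in Python as follows. The function names and data sets are hypothetical. Quartile conventions differ between textbooks and tools; the `"inclusive"` method of `statistics.quantiles` is used here as one common choice, which is an assumption rather than the slide's stated method.

```python
from math import sqrt
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def find_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


def sample_variance(values):
    """s^2 from squared deviations about the sample mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_from_sums(values):
    """s^2 from the running sums of x and x^2 (the second form above)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def population_variance(values):
    """sigma^2 as the mean of the squares minus the squared mean."""
    size = len(values)
    mu = sum(values) / size
    return sum(x * x for x in values) / size - mu ** 2


# Hypothetical data sets, for illustration only.
data = [2, 4, 4, 5, 7, 8, 9, 30]
summary = five_number_summary(data)
outliers = find_outliers(data)

scores = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(scores)
s2_sums = sample_variance_from_sums(scores)
sigma = sqrt(population_variance(scores))

print(summary, outliers)
print(s2, s2_sums, sigma)
```

The sum-based form needs only running totals, which is what makes it suitable for a single pass over large data. In floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the deviation-based form is the safer default for small data sets.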