Central Tendency and Standard Deviation
More on Distributions
-
Central Tendency
-
Variability
Sample Size
-
How many people should you ask?
-
Suppose you wanted to know who was going to win a presidential
election
-
How confident would you be if you asked one person how they
would vote?
-
Two people? Ten people?
-
One hundred? One thousand? Ten thousand? All
voters?
-
Clearly, up to a point, more is better
-
At some point, there are diminishing marginal returns
-
Many (very accurate) national polls are based on only 1000-2000
respondents.
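The diminishing returns can be made concrete with the usual approximate 95% margin of error for a sample proportion, 1.96 * sqrt(p(1-p)/n). This is only a sketch: it assumes a simple random sample and a close race (p = 0.5), and the function name `margin_of_error` is illustrative rather than anything from the lecture.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample of size n; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (1, 10, 100, 1000, 10000):
    print(f"n = {n:>5}: +/- {100 * margin_of_error(n):.1f} percentage points")
```

Under these assumptions, going from 100 to 1000 respondents cuts the margin from about 10 points to about 3, while going from 1000 to 10000 only cuts it to about 1.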
Sample size and Distributions
-
How large should a sample be?
-
To answer this question more specifically, we must think
about distributions.
-
Samples must be moderately large, because results vary from one sample
to the next and so form a distribution.
-
We'll talk more about precisely how large over the course
of the semester.
-
Recently, we thought about how to visualize a distribution
with a graph.
-
How can we summarize a distribution?
Central Tendency and Variability
-
How could we describe a well-behaved distribution?
Central Tendency
-
When the distribution has one peak, a measure of central
tendency makes sense.
-
Not a good measure for multi-peaked distributions
-
Measures of central tendency
-
Mean (arithmetic average)
-
Median
-
Mode
The mean
-
The mean is familiar as the arithmetic average of a set of
numbers
-
$\text{Mean} = \frac{1}{N} \sum_{i=1}^{N} x_i$
-
$\text{Mean} = \frac{1}{N} (x_1 + x_2 + x_3 + \dots + x_N)$
-
In this example
-
Mean = 1/48 (8373) = 174.4375
A problem with the mean
-
The mean is not resistant to outliers
-
Without this outlier
-
Mean = 1/47 (8172) = 173.8723
-
Removing this one observation lowers the mean by about 0.57
-
If this outlier had been 1000, the mean would be
1/48 (8172 + 1000) = 191.0833
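The arithmetic on this slide can be checked from the two totals alone. The sketch below uses only the sums and counts quoted above (8373 over 48 observations, 8172 over 47); the helper name `mean_from_total` is hypothetical.

```python
def mean_from_total(total, n):
    """Mean computed from a total and a count."""
    return total / n

total_all = 8373        # sum of all 48 observations (from the slide)
total_without = 8172    # sum of the 47 observations without the outlier
outlier = total_all - total_without   # the one observation that was removed

mean_all = mean_from_total(total_all, 48)
mean_without = mean_from_total(total_without, 47)
mean_if_1000 = mean_from_total(total_without + 1000, 48)

print(outlier)                  # 201
print(round(mean_all, 4))       # 174.4375
print(round(mean_without, 4))   # 173.8723
print(round(mean_if_1000, 4))   # 191.0833
```

A single observation out of 48 is enough to move the mean by more than 16 units when that observation is extreme.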
The mean and skew
-
The extreme values in the tail of a skewed distribution pull
the mean into the tail.
The Median
-
The Median is the observation at the 50th percentile
-
Half of the observations are above the median
-
Half are below
-
Finding the median (M): first sort the observations
-
When N is odd: M is the middle observation
-
When N is even: M is the mean of the two middle observations.
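The odd/even rule for the median can be sketched as a short function. The data are hypothetical, not the lecture's data set.

```python
def median(values):
    """Sort, then take the middle observation (N odd) or the mean of
    the two middle observations (N even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data
print(median([171, 168, 180, 175, 173]))        # 173
print(median([171, 168, 180, 175, 173, 177]))   # 174.0
```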
The median is resistant to outliers
-
The median is unaffected by a single outlier
-
No matter what the value of the largest observation here,
the median is still 173.
-
The median is also resistant to skew
-
Median values are often used for skewed distributions
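To see the difference in resistance, the sketch below changes only the largest observation in a small hypothetical data set (not the lecture's data) and compares the mean and the median, using Python's standard `statistics` module.

```python
import statistics

# Hypothetical heights in cm; the last value plays the role of the outlier
heights = [165, 170, 172, 173, 175, 178, 201]

for largest in (201, 1000):
    data = heights[:-1] + [largest]
    print(largest, round(statistics.mean(data), 2), statistics.median(data))
# 201 176.29 173
# 1000 290.43 173
```

The mean follows the outlier, while the median stays at the middle observation.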
The Mode
-
The Mode is the most frequent value.
-
The Mode is not often used in statistics
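A minimal sketch of finding the mode by counting. Because a data set can have more than one most frequent value, the hypothetical helper `modes` returns a list; the data are made up for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency.

    values must be non-empty.
    """
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

# Hypothetical data
print(modes([171, 173, 173, 175, 178]))   # [173]
print(modes([171, 171, 173, 173, 178]))   # [171, 173]
```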
Demo on
Mean, Median, Mode, and Variability
Variability
-
Central tendency alone leaves out a lot
-
How representative of the distribution is the central tendency?
-
Ways of describing the variability
-
Quartiles
-
Variance and Standard Deviation
Percentiles
-
The Pth percentile is the observation such that P% of the observations
are below it
-
The 25th percentile is often called the first quartile (Q1)
-
The first quartile is the median of the observations from
the smallest to the median
-
The median is the 50th percentile
-
The 75th percentile is the third quartile (Q3)
-
The third quartile is the median of the observations from
the median to the largest
-
The interquartile range (IQR) is the size of the interval between
Q1 and Q3: IQR = Q3 - Q1.
-
A measure of the variability of the distribution
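The median-of-halves rule for Q1 and Q3 can be sketched as follows. One detail is an assumption here: when N is odd, the median itself is left out of both halves (conventions differ, and software packages may return slightly different quartiles). The data are hypothetical.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule.

    Assumption: when N is odd, the median is excluded from both halves.
    Needs at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Hypothetical data
data = [161, 168, 171, 172, 173, 175, 178, 180, 201]
q1, m, q3 = quartiles(data)
print(q1, m, q3, q3 - q1)   # 169.5 173 179.0 9.5
```

The last number printed is the interquartile range, Q3 - Q1.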
The Five Number Summary
-
A distribution can be summarized by five numbers
-
Minimum, Q1, M, Q3, Maximum
-
161, 171, 173, 178, 201
-
What does this five number summary say about this distribution?
The Boxplot
-
The five-number summary can be graphed as a boxplot
This is a modified boxplot. The outlier is shown
separately.
The boxplot gives only limited information about the shape of the
distribution (it cannot show multiple peaks, for example).
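The slide does not say how the outlier was identified. A common convention, assumed here, is to draw separately any point lying more than 1.5 x IQR beyond the quartiles. Applied to the five-number summary quoted earlier (161, 171, 173, 178, 201), a sketch looks like this; `outlier_fences` is a hypothetical helper name.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences; points outside them are drawn separately.

    k = 1.5 is the common convention and is an assumption here.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Five-number summary from the slide
minimum, q1, m, q3, maximum = 161, 171, 173, 178, 201
low, high = outlier_fences(q1, q3)

print(low, high)        # 160.5 188.5
print(maximum > high)   # True: 201 is flagged as an outlier
print(minimum < low)    # False: 161 is inside the fences
```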
The standard deviation
-
The most common measure of the variability
-
Not the most resistant measure
-
Has some nice mathematical properties.
-
How do you find the standard deviation?
-
The mean is the point where the sum of the deviations is
zero.
-
Variance = $s^2 = \frac{\sum_{i=1}^{N} (x_i - \text{Mean})^2}{N - 1}$
-
Divide by N-1 rather than N, because the deviations are measured from the mean of the same data
-
The Variance has N-1 degrees of freedom
-
The variance is in squared units
-
Standard Deviation = $s = \sqrt{\text{Variance}}$
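The steps above (deviations from the mean, squared, summed, divided by N-1, then a square root) can be sketched directly. The data are hypothetical; the result is compared with `statistics.variance` from the Python standard library, which also divides by N-1.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # these sum to zero
    return sum(d * d for d in deviations) / (n - 1)

# Hypothetical data
data = [170, 172, 173, 175, 180]
var = sample_variance(data)
sd = math.sqrt(var)

print(var, round(sd, 2))                               # 14.5 3.81
print(math.isclose(var, statistics.variance(data)))    # True
```

Note that the variance is in squared units, while the standard deviation is back in the units of the data.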
The standard deviation
-
Only use the standard deviation when the mean is used as
a measure of central tendency.
-
Variance = 51.44
-
Standard Deviation = 7.17
The effects of linear transformations
-
A measurement can be converted to a new unit by multiplying it by some number
and adding a value: new = a + b * old
-
This is a linear transformation
-
Measures of central tendency
-
Measures of variability
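A short sketch of how a linear transformation new = a + b * old affects these summaries: the mean and median are transformed like any single observation, while the standard deviation is multiplied by |b| and is unaffected by a. The temperatures and the Celsius-to-Fahrenheit conversion (a = 32, b = 1.8) are an illustrative choice, not the lecture's data.

```python
import math
import statistics

def summarize(values):
    """Mean, median, and sample standard deviation of a list."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

# Hypothetical temperatures in degrees Celsius
celsius = [10.0, 12.0, 15.0, 18.0, 25.0]
a, b = 32.0, 1.8
fahrenheit = [a + b * c for c in celsius]

mean_c, med_c, sd_c = summarize(celsius)
mean_f, med_f, sd_f = summarize(fahrenheit)

print(math.isclose(mean_f, a + b * mean_c))   # True: the mean shifts and scales
print(math.isclose(med_f, a + b * med_c))     # True: the median shifts and scales
print(math.isclose(sd_f, abs(b) * sd_c))      # True: the SD only scales
```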
Summary
-
Measures of central tendency
-
Mean (not resistant)
-
Median (resistant)
-
Mode (resistant)
-
Measures of variability
-
Interquartile range (resistant)
-
Can be graphed as a boxplot
-
Standard deviation (not resistant)
-
Must be used with the mean.