\setcounter{ExampleCounter}{1}
Knowing where a data set is centered is useful, but we would also like to measure how spread out it is.
The simplest measure of spread is the range of a data set:
\[\textrm{Range } = \textrm{ Largest data value } - \textrm{ Smallest data value}\]
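For a quick illustration, consider two small data sets made up for this purpose (they do not appear elsewhere in this section): $1, 5, 5, 5, 9$ and $1, 1, 5, 9, 9$. Both have the same range,
\[\textrm{Range } = 9 - 1 = 8,\]
even though the values in the second set sit farther from the center on the whole. The range only looks at the two extreme values, so it cannot tell these data sets apart.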
There's a better measure of spread, though: the \textbf{standard deviation}.\\
\begin{proc}{Standard Deviation}
The standard deviation is essentially the average distance of the data points from the mean, the center.\\
For a sample of size $n$, \[s = \sqrt{\dfrac{\sum (x-\overline{x})^2}{n-1}}\]
For a population of size $N$, \[\sigma = \sqrt{\dfrac{\sum (x-\mu)^2}{N}}\]
\end{proc}
Why is it so complicated? Why can't we just find the distances from the mean ($x-\overline{x}$, called the \textbf{deviations}) and average them? The problem is, if we add up all the deviations, the sum will always equal 0, so taking the average won't work.
That's why we square the deviations, then average\footnote{Notice the $n-1$ in the denominator, which means this isn't quite an average. The reason for the $n-1$ is somewhat complicated, but basically, it's there so that the sample variance $s^2$ is an \emph{unbiased estimator} of the population variance $\sigma^2$.} them, and then take the square root to get back to the original units.
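
To see why the deviations always add up to 0, recall that the sample mean is $\overline{x} = \frac{1}{n}\sum x$, so $\sum x = n\overline{x}$. A short derivation then gives
\[\sum (x-\overline{x}) = \sum x - n\overline{x} = n\overline{x} - n\overline{x} = 0.\]
For instance, for the made-up data set $2, 4, 9$ (mean $5$), the deviations are $-3$, $-1$, and $4$, which add up to $0$.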
\vfill
\pagebreak
\begin{example}{Standard Deviation}
The ages of ten fifth-grade students are given below.
\begin{center}
\begin{tabular}{c c c c c}
11 & 10 & 9.5 & 11 & 11.5\\
10.5 & 10 & 11 & 10 & 9.5
\end{tabular}
\end{center}
Find the standard deviation of this data set.
The mean is $\overline{x} = \dfrac{104}{10} = 10.4$; the table lists each deviation from the mean and its square.
\begin{center}
\begin{tabular}{| l | l | l |}
\hline
\textbf{Data} & \textbf{Deviations} & \textbf{Sq. Deviations}\\
\hline
$x$ & $x-\overline{x}$ & $(x-\overline{x})^2$\\
\hline
11 & $0.6$ & 0.36\\
10 & $-0.4$ & 0.16\\
9.5 & $-0.9$ & 0.81\\
11 & $0.6$ & 0.36\\
11.5 & $1.1$ & 1.21\\
10.5 & $0.1$ & 0.01\\
10 & $-0.4$ & 0.16\\
11 & $0.6$ & 0.36\\
10 & $-0.4$ & 0.16\\
9.5 & $-0.9$ & 0.81\\
\hline
\end{tabular}
\end{center}
The sum of the squared deviations is 4.4; divide this by $n-1 = 9$ and take the square root: \[\dfrac{4.4}{9} \approx 0.4889 \longrightarrow s = \sqrt{0.4889} \approx 0.6992\]
\end{example}
Of course, we don't do this process in practice; we just use \verb|1-Var Stats| on the calculator.\\
\begin{proc}{$S_x$ or $\sigma_x$?}
There are two standard deviations listed in \verb|1-Var Stats|: $S_x$ and $\sigma_x$.
\begin{center}
\includegraphics[width=2.5in]{Calc1VarStats}
\end{center}
The difference between them is that $S_x$ is the sample standard deviation, and $\sigma_x$ is the population standard deviation.
Which one to use depends on whether the data set in question is the entire population of interest, or a sample from that population.
\end{proc}
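The two values are built from the same sum of squared deviations and differ only in the denominator, so one can be converted to the other. A short derivation, writing $n$ for the number of data values entered:
\[\sigma_x = \sqrt{\dfrac{\sum (x-\overline{x})^2}{n}} = \sqrt{\dfrac{n-1}{n}} \cdot \sqrt{\dfrac{\sum (x-\overline{x})^2}{n-1}} = \sqrt{\dfrac{n-1}{n}} \cdot S_x\]
So $\sigma_x$ is never larger than $S_x$, and the two values get closer together as $n$ grows.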
\vfill
\pagebreak
\subsection{z-scores}
\begin{example}{Using the Standard Deviation}
On a baseball team, the ages of the players are as follows:
\begin{center}
\begin{tabular}{c c c c c}
21 & 21 & 22 & 23 & 24\\
24 & 25 & 25 & 28 & 29\\
29 & 31 & 32 & 33 & 33\\
34 & 35 & 36 & 36 & 36\\
36 & 38 & 38 & 38 & 40
\end{tabular}
\end{center}
\begin{enumerate}
\item Find the mean and standard deviation.
\paragraph{Mean:} $\overline{x} = 30.68$
\paragraph{Standard Deviation:} $\sigma_x = 5.97$\\
(notice that we use the population standard deviation, since this is the whole team)
\item Find the value that is one standard deviation below the mean.
\[30.68 - 5.97 = 24.71\]
The 24- and 25-year-olds are about one standard deviation below the mean.
\end{enumerate}
\end{example}
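As a check on the calculator output in the example above, the population formula can be worked by hand. The intermediate sums below were computed from the 25 ages listed in the example:
\[\sum x = 767, \qquad \overline{x} = \dfrac{767}{25} = 30.68, \qquad \sum (x-\overline{x})^2 = 891.44\]
\[\sigma_x = \sqrt{\dfrac{891.44}{25}} = \sqrt{35.6576} \approx 5.97\]
Had the team been treated as a sample instead, the denominator would be $24$, giving $S_x = \sqrt{891.44/24} \approx 6.09$.
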
In this example, the standard deviation gives us a measure of position that is more powerful than it seems at the moment: the $z$-score.
\vfill
\pagebreak
The $z$-score is the number of standard deviations that a particular data point falls above or below the mean.
\begin{example}{z-scores}
In the baseball team age data set, find the $z$-scores that correspond to the following ages:
\begin{enumerate}[(a)]
\item 26
This data point differs from the mean by \[26-30.68 = -4.68,\] so it is 4.68 years below the mean, which corresponds to \[-\dfrac{4.68}{5.97} = -0.78\] standard deviations: \[z = -0.78\]
\item 32
Do both steps in one: \[z=\dfrac{32-30.68}{5.97} = 0.22\]
\end{enumerate}
\end{example}
\begin{proc}{z-score}
The $z$-score is the number of standard deviations that a particular data point falls above or below the mean.\\
To find the $z$-score for a particular data point, subtract the mean and divide the answer by the standard deviation:
\[z = \dfrac{x-\overline{x}}{s}\]
For a population, use $\mu$ and $\sigma$ in place of $\overline{x}$ and $s$.
\end{proc}
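The formula can also be run in reverse: solving for $x$ gives the data value that sits a given number of standard deviations from the mean. As a sketch, take the baseball team's mean ($30.68$) and standard deviation ($5.97$) in the roles of $\overline{x}$ and $s$, with $z = -1$:
\[x = \overline{x} + z \cdot s \qquad\longrightarrow\qquad x = 30.68 + (-1)(5.97) = 24.71\]
This matches the value found earlier for one standard deviation below the mean.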
\vfill
\pagebreak
What good are $z$-scores? The first application is in comparing data points in different data sets.
\begin{example}{Comparing Test Scores}
Suppose scores on the SAT and ACT are normally distributed with the following means and standard deviations:
\begin{center}
\begin{tabular}{l l l}
Test & Mean & Std. Deviation\\
\hline
SAT & 500 & 100\\
ACT & 18 & 6
\end{tabular}
\end{center}
You score 550 on the SAT and 24 on the ACT. On which test did you have a better score, relative to everyone else who took the test?\\
The $z$-scores for each test score are
\[z_{SAT} = \dfrac{550-500}{100} = 0.5 \hspace{0.5in} z_{ACT} = \dfrac{24-18}{6} = 1\]
Since the ACT score is a whole standard deviation above the mean, and the SAT score is only half a standard deviation above the mean, the ACT score is relatively better.
\end{example}
\subsection{Empirical Rule}
Another application is the Empirical Rule. This rule applies to data sets for which the histogram is \textbf{symmetric} and \textbf{bell-shaped}:
\begin{center}
\includegraphics[width=3in]{NormHistogram_Kierano}
\end{center}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
no markers, domain=-4:4, samples=100,
axis lines*=none,
hide y axis,
every axis y label/.style={at=(current axis.above origin),anchor=south},
every axis x label/.style={at=(current axis.right of origin),anchor=west},
height=5cm, width=12cm,
xtick=\empty, ytick=\empty,
enlargelimits=false, clip=false, %axis on top,
%grid = major
]
\addplot [very thick,cyan!50!black] {gauss(0,1)};
\end{axis}
\end{tikzpicture}
\end{center}
\begin{proc}{The Empirical Rule}
\begin{itemize}
\item Approximately 68\% of the data is within \textbf{one} standard deviation of the mean.
\item Approximately 95\% of the data is within \textbf{two} standard deviations of the mean.
\item Approximately 99.7\% of the data is within \textbf{three} standard deviations of the mean.
\end{itemize}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
no markers, domain=-4:4, samples=100,
axis lines*=none, xlabel=$x$,
hide y axis,
every axis y label/.style={at=(current axis.above origin),anchor=south},
every axis x label/.style={at=(current axis.right of origin),anchor=west},
height=5cm, width=12cm,
xtick={-3,-2,-1,0,1,2,3}, ytick=\empty,
xticklabels={$\mu-3\sigma$,$\mu-2\sigma$,$\mu-\sigma$,$\mu$,$\mu+\sigma$,$\mu+2\sigma$,$\mu+3\sigma$},
enlargelimits=false, clip=false, %axis on top,
grid = major
]
\addplot [fill=cyan!20, draw=none, domain=-1:1] {gauss(0,1)} \closedcycle;
\addplot [fill=yellow!20, draw=none, domain=-2:-1] {gauss(0,1)} \closedcycle;
\addplot [fill=yellow!20, draw=none, domain=1:2] {gauss(0,1)} \closedcycle;
\addplot [fill=green!20, draw=none, domain=-3:-2] {gauss(0,1)} \closedcycle;
\addplot [fill=green!20, draw=none, domain=2:3] {gauss(0,1)} \closedcycle;
\addplot [very thick,cyan!50!black] {gauss(0,1)};
\draw [yshift=2.5cm, latex-latex](axis cs:-1,0) -- node [fill=white] {68\%} (axis cs:1,0);
\draw [yshift=1.5cm, latex-latex](axis cs:-2,0) -- node [fill=white] {95\%} (axis cs:2,0);
\draw [yshift=0.5cm, latex-latex](axis cs:-3,0) -- node [fill=white] {99.7\%} (axis cs:3,0);
\end{axis}
\end{tikzpicture}
\end{center}
Note that this diagram uses $\mu$ for the population mean (as opposed to $\overline{x}$ for the sample mean) and $\sigma$ for the population standard deviation (as opposed to $s$ for the sample standard deviation).
\end{proc}
This will come back later when we study the \textbf{Normal Distribution}.
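
As an illustration of the Empirical Rule, take the SAT figures from the test score example (mean $500$, standard deviation $100$) and assume, as that example does, that the scores are bell-shaped. The three intervals become
\[\begin{array}{l l l}
68\%: & 500 \pm 1(100) & \longrightarrow 400 \textrm{ to } 600\\
95\%: & 500 \pm 2(100) & \longrightarrow 300 \textrm{ to } 700\\
99.7\%: & 500 \pm 3(100) & \longrightarrow 200 \textrm{ to } 800
\end{array}\]
Because the curve is symmetric, each percentage splits evenly between the two sides of the mean: about $34\%$ of scores fall between $500$ and $600$, and about $\frac{95\% - 68\%}{2} = 13.5\%$ fall between $600$ and $700$.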
\subsection{Chebyshev's Rule}
Chebyshev's Rule applies to any data set, whether or not it is symmetric and bell-shaped.\\
\begin{proc}{Chebyshev's Rule}
\begin{itemize}
\item At least 75\% of the data is within \textbf{two} standard deviations of the mean.
\item At least 89\% of the data is within \textbf{three} standard deviations of the mean.
\end{itemize}
\end{proc}
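Both percentages come from the general form of the rule, which is not stated in the box above: for any $k > 1$, at least $1 - \frac{1}{k^2}$ of the data lies within $k$ standard deviations of the mean.
\[k=2: \quad 1 - \dfrac{1}{2^2} = \dfrac{3}{4} = 75\% \hspace{0.5in} k=3: \quad 1 - \dfrac{1}{3^2} = \dfrac{8}{9} \approx 89\%\]
As a check against the baseball team data, two standard deviations on either side of the mean runs from $30.68 - 2(5.97) = 18.74$ to $30.68 + 2(5.97) = 42.62$. All 25 ages (21 through 40) fall in that interval, which is consistent with the guaranteed minimum of 75\%.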