Statistical inference is concerned with making decisions or predictions about parameters — the numerical measures that characterize a population. Three parameters you encountered in earlier chapters are the population mean $\mu$, the population standard deviation $\sigma$, and the binomial proportion, $p$.
To estimate the value of a population parameter, you can use the information from the sample in the form of an estimator. Estimators are calculated using information from the sample observations, and so estimators are themselves statistics.
For a statistical point estimator, the sampling distribution of the estimator provides information about the best estimator. Two characteristics are valuable in a point estimator. First, the sampling distribution of the point estimator should be centered over the true value of the parameter to be estimated. That is, the estimator should not consistently underestimate or overestimate the parameter of interest. Such an estimator is said to be unbiased.
The sampling distributions for an unbiased estimator and a biased estimator are shown in the figure. The sampling distribution for the biased estimator is shifted to the right of the true value of the parameter. This biased estimator is more likely than an unbiased one to overestimate the value of the parameter.
The second desirable characteristic of an estimator is that the spread (as measured by the variance) of the sampling distribution should be as small as possible. This ensures that, with a high probability, an individual estimate will fall close to the true value of the parameter. The sampling distributions for two unbiased estimators, one with a small variance and the other with a larger variance, are shown in the figure below. Naturally, you would prefer the estimator with the smaller variance because the estimates tend to lie closer to the true value of the parameter than in the distribution with the larger variance.
In real-life sampling situations, you may know that the sampling distribution of an estimator centers about the parameter that you are attempting to estimate, but all you have is the estimate computed from the n measurements contained in the sample. How far from the true value of the parameter will your estimate lie? How close is the marksman's bullet to the bull's-eye? The distance between the estimate and the true value of the parameter is called the error of estimation.
For the work we do, you may assume that the sample sizes are always large and, therefore, that the unbiased estimators you will study have sampling distributions that can be approximated by a normal distribution (because of the Central Limit Theorem). Remember that, for any point estimator with a normal distribution, the Empirical Rule states that approximately 95% of all the point estimates will lie within two (or more exactly, 1.96) standard deviations of the mean of that distribution. For unbiased estimators, this implies that the difference between the point estimator and the true value of the parameter will be less than 1.96 standard deviations or 1.96 standard errors (SE), and this quantity, called the margin of error (E), provides a practical upper bound for the error of estimation (see the figure below). It is possible that the error of estimation will exceed this margin of error (E), but that is very unlikely.
You should notice that in calculating the standard errors for these two point estimates, you needed to estimate $\sigma$ with $s$, $p$ with $\hat{p}$ and $q$ with $\hat{q}$. These approximate standard errors will differ only slightly from the true value of SE when the sample size $n$ is large, and they will have little effect on the margin of error. In fact the table below shows that, for most values of $p$ — especially when $p$ is between 0.3 and 0.7 — there is very little change in $\sqrt{pq}$, the numerator of SE, as $p$ changes.
$p$ | $pq$ | $\sqrt{pq}$ |
0.1 | 0.09 | 0.30 |
0.2 | 0.16 | 0.40 |
0.3 | 0.21 | 0.46 |
0.4 | 0.24 | 0.49 |
0.5 | 0.25 | 0.50 |
0.6 | 0.24 | 0.49 |
0.7 | 0.21 | 0.46 |
0.8 | 0.16 | 0.40 |
0.9 | 0.09 | 0.30 |
Example: A marketing analyst wants to estimate the average amount spent by a dating site customer per year. A random sample of $n=50$ dating site customers were polled about the amount they spend each year on dating websites. The results of the poll produced a mean amount of $\$240$ with a standard deviation of $\$20$. Use this information to estimate the population mean amount spent by a dating website customer per year.
Solution: The random variable is the amount spent by a dating site customer per year. The point estimate of $\mu$ is $\bar{x}=\$240$. The margin of error is \[ 1.96\cdot SE = 1.96\cdot\sigma_{\bar{x}}=1.96\cdot\frac{\sigma}{\sqrt{n}}=1.96\cdot\frac{\sigma}{\sqrt{50}} \] Since the sample size is large (greater than 30) the analyst can approximate the value of $\sigma$ with $s$. Therefore, the margin of error is approximately \[ 1.96\cdot\frac{s}{\sqrt{n}}=1.96\cdot\frac{20}{\sqrt{50}}\doteq\$5.54 \] The analyst can feel fairly confident that the sample estimate of $\$240$ is within $\$5.54$ of the population mean.
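The arithmetic in this solution can be checked with a few lines of Python. This is only a sketch: the function name `margin_of_error_mean` is made up for illustration, and it assumes the large-sample approximation in which $s$ stands in for $\sigma$.

```python
import math

def margin_of_error_mean(s, n, z=1.96):
    """Approximate margin of error z * s / sqrt(n) for a large-sample mean.

    s is the sample standard deviation used in place of sigma,
    n is the sample size, and z defaults to the 95% value 1.96.
    """
    return z * s / math.sqrt(n)

# Dating-site example: s = 20 dollars, n = 50 customers
E = margin_of_error_mean(20, 50)
print(round(E, 2))
```

The printed value should agree with the margin of error of about $\$5.54$ found above.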
Example:
In early April 2014, a major security flaw affecting perhaps 500,000 or more websites was announced and fixed. But the patch to the "secure socket" program that is supposed to encrypt and protect user information on secure websites was only made after more than two years of vulnerability on some of the most heavily trafficked sites, including Facebook, Google, YouTube, Yahoo and Wikipedia. Analysts warned that untold numbers of internet users might have had key personal information compromised either in their use of those websites, or their use of email, instant messaging, and even supposedly secure virtual personal networks.
The software bug was named "Heartbleed" and it was accidentally introduced to the OpenSSL encryption program on New Year’s Eve 2011. Some security commentators called Heartbleed "catastrophic" and said it was one of the worst vulnerabilities ever discovered on the web. Shortly after the bug, a researcher was interested in determining the percentage of people who changed their passwords or cancelled accounts. A random sample of 1501 internet users were polled and 39% of those polled indicated they had changed their passwords or cancelled accounts since the announcement of "Heartbleed." Use this information to estimate the true population proportion of adults who changed their passwords or cancelled accounts since the announcement of "Heartbleed," and find the margin of error for the estimate.
Solution: The best estimator of the population proportion $p$ is the sample proportion $\hat{p}$, which for this sample is $\hat{p}=0.39$. In order to find the margin of error you can approximate the value of $p$ with its estimate $\hat{p}=0.39=39\%$. Then: \[ 1.96\cdot SE = 1.96\cdot\sqrt{\frac{pq}{n}}\approx1.96\cdot\sqrt{\frac{\hat{p}\hat{q}}{n}}=1.96\cdot\sqrt{\frac{0.39\cdot0.61}{1501}}\doteq0.025=2.5\% \] With this margin of error you can be fairly confident that the estimate of 39% is within $\pm2.5\%$ of the true value of $p$. You can conclude the true value of $p$ could be as low as 36.5% or as high as 41.5%. The margin of error is quite small and reflects the fact that large sample sizes are required to achieve a small margin of error.
As you saw in the table above the margin of error using the estimator $\hat{p}$ is a maximum when $p=0.5$. Some pollsters routinely use the maximum margin of error when estimating $p$, in which case they calculate \[ 1.96\cdot SE = 1.96\cdot\sqrt{\frac{0.5\cdot0.5}{n}} \] or sometimes \[ 2SE = 2\cdot\sqrt{\frac{0.5\cdot0.5}{n}} \] Gallup, Harris and Roper polls generally use sample sizes of approximately 1000, so their margin of error is \[ 1.96\cdot SE = 1.96\cdot\sqrt{\frac{0.5\cdot0.5}{1000}}=1.96(0.016)=0.031 \] or approximately 3%. In this case the estimate is said to be within $\pm3$ percentage points of the true population proportion.
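The two proportion calculations above (the Heartbleed poll and the "maximum margin of error" used by pollsters) can be reproduced with a short Python sketch. The function name `margin_of_error_proportion` is hypothetical, and the grid search at the end is just an illustrative check that $\hat{p}=0.5$ gives the largest margin for a fixed $n$.

```python
import math

def margin_of_error_proportion(p_hat, n, z=1.96):
    """Approximate margin of error z * sqrt(p_hat * q_hat / n)."""
    q_hat = 1 - p_hat
    return z * math.sqrt(p_hat * q_hat / n)

# Heartbleed poll: p_hat = 0.39, n = 1501
heartbleed = margin_of_error_proportion(0.39, 1501)

# Pollster's maximum margin of error: p_hat = 0.5, n = 1000
poll_max = margin_of_error_proportion(0.5, 1000)

print(round(heartbleed, 3), round(poll_max, 3))

# For a fixed n, search p_hat = 0.01, 0.02, ..., 0.99 for the largest margin.
grid = [i / 100 for i in range(1, 100)]
largest = max(grid, key=lambda p: margin_of_error_proportion(p, 1000))
print(largest)
```

The first two values should round to 0.025 and 0.031, matching the hand calculations.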
Each time you draw a sample and construct a confidence interval for a parameter, you hope to include the parameter in your interval, but sometimes you miss. Your "success rate" (the percentage of intervals that include the parameter in repeated sampling) is the confidence level.
You may want to change the level of confidence from $c = 0.95$ to another confidence level $c$. When you change $c$ to something other than 95%, a value different from $z = 1.96$ will need to be used to find the margin of error. You will need to change the value of $z = 1.96$, which locates an area 0.95 in the center of the standard normal curve, to a value of $z$ that locates the area $c$ in the center of the curve, as shown in the figure below. Since the total area under the curve is 1, the remaining area in the two tails is $1-c$, and each tail contains area $\dfrac{1}{2}(1-c)$. Then $c$ is the proportion of the area under the normal curve between $-z_{c}$ and $z_{c}$.
Confidence Level $c$ | $1-c$ | $z_{c}$ |
0.90 | 0.10 | 1.645 |
0.95 | 0.05 | 1.96 |
0.98 | 0.02 | 2.33 |
0.99 | 0.01 | 2.58 |
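If you have Python available instead of a table, the critical values can be recomputed with the standard library. This sketch relies on the fact that the area to the left of $z_c$ is $c+\frac{1}{2}(1-c)=\frac{1}{2}(1+c)$; the helper name `z_critical` is an assumption made for illustration.

```python
from statistics import NormalDist

def z_critical(c):
    """z value with area c between -z and z under the standard normal curve."""
    # Area to the left of z_c is c plus one tail: (1 + c) / 2.
    return NormalDist().inv_cdf((1 + c) / 2)

for c in (0.90, 0.95, 0.98, 0.99):
    print(c, round(1 - c, 2), round(z_critical(c), 3))
```

Rounded to two decimal places, the last two values agree with the table entries 2.33 and 2.58.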
$\bar{x}-E$ < $\mu$ < $\bar{x}+E$
where \[ E= z_{c}\cdot\frac{\sigma}{\sqrt{n}} \] Here $z_{c}$ is the $z$ value corresponding to an area of $\dfrac{1}{2}(1-c)$ in the right tail of a standard normal z distribution, and \[ \eqalign{ n &= \text{sample size}\cr \sigma &= \text{standard deviation of the sampled population}\cr c&= \text{the confidence level} } \] If $\sigma$ is unknown, it can be approximated by the sample standard deviation $s$ when the sample size is large $(n\geq30)$ and the approximate margin of error is \[ E= z_{c}\cdot\frac{s}{\sqrt{n}} \]
Here is where the formula comes from: the Central Limit Theorem tells us the sampling distribution of sample means is bell-shaped. Then the Empirical Rule tells us that 95% of sample means will be within 1.96 standard errors of $\mu$, or that \[ P\biggr( \mu - 1.96\frac{\sigma}{\sqrt{n}} < \bar{x} < \mu + 1.96\frac{\sigma}{\sqrt{n}} \biggr) = 0.95 \] If you change the value of $z=1.96$ to $z_{c}$, then \[ P\biggr( \mu - z_{c}\frac{\sigma}{\sqrt{n}} < \bar{x} < \mu+ z_{c}\frac{\sigma}{\sqrt{n}} \biggr) = c \] This last probability statement says that a randomly selected $\bar{x}$ will be within $z_{c}$ standard errors of the population mean with probability $c$. You can rewrite this inequality as \[ -z_{c}\frac{\sigma}{\sqrt{n}} < \bar{x}-\mu < z_{c}\frac{\sigma}{\sqrt{n}} \] if you subtract $\mu$ from all three parts of the inequality. Afterwards, we subtract $\bar{x}$ from all three parts of the inequality to get \[ -\bar{x}- z_{c}\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{x}+ z_{c}\frac{\sigma}{\sqrt{n}} \] Then we multiply the inequality by $(-1)$, which changes the direction of the inequality symbols: \[ \bar{x} + z_{c}\frac{\sigma}{\sqrt{n}} > \mu > \bar{x}- z_{c}\frac{\sigma}{\sqrt{n}} \]
Now we can reverse the inequality direction using the symmetry property of inequalities ( i.e., since $a>b>c$ is equivalent to $c$ < $b$ < $a$) and write \[ \bar{x} - z_{c}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x}+ z_{c}\frac{\sigma}{\sqrt{n}} \] so that \[ P\biggr( \bar{x} - z_{c}\frac{\sigma}{\sqrt{n}} < \mu < \bar{x}+ z_{c}\frac{\sigma}{\sqrt{n}} \biggr)=c \] Both $\bar{x} - z_{c}\frac{\sigma}{\sqrt{n}}$ and $\bar{x}+ z_{c}\frac{\sigma}{\sqrt{n}}$, the lower and upper confidence limits, are actually random quantities that depend on the sample mean $\bar{x}$. Therefore, in repeated sampling, the random interval, $\bar{x}\pm z_{c}\frac{\sigma}{\sqrt{n}}$, will contain the population mean $\mu$ with probability $c$.
Example: Suppose that for a random sample of 50 computers at a certain electronics store, the mean repair cost was $\$267$. Assume the population standard deviation was $\$26$. Construct a 95% confidence interval estimate for the population mean repair cost.
Solution: The random variable, $x$, represents the cost, in dollars, of a computer repair. The point estimate of $\mu$ is $\bar{x}=\$267$. The margin of error is \[ E=1.96\cdot SE = 1.96\sigma_{\bar{x}}= 1.96\frac{\sigma}{\sqrt{n}}=1.96\frac{26}{\sqrt{50}}\doteq\$7.21 \] The approximate 95% confidence interval is \[ \eqalign{ \bar{x} & \pm \ E\cr \bar{x} & \pm \ z_{c}\biggr( \frac{\sigma}{\sqrt{n}} \biggr)\cr \$267 & \pm \ 1.96\biggr( \frac{\$26}{\sqrt{50}} \biggr)\cr \$267 & \pm \ \$7.21 } \] The 95% confidence interval for $\mu$ is from $\$259.79$ to $\$274.21$.
Suppose that for a random sample of 120 TVs at a certain electronics store, the mean repair cost was $\$220$. Assume the population standard deviation was $\$16$. Construct a 95% confidence interval estimate for the population mean repair cost.
Solution: The random variable, $x$, represents the cost, in dollars, of a TV repair. The point estimate of $\mu$ is $\bar{x}=\$220$. The margin of error is \[ E=1.96\cdot SE = 1.96\sigma_{\bar{x}}= 1.96\frac{\sigma}{\sqrt{n}}=1.96\frac{16}{\sqrt{120}}\doteq\$2.86 \] The approximate 95% confidence interval is \[ \eqalign{ \bar{x} & \pm \ E\cr \bar{x} & \pm \ z_{c}\biggr( \frac{\sigma}{\sqrt{n}} \biggr)\cr \$220 & \pm \ 1.96\biggr( \frac{\$16}{\sqrt{120}} \biggr)\cr \$220 & \pm \ \$2.86 } \] The 95% confidence interval for $\mu$ is from $\$217.14$ to $\$222.86$.
What does it mean to say you are "95% confident" that the true value of the population mean $\mu$ is within a given interval? If you were to construct 20 such intervals, each using different sample information, your intervals might look like those shown in the figure below. Of the 20 intervals, you might expect that 95% of them, 19 out of 20, will perform as planned and contain $\mu$ within their upper and lower bounds. Remember that you cannot be absolutely sure that any one particular interval contains the mean $\mu$. You will never know whether your particular interval is one of the 19 that "worked," or whether it is the one interval that "missed." Your confidence in the estimated interval follows from the fact that when repeated intervals are calculated, 95% of these intervals will contain $\mu$.
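The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is not part of the original development: the population values ($\mu=100$, $\sigma=15$), the sample size, the number of trials, and the function name `coverage_rate` are all assumptions chosen only for illustration. It draws many samples from a normal population with known $\sigma$, builds the interval $\bar{x}\pm z_c\,\sigma/\sqrt{n}$ each time, and records how often the interval contains $\mu$.

```python
import math
import random
from statistics import NormalDist

def coverage_rate(mu, sigma, n, c, trials, seed=0):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf((1 + c) / 2)
    E = z * sigma / math.sqrt(n)       # margin of error (sigma known)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - E < mu < xbar + E:
            hits += 1
    return hits / trials

rate = coverage_rate(mu=100, sigma=15, n=40, c=0.95, trials=2000)
print(rate)
```

The printed fraction should land close to 0.95, although it will rarely equal 0.95 exactly, because the simulation itself is subject to sampling variation.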
A good confidence interval has two desirable characteristics:
Example: A researcher wants to estimate the average amount of time, in hours per day, that a U.S. teenager spends consuming media — watching TV, listening to music, surfing the Web, social networking, and playing video games. A random sample of $n=1000$ U.S. teenagers were polled about the amount of time they spend daily consuming media. The results of the poll produced a mean amount of $7.6$ hours with a standard deviation of 2.2 hours. Use this information to estimate the population mean. Use a 90% level of confidence. (Source: The Washington Post)
Solution: The random variable, $x$ represents the time, in hours per day, that a U.S. teenager spends consuming media. The point estimate of $\mu$ is $\bar{x}=7.6$. The critical value of $z_{c}$ can be found on the table of common values. (Alternatively, we could use the calculator, $z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=1.645$ with $c=0.90$.) The margin of error is \[ E=1.645\cdot SE = 1.645\cdot\sigma_{\bar{x}}= 1.645\cdot\frac{\sigma}{\sqrt{n}} \] Since the sample size is large (greater than 30) the researcher can approximate the value of $\sigma$ with $s$. Therefore, the margin of error is approximately \[ E=1.645\cdot SE \doteq 1.645\cdot\frac{s}{\sqrt{n}}=1.645\cdot\frac{2.2}{\sqrt{1000}}\approx0.114 \text{ hours} \] The approximate 90% confidence interval is \[ \eqalign{ \bar{x} & \pm \ E\cr 7.6 & \pm \ 0.114 } \] The 90% confidence interval for $\mu$ is from 7.49 hours to 7.71 hours per day.
Example: Find the critical value $z_{c}$ that must be used in the margin of error formula for a 92% confidence interval estimate for $\mu$.
Solution: Notice in the figure below that the area under the standard normal distribution, left of a vertical line at $z_c$ is \[ \eqalign{ \dfrac{1}{2}(1-c)+c & = \dfrac{1}{2} -\dfrac{1}{2}\cdot c+1\cdot c \cr & =\dfrac{1}{2} +\Bigg(-\dfrac{1}{2}c\Bigg)+1c \cr & = \dfrac{1}{2} +\Bigg(-\dfrac{1}{2}+1\Bigg)\cdot c \cr & = \dfrac{1}{2} + \dfrac{1}{2}\cdot c \cr & = \dfrac{1}{2} (1+c) } \] Use the "invnorm" command on the calculator \[ z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.92)\Bigg)\doteq1.75 \] Some of the TI calculators prompt you to enter values for $\mu$ and $\sigma$ when using the invnorm command. If that is the case, make sure you are using the values of $\mu$ and $\sigma$ that correspond to the standard normal distribution of z scores. That is, use $\mu=0$ and $\sigma=1$.
Find the critical value $z_{c}$ that must be used in the margin of error formula for an 88% confidence interval estimate for $\mu$.
Solution: Use the "invnorm" command on the calculator \[ z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.88)\Bigg)\doteq1.55 \] Some of the TI calculators prompt you to enter values for $\mu$ and $\sigma$ when using the invnorm command. If that is the case, make sure you are using the values of $\mu$ and $\sigma$ that correspond to the standard normal distribution of z scores. That is, use $\mu=0$ and $\sigma=1$.
Example: Find a 92% confidence interval estimate for $\mu$ assuming sample size $n$ was 200. Suppose $\sigma$ was known to be 33.3 from past history and that the sample mean, $\bar{x}$ was found to be 250.5.
Solution: The point estimate of $\mu$ is $\bar{x}=250.5$. The critical value of $z_{c}$ for the margin of error formula was found in the above example to be about 1.75. The margin of error is then \[ \eqalign{ E & =z_{c}\cdot SE \cr & = z_{c}\cdot\sigma_{\bar{x}} \cr & = z_{c}\cdot\frac{\sigma}{\sqrt{n}} \cr & = 1.75\cdot\frac{33.3}{\sqrt{200}} \cr & \doteq 4.1 } \] The approximate 92% confidence interval is \[ \eqalign{ \bar{x} & \pm \ E\cr 250.5 & \pm \ 4.1 } \] The 92% confidence interval for $\mu$ is from 246.4 to 254.6.
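The whole calculation (critical value, margin of error, and interval limits) can be wrapped in one small function. This is a sketch under the same assumptions as the example: $\sigma$ is known and the sampling distribution of $\bar{x}$ is approximately normal. The name `z_interval` is hypothetical, and the code uses the unrounded critical value rather than 1.75, so intermediate digits can differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, c):
    """Return (lower, upper) limits of the level-c z-interval for mu."""
    z = NormalDist().inv_cdf((1 + c) / 2)   # same role as invnorm((1 + c) / 2)
    E = z * sigma / math.sqrt(n)            # margin of error
    return xbar - E, xbar + E

low, high = z_interval(xbar=250.5, sigma=33.3, n=200, c=0.92)
print(round(low, 1), round(high, 1))
```

Rounded to one decimal place, the limits should match the interval from 246.4 to 254.6 found above.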
When we use technology to estimate the confidence interval the result is often expressed as an interval, such as (18.256, 23.744). The sample mean $\bar{x}$ is the value midway between those limits, and the margin of error E is one-half the difference between those limits (because the upper limit is $\bar{x}+E$ and the lower limit is $\bar{x}-E$, the distance separating them is 2E).
\[
\eqalign{
\text{Point estimate of } \mu : & \qquad \bar{x}=\dfrac{(\text{upper confidence limit})+(\text{lower confidence limit})}{2} \cr
& \cr
\text{Margin of Error: } & \qquad E=\dfrac{(\text{upper confidence limit})-(\text{lower confidence limit})}{2}
}
\]
Example: The calculator screen below displays the results from counting the number of different types of donuts in a sample of 100. Use the given confidence interval to find the point estimate $\bar{x}$ and the margin of error E.
Solution:
\[
\eqalign{
\bar{x} &=\dfrac{(\text{upper confidence limit})+(\text{lower confidence limit})}{2} \cr
&=\dfrac{23.744+18.256}{2}\cr
& = 21
}
\]
and
\[
\eqalign{
E &=\text{upper confidence limit} - \text{point estimate} \cr
&=(\bar{x}+E)-\bar{x}\cr
&= 23.744-21\cr
& = 2.744
}
\]
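The same two formulas are easy to apply in code when software reports only the interval limits. A minimal sketch, with the hypothetical helper name `estimate_and_margin`:

```python
def estimate_and_margin(lower, upper):
    """Recover the point estimate and margin of error from interval limits."""
    xbar = (upper + lower) / 2   # midpoint of the interval
    E = (upper - lower) / 2      # half the width of the interval
    return xbar, E

xbar, E = estimate_and_margin(18.256, 23.744)
print(round(xbar, 3), round(E, 3))
```

For the interval (18.256, 23.744) this gives the point estimate 21 and the margin of error 2.744, as in the example.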
Example: A researcher wants to estimate the average amount of time, in hours per day, that a U.S. teenager spends consuming media — watching TV, listening to music, surfing the Web, social networking, and playing video games. A random sample of $n=1000$ U.S. teenagers were polled about the amount of time they spend daily consuming media. The results of the poll produced a mean amount of $7.6$ hours with a standard deviation of 2.2 hours. Use this information to estimate the population mean. Use a 90% level of confidence. (Source: The Washington Post)
Calculator Solution:
Conclusion: The 90% confidence interval for $\mu$ is from 7.49 hours to 7.71 hours per day.
Find the 95% confidence interval for the mean using this sample.
45 52 35 22 62 34 42 46 53 58 36 40 43 16 23
54 27 32 24 53 62 67 84 36 44 49 57 35 25 30
The population standard deviation $\sigma$ is unknown. Since the sample size is $n \geq30$, you can use the sample standard deviation $s$ as an approximation for $\sigma$. After the data values are entered in L1 (step 1 above), press STAT, move the cursor to CALC, press 1 for 1-Var Stats, then press ENTER. The sample standard deviation of 15.50246551 will be one of the statistics listed. Then continue with step 2. At step 5 on the line for $\sigma$, press VARS for variables, press 5 for Statistics, press 3 for $S_x$.
Conclusion:
The 95% confidence interval for $\mu$ is from 37.3 to 48.4.
Notice that the output from the ZInterval command gives us the sample mean and sample standard deviation.
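For readers without a TI calculator, the same interval can be computed from the raw data in Python. This sketch mirrors the calculator steps: it computes $\bar{x}$ and $s$ from the 30 data values, uses $s$ in place of $\sigma$ (reasonable here because $n\geq30$), and applies the 95% critical value. The variable names are illustrative only.

```python
import math
from statistics import NormalDist, mean, stdev

data = [45, 52, 35, 22, 62, 34, 42, 46, 53, 58, 36, 40, 43, 16, 23,
        54, 27, 32, 24, 53, 62, 67, 84, 36, 44, 49, 57, 35, 25, 30]

n = len(data)
xbar = mean(data)
s = stdev(data)                        # sample standard deviation (divisor n - 1)
z = NormalDist().inv_cdf((1 + 0.95) / 2)
E = z * s / math.sqrt(n)               # approximate margin of error

low, high = xbar - E, xbar + E
print(n, round(s, 8), round(low, 1), round(high, 1))
```

The sample standard deviation should agree with the calculator value of about 15.5025, and the limits should round to 37.3 and 48.4.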
Example: A pizza shop owner wishes to find the 95% confidence interval estimate for the true mean cost of a large pepperoni pizza. How large should the sample be if she wishes to be accurate within $\$0.15$? A previous study showed that the standard deviation of the price was $\$0.26$.
Solution:
\[
\eqalign{
n & =\biggr(\frac{z_c\cdot \sigma}{E}\biggr)^2 \cr
& =\biggr(\frac{1.96\cdot\$0.26}{\$0.15}\biggr)^2 \cr
& = 11.54187378 \color{red}{\text{ always round up!}}\cr
& \doteq 12
}
\]
The minimum sample size required is 12 prices.
Question Determine the minimum sample size required when you want to be 99% confident that the sample mean is within two units of the population mean and $\sigma=1.4$. Assume the population is normally distributed.
Solution: We are given $\sigma= 1.4$, $E=2 $ and $c=0.99$. Then \[ z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.99)\Bigg)\doteq2.58 \] and \[ \eqalign{ n & =\biggr(\frac{z_c\cdot \sigma}{E}\biggr)^2 \cr & =\biggr(\frac{2.58\cdot1.4}{2}\biggr)^2 \cr & = 3.261636 \color{red}{\text{ always round up!}}\cr & \doteq 4 } \] The minimum sample size required is 4 measurements.
Question A beverage company uses a machine to fill one-liter bottles with water. Assume the population of volumes is normally distributed. The company wants to estimate the mean volume of water the machine is putting in the bottles within one milliliter (mL). Determine the minimum sample size required to construct a 96% confidence interval estimate for $\mu$. Assume the population standard deviation is 3 milliliters.
Solution: We are given $\sigma= 3 \text{ mL}$, $E=1 \text{ mL}$ and $c=0.96$. Then \[ z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.96)\Bigg)\doteq2.05 \] and \[ \eqalign{ n & =\biggr(\frac{z_c\cdot \sigma}{E}\biggr)^2 \cr & =\biggr(\frac{2.05\cdot3 \text{ mL}}{1 \text{ mL}}\biggr)^2 \cr & = 37.8225 \color{red}{\text{ always round up!}}\cr & \doteq 38 } \] The minimum sample size required is 38 one-liter bottles.
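The sample size formula for a mean, including the "always round up" step, can be sketched in Python as follows. The function name `sample_size_mean` is hypothetical. Note that the code uses the unrounded critical value, while the worked solutions round $z_c$ to two decimal places; for these three problems both choices lead to the same minimum sample size.

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, E, c):
    """Minimum n so that a level-c interval for mu has margin of error at most E."""
    z = NormalDist().inv_cdf((1 + c) / 2)
    return math.ceil((z * sigma / E) ** 2)   # always round up

pizza = sample_size_mean(sigma=0.26, E=0.15, c=0.95)
two_units = sample_size_mean(sigma=1.4, E=2, c=0.99)
bottles = sample_size_mean(sigma=3, E=1, c=0.96)
print(pizza, two_units, bottles)
```

The three results correspond to the pizza, two-unit, and bottle-filling problems above (12, 4, and 38).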
Many research experiments or sample surveys have as their objective the estimation of the proportion of people or objects in a large group that possess a certain characteristic. Here are some examples:
Each is a practical example of the binomial experiment, and the parameter to be estimated is the binomial proportion, $p$. When the sample size is large, \[ \hat{p}=\frac{x}{n}=\frac{\text{total number of successes}}{\text{total number of trials}} \] is the best point estimator for the population proportion $p$. Since its sampling distribution is approximately normal, with mean $p$ and standard error SE $= \sqrt{\frac{pq}{n}}$, $\hat{p}$ can be used to construct a confidence interval according to the general approach given here.
$\hat{p}-E$ < $p$ < $\hat{p}+E$
where $$ E = z_{c}\cdot\sqrt{\frac{\hat{p}\hat{q}}{n}} $$ and \[ z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg) \] Here $z_{c}$ is the $z$ value corresponding to an area $\dfrac{1}{2}(1-c)$ in the right tail of a standard normal z distribution. Since $p$ and $q$ are unknown, their values are estimated using the best point estimators: $\hat{p}$ and $\hat{q}$ (where $\hat{q}=1-\hat{p}$). The sample size is considered large when the normal approximation to the binomial distribution is adequate, that is, when both $$ np>5 \quad \text{and} \quad nq>5 $$ (since the values of $p$ and $q$ are unknown, use $\hat{p}$ and $\hat{q}$ and check that both $n\hat{p}>5$ and $n\hat{q}>5$).
Example: A reporter wants to estimate the percentage of residents in her city who would rate their city as an excellent place to live. The reporter wants to estimate each percentage by race or ethnicity. It is found that residents in her city are predominantly white, non-Hispanic. A random sample of 120 people is selected from among those who categorized their race/ethnicity as "white, non-Hispanic," and 55% indicated they would rate their city as an excellent place to live. Another random sample of 120 "Hispanic" individuals is drawn and suppose 40% indicated they would rate their city as an excellent place to live. Use this information to estimate the population proportion of Hispanic residents who would rate their city as an excellent place to live. Use a 99% level of confidence. (Source: journalism.org)
Solution: The random variable, $x$, represents the number of Hispanic residents among the 120 sampled who would rate their city as an excellent place to live. The point estimate of $p$ is $\hat{p}=0.40$. (There were $x=n\cdot\hat{p}=120\cdot0.40=48$ individual successes.) Also, notice that both $n\hat{p}=48>5$ and $n\hat{q}=72>5$, so the sampling distribution of $\hat{p}$ is approximately normal. Then, the standard error (the standard deviation of the sampling distribution) can be approximated as \[ \sqrt{\frac{pq}{n}}\approx \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{0.40\cdot0.60}{120}}\doteq0.0447213595 \] The value for $z_c$ in the margin of error formula is found with the calculator using $z_c= invnorm\Bigg( \frac{1}{2}(1+c)\Bigg)= invnorm\Bigg( \frac{1}{2}(1+0.99)\Bigg) \doteq 2.58$. The margin of error is then approximated as \[ E= z_{c}\cdot\sqrt{\frac{\hat{p}\hat{q}}{n}}=2.58\cdot0.0447213595\approx0.115 \] The approximate 99% confidence interval is \[ \eqalign{ \hat{p} & \pm \ E\cr 0.40 & \pm \ 0.115\cr } \] Then $0.40-0.115=0.285=28.5\%$ and $0.40+0.115=0.515=51.5\%.$
The 99% confidence interval for $p$ is from 28.5% to 51.5%.
A reporter wants to estimate the percentage of U.S. households that have internet. A random sample of 4000 U.S. households found 2880 households that had internet. Construct a 98% confidence interval estimate of the percentage of U.S. households that have internet.
Solution: The random variable, $x$ represents the number of U.S. households among the 4000 sampled who had internet. The point estimate of $p$ is $\hat{p}=\frac{x}{n}=\frac{2880}{4000}=0.72$. The standard error of the sampling distribution is approximated as \[ \sqrt{\frac{pq}{n}}\approx \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{0.72\cdot0.28}{4000}}\doteq0.0070992957 \] The value for $z_c= invnorm\Bigg( \frac{1}{2}(1+c)\Bigg)= invnorm\Bigg( \frac{1}{2}(1+0.98)\Bigg) \doteq 2.33$. The margin of error is then approximated as \[ E= z_{c}\cdot\sqrt{\frac{\hat{p}\hat{q}}{n}}=2.33\cdot0.0070992957\approx0.017 \] The approximate 98% confidence interval is \[ \eqalign{ \hat{p} & \pm \ E\cr 0.72 & \pm \ 0.017\cr } \] Then $0.72-0.017=0.703=70.3\%$ and $0.72+0.017=0.737=73.7\%.$
The 98% confidence interval for $p$ is from 70.3% to 73.7%.
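A large-sample confidence interval for a proportion can be sketched in Python along the same lines. The function name `proportion_interval` is hypothetical; the check on $n\hat{p}$ and $n\hat{q}$ follows the rule of thumb stated earlier, and the code uses the unrounded critical value rather than 2.33.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, c):
    """Return (lower, upper) limits of the level-c interval for p."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # Rule of thumb for the normal approximation to be adequate.
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf((1 + c) / 2)
    E = z * math.sqrt(p_hat * q_hat / n)   # approximate margin of error
    return p_hat - E, p_hat + E

# Internet example: 2880 of 4000 households, 98% confidence
low, high = proportion_interval(2880, 4000, 0.98)
print(round(low, 3), round(high, 3))
```

For the internet example the limits should round to 0.703 and 0.737, that is, 70.3% to 73.7%.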
Example: You wish to estimate, with 90% confidence, the population proportion of U.S. adults who are confident in the stability of the U.S. banking system. Your estimate must be accurate to within 3% of the population proportion. (a) Find the minimum sample size needed when no preliminary estimate is available. (b) Find the minimum sample size needed using a preliminary estimate of $\hat{p}=0.43$. (c) Compare the results from parts (a) and (b).
Solution (a): No preliminary or previous estimate is available for $\hat{p}$ and $\hat{q}$, so we assume $\hat{p}=0.50$ and $\hat{q}=0.50$. We are given $E=3\%=0.03$ and $c=0.90$. Then \[ z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.90)\Bigg)\doteq1.64 \] and \[ \eqalign{ n & =\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2 \cr & ={0.5}\cdot{0.5}\cdot\biggr(\frac{1.64}{0.03}\biggr)^2 \cr & = 747.111111111 \color{red}{\text{ always round up!}}\cr & \doteq 748 } \] The minimum sample size required is 748 U.S. adults.
Solution (b): A preliminary estimate for $\hat{p}$ was given as 0.43, so the estimate for $\hat{q}=1-\hat{p}=1-0.43=0.57$. We are given $E=3\%=0.03$ and $c=0.90$. Then \[ \eqalign{ n & =\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2 \cr & ={0.43}\cdot{0.57}\cdot\biggr(\frac{1.64}{0.03}\biggr)^2 \cr & = 732.4677333 \color{red}{\text{ always round up!}}\cr & \doteq 733 } \] The minimum sample size required is 733 U.S. adults.
Solution (c): Having an estimate of the population proportion reduces the minimum sample size needed.
Question Determine the minimum sample size required when you want to be 95% confident that the sample proportion is within two percentage points of the population proportion. Assume the population is normally distributed.
Solution: No preliminary or previous estimate is available for $\hat{p}$ and $\hat{q}$, so we assume $\hat{p}=0.50$ and $\hat{q}=0.50$. We are given $E=2\%=0.02 $ and $c=0.95$. Then \[ z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.95)\Bigg)\doteq1.96 \] and \[ \eqalign{ n & =\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2 \cr & ={0.50}\cdot{0.50}\cdot\biggr(\frac{1.96}{0.02}\biggr)^2 \cr & = 2401 } \] The minimum sample size required is 2401 measurements.
Question A recent report by Pew Research Center estimated that 68% of Americans have smartphones and 45% have tablet computers. How large a sample is needed to estimate the true proportion of Americans with smartphones to within 4% with 86% confidence?
Solution: A preliminary estimate for $\hat{p}$ was given as 0.68, so the estimate for $\hat{q}=1-\hat{p}=1-0.68=0.32$. We are given $E=4\%=0.04$ and $c=0.86$. Then \[ z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.86)\Bigg)\doteq1.48 \] and \[ \eqalign{ n & =\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2 \cr & ={0.68}\cdot{0.32}\cdot\biggr(\frac{1.48}{0.04}\biggr)^2 \cr & = 297.8944 \color{red}{\text{ always round up!}}\cr & \doteq 298 } \] The minimum sample size required is 298 Americans.
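The sample size formula for a proportion can be sketched in Python as well. The function name `sample_size_proportion` is hypothetical. Because the result is sensitive to how $z_c$ is rounded, this sketch takes the critical value as an argument and is called with the two-decimal values used in the worked solutions; an unrounded critical value can give a slightly larger $n$.

```python
import math

def sample_size_proportion(E, z, p_hat=0.5):
    """Minimum n for margin of error E, using p_hat = 0.5 when no estimate exists."""
    q_hat = 1 - p_hat
    return math.ceil(p_hat * q_hat * (z / E) ** 2)   # always round up

banking_no_estimate = sample_size_proportion(E=0.03, z=1.64)
banking_with_estimate = sample_size_proportion(E=0.03, z=1.64, p_hat=0.43)
smartphones = sample_size_proportion(E=0.04, z=1.48, p_hat=0.68)
print(banking_no_estimate, banking_with_estimate, smartphones)
```

The defaults reflect the conservative choice $\hat{p}=\hat{q}=0.5$, which maximizes $\hat{p}\hat{q}$ and therefore never underestimates the required sample size.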
Suppose you need to estimate the value of $\mu$ but it is impossible or impractical to collect a large sample. Then the estimation procedure outlined above is of no use. This section introduces the estimation procedure that can be used when the sample size is small. Small sample confidence intervals for binomial proportions will be omitted from our discussion.
In discussing the sampling distribution of $\bar{x}$, we made these points:
Unfortunately, when the sample size $n$ is small $(n \text{ less than 30})$, the statistic $(\bar{x} - \mu)/(s/\sqrt{n})$ does not have a normal distribution. Therefore, all the critical values of $z$ that you used before are no longer correct. For example, you cannot say that $\bar{x}$ will lie within 1.96 standard errors of $\mu$ 95% of the time. This problem is not new; it was studied by statisticians and experimenters in the early 1900s. To find the sampling distribution of this statistic, there are two ways to proceed:
The second approach was used by an Englishman named W. S. Gosset in 1908. He derived a complicated formula for the density function of
\[ t=\frac{\bar{x} - \mu}{s/\sqrt{n}} \] for random samples of size $n$ from a normal population, and he published his results under the pen name "Student." Ever since, the statistic has been known as Student's $t$. It has the following characteristics:
Gosset was a Guinness Brewery employee who needed a distribution that could be used with small samples taken from a normally distributed population. The Irish brewery where he worked did not allow the publication of research results, so Gosset published under the pseudonym "Student."
The divisor $(n - 1)$ in the formula for the sample variance $s^2$ is called the number of degrees of freedom (df) associated with $s^2$. It determines the shape of the $t$ distribution. The origin of the term degrees of freedom is theoretical and refers to the number of independent squared deviations in $s^2$ that are available for estimating $\sigma^2$. These degrees of freedom may change for different applications, and since they specify the correct $t$ distribution to use, you need to remember to calculate the correct degrees of freedom for each application.
The table of normal probabilities for the standard normal $z$ distribution is no longer useful in calculating critical values for the margin of error in your confidence interval formula. Instead, you will use the $t$-table (below). The body of the table lists critical values $t_{c}$. The first column gives the number of degrees of freedom, and the top row gives the area, as a percentage, to the left of a vertical line at $t_c$.
Example 1: For a $t$ distribution with 10 degrees of freedom, the value of $t$ that has an area 0.90 to its left is found in row 10 in the column marked "90%." You should verify that this is $\color{red}{t=1.372}$.
Alternatively, you can find this critical value of $t$ using the $\color{red}{invT}$ (the t inverse) function on the TI83/84+ calculator:
Example 2: Find the $t$ value that represents the $t$-score in the 50th percentile.
Solution: That value of $t$ will be 0 for any sample size, since every $t$ distribution curve is centered at the origin.
Example 3: Find the $t$ value that represents the first quartile of $t$-scores. Assume $n=8.$
Solution: The degrees of freedom that specify the correct $t$ distribution are $df = n - 1 = 7$, so the critical value is in the 7th row of the table. The $t$ value that represents the first quartile of $t$-scores is the value separating the lower 25% of $t$-scores from the upper 75%. The $t$-value we are looking for must be in the lower portion of the distribution, with area 25% to its left, as shown in the figure below. Since the $t$ distribution is symmetric about 0, this value is simply the negative (opposite) of the $t$-value that has an area of 0.75 to its left, or $\color{red}{-t = -0.711}$. Notice that the table does not give negative values; instead, we find the $t$-value that has an area of $1-0.25 = 0.75$ to its left and take its opposite.
Alternatively, you can find this critical value of $t$ using the $\color{red}{invT}$ (the t inverse) function on the TI83/84+ calculator:
Example 4: Suppose you have a sample of size 15 from a normal distribution. Find a value of $t$ such that only 2.5% of all values of $t$ will be smaller.
Solution: The degrees of freedom that specify the correct $t$ distribution are $df = n - 1 = 14$, and the necessary $t$-value must be in the lower portion of the distribution, with area 2.5% to its left, as shown in the figure below. Since the $t$ distribution is symmetric about 0, this value is simply the negative of the value on the right-hand side with area 0.975 to its left, or $-t = \color{red}{-2.145}$.
Example 6: Find the value of $t_c$ needed to set up a 90% confidence interval estimate for $\mu$. Assume the sample size is 23.
Solution: The degrees of freedom that specify the correct $t$ distribution are $df = n - 1 = 23 - 1 = 22$. The area under the standardized sampling distribution of $t$-scores to the left of a vertical line at $t_c$ is $\frac{1}{2}(1-c)+c$, or equivalently, $\frac{1}{2}(1+c) = \frac{1}{2}(1+0.90) = 0.95$. Therefore, the value of $t$ we need can be found in row 22 of the table, in the column labeled "95%." This value is $\color{red}{1.717}$.
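The table lookups in the examples above can be cross-checked numerically. The sketch below is one possible approach using only the Python standard library. It is not the calculator's algorithm. It integrates the $t$ density with Simpson's rule and then searches for the critical value by bisection. The function names (`t_cdf`, `inv_t`, `t_critical`), the step count, and the search range of $\pm 100$ are assumptions made for this illustration.

```python
import math

def t_cdf(x, df, steps=2000):
    """Area to the left of x under Student's t curve with df degrees of
    freedom: Simpson's rule on [0, |x|], then symmetry about 0."""
    if x == 0:
        return 0.5
    # Normalizing constant of the t density, computed on the log scale.
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                    - 0.5 * math.log(df * math.pi))
    kernel = lambda u: (1 + u * u / df) ** (-(df + 1) / 2)
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = kernel(0.0) + kernel(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * kernel(i * h)
    half_area = coef * total * h / 3
    return 0.5 + half_area if x > 0 else 0.5 - half_area

def inv_t(area_left, df, lo=-100.0, hi=100.0):
    """Bisection search for the t value with the given area to its left."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < area_left:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_critical(c, n):
    """Critical value t_c for confidence level c and sample size n,
    using area (1 + c) / 2 to the left and df = n - 1."""
    return inv_t((1 + c) / 2, n - 1)

print(round(inv_t(0.90, 10), 3))       # Example 1: table gives 1.372
print(round(inv_t(0.25, 7), 3))        # Example 3: table gives -0.711
print(round(inv_t(0.025, 14), 3))      # Example 4: table gives -2.145
print(round(t_critical(0.90, 23), 3))  # Example 6: table gives 1.717
```

Because the search works directly with the area to the left, negative critical values such as those in Examples 3 and 4 come out without the symmetry step that the table requires.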
Alternatively, you can find this critical value of $t$ using the $\color{red}{invT}$ (the t inverse) function on the TI83/84+ calculator:
$\bar{x}-E < \mu < \bar{x}+E$
where \[ E= t_{c}\cdot\frac{s}{\sqrt{n}} \] with \[ \eqalign{ n &= \text{sample size,}\cr s &= \text{sample standard deviation,}\cr c&= \text{the confidence level,}\cr df&= n-1 } \] and $t_{c}$ is the $t$ value on the $(n-1)^{st}$ row of the $t$-table that has an area of $\frac{1}{2}(1-c)+c=\frac{1}{2}(1+c)$ to the left of a vertical line at $t_c$.
Example: Suppose that for a random sample of 5 computers at a certain electronics store, the mean repair cost was $\$178$. The sample standard deviation was $\$32$. Assume the population is normally distributed. Construct a 99% confidence interval estimate for the population mean repair cost.
Solution: The point estimate for $\mu$ is the sample mean $\bar{x}=\$178$. The number of degrees of freedom is $df=n-1=4$, which tells us that the critical value $t_{c}$ for the margin of error formula is located on the 4th row of the table. The area to the left of this critical value is 99.5%, since \[ \eqalign{ \frac{1}{2}(1+c) & =\frac{1}{2}(1+0.99)\cr & = 0.995\cr & = 99.5\% } \] With 4 degrees of freedom and an area of 99.5% to the left, the $t$-table gives $t_c=4.604$. The margin of error is
\[ E=t_{c}\cdot\frac{s}{\sqrt{n}} =4.604\cdot \frac{\$32}{\sqrt{5}} \approx \$65.89 \] and the approximate 99% confidence interval is
$\bar{x}-E < \mu < \bar{x}+E$
$\$178-\$65.89 < \mu < \$178+\$65.89$
$\$112.11 < \mu < \$243.89$
We are 99% confident that the interval from $\$112.11$ to $\$243.89$ contains the true value of the population mean repair cost, $\mu$.
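The arithmetic in this solution can be reproduced in a few lines. This is a small sketch under the assumption that the critical value is read from the $t$-table and passed in by hand. The function name `t_interval_from_stats` is hypothetical.

```python
import math

def t_interval_from_stats(xbar, s, n, t_c):
    """Return (E, lower, upper) for the interval xbar - E < mu < xbar + E,
    where E = t_c * s / sqrt(n)."""
    margin = t_c * s / math.sqrt(n)
    return margin, xbar - margin, xbar + margin

# Repair-cost example: n = 5, sample mean $178, s = $32, 99% confidence.
# t_c = 4.604 is the table value for df = 4 with area 99.5% to its left.
E, lower, upper = t_interval_from_stats(178.0, 32.0, 5, 4.604)
print(f"E = {E:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Keeping $t_c$ as an explicit argument mirrors the hand calculation: the only step that depends on the table is isolated from the formula for $E$.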
Calculator Solution:
Conclusion: The 99% confidence interval for $\mu$ is from $\$112.11$ to $\$243.89$.
Find the 95% confidence interval for the mean using this sample.
625 675 535 406 512
680 483 522 619 575
Conclusion:
The 95% confidence interval for $\mu$ is from 500.35 to 626.05.
Notice the output from the TInterval command also gives us the point estimate (the mean of the sample), and the sample standard deviation, $s$.
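For readers without the calculator, the same interval can be computed from the raw data. The sketch below uses Python's `statistics` module for the sample mean and sample standard deviation. The critical value 2.262 is the standard $t$-table entry for 9 degrees of freedom with area 97.5% to its left, entered by hand here as an assumption rather than computed.

```python
import math
import statistics

data = [625, 675, 535, 406, 512,
        680, 483, 522, 619, 575]

n = len(data)
xbar = statistics.mean(data)   # point estimate of mu
s = statistics.stdev(data)     # sample standard deviation, (n - 1) divisor
t_c = 2.262                    # table value: df = n - 1 = 9, area 0.975 left

E = t_c * s / math.sqrt(n)     # margin of error
print(f"mean = {xbar:.1f}, s = {s:.2f}")
print(f"95% interval: {xbar - E:.2f} to {xbar + E:.2f}")
```

Like the TInterval output, the sketch reports the sample mean and $s$ along with the interval endpoints.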
Look at one of the columns in the $t$-table. As the degrees of freedom increase, the critical value of $t$ decreases until, when $df = \infty$, the critical $t$-value is the same as the critical $z$-value for the same tail area. For example, when the area left of $t_c$ is 95%, the values of $t_{c}$ start at 6.314 for 1 degree of freedom and decrease to a minimum of $t_{c} = z_{c} = 1.645$. This helps to explain why we use $n = 30$ as the somewhat arbitrary dividing line between large and small samples. When $n = 30 \quad (df = 29)$, the critical values of $t$ are quite close to their normal counterparts. Notice that $t_{c} = 1.699$ is quite close to $z_{c} = 1.645$. Rather than producing a $t$-table with rows for many more degrees of freedom, we treat the critical values of $z$ as sufficient once the sample size reaches $n = 30$.
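The way the 95% column of the $t$-table shrinks toward the $z$ value can be tabulated directly. The following sketch is an illustration only. It uses Simpson's rule and bisection for the $t$ critical values (the helper names `t_cdf` and `inv_t`, and the list of degrees of freedom, are assumptions) and `statistics.NormalDist` from the standard library for the $z$ critical value.

```python
import math
from statistics import NormalDist

def t_cdf(x, df, steps=2000):
    """Area to the left of x >= 0 under Student's t curve (Simpson's rule)."""
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                    - 0.5 * math.log(df * math.pi))
    kernel = lambda u: (1 + u * u / df) ** (-(df + 1) / 2)
    h = x / steps
    total = kernel(0.0) + kernel(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * kernel(i * h)
    return 0.5 + coef * total * h / 3

def inv_t(area_left, df, hi=100.0):
    """Bisection for a critical value in [0, hi]; needs area_left >= 0.5."""
    lo = 0.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < area_left:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z_c = NormalDist().inv_cdf(0.95)  # critical z with area 95% to its left
for df in (1, 2, 5, 10, 29, 100, 1000):
    t_c = inv_t(0.95, df)
    print(f"df = {df:4d}   t_c = {t_c:.3f}   t_c - z_c = {t_c - z_c:.3f}")
print(f"z_c = {z_c:.3f}")
```

The difference column should fall from nearly 4.7 at 1 degree of freedom to about 0.05 at 29 degrees of freedom, which is the sense in which $n = 30$ serves as a practical dividing line.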
The critical values in the $t$-table will allow you to make reliable inferences only if you follow all the rules; that is, your sample must meet these requirements specified by the $t$ distribution:
These requirements may seem quite restrictive. How can you possibly know the shape of the probability distribution for the entire population if you have only a sample? If this were a serious problem, however, the $t$ statistic could be used in only very limited situations. Fortunately, the shape of the $t$ distribution is not affected very much as long as the sampled population has an approximately mound-shaped distribution. Statisticians say that the $t$ statistic is robust, meaning that the distribution of the statistic does not change significantly when the normality assumptions are violated.
How can you tell whether your sample is from a normal population? Although there are statistical procedures designed for this purpose, the easiest and quickest way to check for normality is to use the graphical techniques of earlier chapters: draw a dotplot or construct a stem-and-leaf plot. As long as your plot tends to "mound up" in the center, you can be fairly safe in using the $t$ statistic for making inferences.
The random sampling requirement, on the other hand, is quite critical if you want to produce reliable inferences. If the sample is not random, or if it does not at least behave as a random sample, then your sample results may be affected by some unknown factor and your conclusions may be incorrect. When you design an experiment or read about experiments conducted by others, look critically at the way the data have been collected!
The point estimate for \(\sigma^2\) is \(s^2\), and the point estimate for \(\sigma\) is \(s\).
Example: Find the critical values $\chi^2_L$ and $\chi^2_R$ needed to set up a 95% confidence interval when the sample size is 18.
Solution:
Because the sample size is 18,
$$ df =n-1=17$$
The areas to the left of $\chi^2_L$ and $\chi^2_R$ are
Area to the left of $\chi^2_L = \dfrac{1-c}{2}=\dfrac{1-.95}{2}=0.025=2.5\%$
Area to the left of $\chi^2_R = \dfrac{1+c}{2}=\dfrac{1+.95}{2}=0.975=97.5\%$
Using $df=17$ and the areas 2.5% and 97.5%, you can find the critical values from the chi-square table. We find that
$$\chi^2_L =7.564$$
and
$$\chi^2_R =30.191$$
So, for a chi-square distribution curve with 17 degrees of freedom, 95% of the area under the curve lies between 7.564 and 30.191.
Question: Find the critical values $\chi^2_L$ and $\chi^2_R$ needed to set up a 90% confidence interval when the sample size is 11.
Solution:
Because the sample size is 11,
$$ df =n-1=10$$
The areas to the left of $\chi^2_L$ and $\chi^2_R$ are
Area to the left of $\chi^2_L = \dfrac{1-c}{2}=\dfrac{1-.90}{2}=0.05=5\%$
Area to the left of $\chi^2_R = \dfrac{1+c}{2}=\dfrac{1+.90}{2}=0.95=95\%$
Using $df=10$ and the areas 5% and 95%, you can find the critical values from the chi-square table. We find that
$$\chi^2_L =3.940$$
and
$$\chi^2_R =18.307$$
So, for a chi-square distribution curve with 10 degrees of freedom, 90% of the area under the curve lies between 3.940 and 18.307.
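The chi-square table values used above can also be checked numerically. The sketch below is one possible standard-library approach, not the method used to build the table. It evaluates the chi-square cumulative area through the series expansion of the regularized lower incomplete gamma function and inverts it by bisection. The function names (`chi2_cdf`, `inv_chi2`, `chi2_critical`) and the search bound are assumptions chosen for this illustration.

```python
import math

def chi2_cdf(x, df):
    """Area to the left of x under the chi-square curve with df degrees of
    freedom, via the series for the regularized incomplete gamma P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = 1.0 / a
    total = term
    k = a
    for _ in range(10000):
        k += 1
        term *= z / k
        total += term
        if term < total * 1e-16:  # remaining terms are negligible
            break
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def inv_chi2(area_left, df):
    """Bisection search for the chi-square value with the given left area."""
    lo, hi = 0.0, df + 10 * math.sqrt(2 * df) + 10  # generous upper bound
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < area_left:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def chi2_critical(c, n):
    """Return (chi2_L, chi2_R) for confidence level c and sample size n,
    using left areas (1 - c) / 2 and (1 + c) / 2 with df = n - 1."""
    df = n - 1
    return inv_chi2((1 - c) / 2, df), inv_chi2((1 + c) / 2, df)

for c, n in ((0.95, 18), (0.90, 11)):
    left, right = chi2_critical(c, n)
    print(f"c = {c}, n = {n}: chi2_L = {left:.3f}, chi2_R = {right:.3f}")
```

Unlike the $t$ distribution, the chi-square distribution is not symmetric, so the two critical values are found separately rather than as a positive and negative pair.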
Example: You randomly select and weigh a sample of 30 allergy medicine pills. The sample standard deviation is 1.2 milligrams. Assuming the weights are normally distributed, construct 99% confidence intervals for the population variance and standard deviation.
Solution:
Because the sample size is 30,
$$ df =n-1=29$$
The areas to the left of $\chi^2_L$ and $\chi^2_R$ are
Area to the left of $\chi^2_L = \dfrac{1-c}{2}=\dfrac{1-.99}{2}=0.005=0.5\%$
Area to the left of $\chi^2_R = \dfrac{1+c}{2}=\dfrac{1+.99}{2}=0.995=99.5\%$
Using $df=29$ and the areas 0.5% and 99.5%, you can find the critical values from the chi-square table. We find that
$$\chi^2_L =13.121$$
and
$$\chi^2_R =52.336$$
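The left endpoint of the confidence interval is
$$
\dfrac{(n-1)s^2}{\chi^2_R}=\dfrac{(30-1)(1.2)^2}{52.336}\approx 0.80
$$
The right endpoint of the confidence interval is
$$
\dfrac{(n-1)s^2}{\chi^2_L}=\dfrac{(30-1)(1.2)^2}{13.121}\approx 3.18
$$
So the 99% confidence interval for the population variance is $0.80 < \sigma^2 < 3.18$. Taking square roots of the endpoints gives the 99% confidence interval for the population standard deviation, $0.89 < \sigma < 1.78$ milligrams.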
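The endpoint calculations for the pill-weight example can be sketched in code as follows. The critical values are taken from the chi-square table and passed in by hand, and the function name `variance_interval` is hypothetical. The interval for the standard deviation is obtained by taking square roots of the variance endpoints.

```python
import math

def variance_interval(n, s, chi2_left, chi2_right):
    """Confidence interval for the population variance:
    (n - 1) s^2 / chi2_R  <  sigma^2  <  (n - 1) s^2 / chi2_L."""
    df = n - 1
    lower = df * s ** 2 / chi2_right  # larger critical value gives left endpoint
    upper = df * s ** 2 / chi2_left   # smaller critical value gives right endpoint
    return lower, upper

# Pill example: n = 30, s = 1.2 mg, 99% confidence (df = 29).
var_lo, var_hi = variance_interval(30, 1.2, 13.121, 52.336)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)

print(f"variance: {var_lo:.2f} to {var_hi:.2f}")
print(f"std dev : {sd_lo:.2f} to {sd_hi:.2f}")
```

Note that the right-hand critical value $\chi^2_R$ appears in the left endpoint, because dividing by the larger number gives the smaller bound.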
Question: Find the 95% confidence intervals for the population variance and standard deviation of the medicine weights.
Solution:
Because the sample size is 30,
$$ df =n-1=29$$
The areas to the left of $\chi^2_L$ and $\chi^2_R$ are
Area to the left of $\chi^2_L = \dfrac{1-c}{2}=\dfrac{1-.95}{2}=0.025=2.5\%$
Area to the left of $\chi^2_R = \dfrac{1+c}{2}=\dfrac{1+.95}{2}=0.975=97.5\%$
Using $df=29$ and the areas 2.5% and 97.5%, you can find the critical values from the chi-square table. We find that
$$\chi^2_L =16.047$$
and
$$\chi^2_R =45.722$$