# Statistical Inference

Statistical inference is concerned with making decisions or predictions about parameters — the numerical measures that characterize a population. Three parameters you encountered in earlier chapters are the population mean $\mu$, the population standard deviation $\sigma$, and the binomial proportion, $p$.

### Methods of Statistical Inference

• Estimation:   Estimating or predicting the value of the parameter
• Hypothesis testing:   Making a decision about the value of a parameter based on some preconceived idea about what its value might be

### Types of Estimators

To estimate the value of a population parameter, you can use the information from the sample in the form of an estimator. Estimators are calculated using information from the sample observations, and so estimators are themselves statistics.

### Definition

An estimator is a rule, expressed as a formula, that tells us how to calculate an estimate based on information in the sample.
Estimators are used in two different ways:

• Point estimation:   Based on sample data, a single number is calculated to estimate the population parameter. The rule or formula that describes this calculation is called the point estimator, and the resulting number is called a point estimate.
• Interval estimation: Based on sample data, two numbers are calculated to form an interval within which the parameter is expected to lie. The rule or formula that describes this calculation is called the interval estimator, and the resulting pair of numbers is called an interval estimate or confidence interval.

The sampling distribution of a point estimator provides the information needed to judge how good the estimator is. Two characteristics are valuable in a point estimator. First, the sampling distribution of the point estimator should be centered over the true value of the parameter to be estimated. That is, the estimator should not consistently underestimate or overestimate the parameter of interest. Such an estimator is said to be unbiased.

### Definition

An estimator of a parameter is said to be unbiased if the mean of its sampling distribution is equal to the true value of the parameter. Otherwise, the estimator is said to be biased.
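The definition can be checked by simulation. The sketch below is only an illustration under assumed values: the population ($\mu=50$, $\sigma=10$), the sample size, and the helper name `average_estimates` are all hypothetical. It averages the sample mean over many samples, and also compares the variance estimator with divisor $n-1$ against the one with divisor $n$, which is a standard example of a biased estimator (it tends to underestimate $\sigma^2$).

```python
import random
import statistics

def average_estimates(mu=50.0, sigma=10.0, n=10, reps=20000, seed=1):
    """Average three estimators over many simulated samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means, s2_values, biased_values = [], [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        s2_values.append(statistics.variance(sample))       # divisor n - 1
        biased_values.append(statistics.pvariance(sample))  # divisor n
    return (statistics.fmean(means),
            statistics.fmean(s2_values),
            statistics.fmean(biased_values))

avg_mean, avg_s2, avg_biased = average_estimates()
print(f"average sample mean:            {avg_mean:.2f} (mu = 50)")
print(f"average variance, divisor n-1:  {avg_s2:.1f} (sigma^2 = 100)")
print(f"average variance, divisor n:    {avg_biased:.1f} (sigma^2 = 100)")
```

The first two averages should land close to the true values, while the third should sit noticeably below $\sigma^2$, which is what "biased" means in practice.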

The sampling distributions for an unbiased estimator and a biased estimator are shown in the figure. The sampling distribution for the biased estimator is shifted to the right of the true value of the parameter. This biased estimator is more likely than an unbiased one to overestimate the value of the parameter.

The second desirable characteristic of an estimator is that the spread (as measured by the variance) of the sampling distribution should be as small as possible. This ensures that, with a high probability, an individual estimate will fall close to the true value of the parameter. The sampling distributions for two unbiased estimators, one with a small variance and the other with a larger variance, are shown in the figure below. Naturally, you would prefer the estimator with the smaller variance because its estimates tend to lie closer to the true value of the parameter than those of the estimator with the larger variance.

### The Margin of Error

In real-life sampling situations, you may know that the sampling distribution of an estimator centers about the parameter that you are attempting to estimate, but all you have is the estimate computed from the n measurements contained in the sample. How far from the true value of the parameter will your estimate lie? How close is the marksman's bullet to the bull's-eye? The distance between the estimate and the true value of the parameter is called the error of estimation.

### Definition

The distance between an estimate and the estimated parameter is called the error of estimation (or sampling error). For the sample mean, the sampling error is $\bar{x}-\mu$.

For the work we do, you may assume that the sample sizes are always large and, therefore, that the unbiased estimators you will study have sampling distributions that can be approximated by a normal distribution (because of the Central Limit Theorem). Remember that, for any point estimator with a normal distribution, the Empirical Rule states that approximately 95% of all the point estimates will lie within two (or more exactly, 1.96) standard deviations of the mean of that distribution. For unbiased estimators, this implies that, in about 95% of samples, the difference between the point estimator and the true value of the parameter will be less than 1.96 standard deviations or 1.96 standard errors (SE). This quantity, called the margin of error (E), provides a practical upper bound for the error of estimation (see the figure below). It is possible that the error of estimation will exceed this margin of error (E), but that is very unlikely.

### Definition

The margin of error, $E$ (also called the maximum error in the estimate of a parameter), is the maximum likely difference between the point estimate ( $$\bar{x}$$ or $$\hat{p}$$ ) and the population parameter it is estimating ( $\mu$ or $$p$$ ). That is, $\Big|\bar{x}-\mu\Big|\leq \text{E} \qquad \text{and} \qquad \Big|\hat{p}-p\Big|\leq \text{E}$

### Point Estimation of a Population Parameter

• Point estimator:  a statistic calculated using sample measurements
• Margin of Error:  1.96 $\times$ Standard error of the estimator

The sampling distributions for the two unbiased point estimators, $$\bar{x}$$ and $$\hat{p}$$ were discussed earlier. It can be shown that both of these point estimators have the minimum variability of all unbiased estimators and are thus the best estimators you can find in each situation.

### How Can I Estimate a Population Mean?

To estimate the population mean $\mu$ for a quantitative population, the point estimator $\bar{x}$ is unbiased with standard error given as $\text{SE}= \dfrac{\sigma}{\sqrt{n}}$. The margin of error is calculated as $\text{E}= 1.96\cdot\dfrac{\sigma}{\sqrt{n}}$. If $\sigma$ is unknown and $n$ is 30 or larger, the sample standard deviation $s$ can be used to approximate $\sigma$.

When you sample a normal distribution, the statistic $\dfrac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}$ has a $t$ distribution, which will be discussed later. When the sample is large, this statistic is approximately normally distributed whether the sampled population is normal or nonnormal.
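To see why 1.96 standard errors is a practical bound on the error of estimation, the following sketch simulates many samples and counts how often $|\bar{x}-\mu|$ stays within $E$. The population values ($\mu=100$, $\sigma=15$, $n=50$) and the function name `fraction_within_margin` are assumptions made only for this illustration.

```python
import math
import random
import statistics

def fraction_within_margin(mu=100.0, sigma=15.0, n=50, reps=10000, seed=2):
    """Fraction of simulated sample means whose error is at most E."""
    rng = random.Random(seed)
    margin = 1.96 * sigma / math.sqrt(n)  # E = 1.96 * SE
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        if abs(xbar - mu) <= margin:
            hits += 1
    return hits / reps

frac = fraction_within_margin()
print(f"fraction of samples with |xbar - mu| <= E: {frac:.3f}")
```

The printed fraction should be close to 0.95, with some run-to-run variation if the seed is changed.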

### How Can I Estimate a Population Proportion?

To estimate the population proportion $p$ for a binomial population, the point estimator $\hat{p} = \dfrac{x}{n}$ is unbiased with standard error given as $\text{SE} = \sqrt{\frac{pq}{n}}$ where $q=1-p$. The margin of error is calculated as $E = 1.96\cdot\sqrt{\frac{pq}{n}}$ but has to be estimated as $E = 1.96\cdot\sqrt{\frac{\hat{p}\hat{q}}{n}}$.

Assumptions: $np \geq 5$ and $nq \geq 5$ (so that the sampling distribution of $\hat{p}$ is approximately normal). Since $p$ and $q$ are unknown, use $n\hat{p} \geq 5$ and $n\hat{q} \geq 5$.

You should notice that in calculating the standard errors for these two point estimates, you needed to estimate $\sigma$ with $s$, $p$ with $\hat{p}$ and $q$ with $\hat{q}$. These approximate standard errors will differ only slightly from the true value of SE when the sample size $n$ is large, and they will have little effect on the margin of error. In fact, the table below shows that, for most values of $p$, especially when $p$ is between 0.3 and 0.7, there is very little change in $\sqrt{pq}$, the numerator of SE, as $p$ changes.

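| $p$ | $pq$ | $\sqrt{pq}$ |
|-----|------|-------------|
| 0.1 | 0.09 | 0.30 |
| 0.2 | 0.16 | 0.40 |
| 0.3 | 0.21 | 0.46 |
| 0.4 | 0.24 | 0.49 |
| 0.5 | 0.25 | 0.50 |
| 0.6 | 0.24 | 0.49 |
| 0.7 | 0.21 | 0.46 |
| 0.8 | 0.16 | 0.40 |
| 0.9 | 0.09 | 0.30 |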
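The table values can be regenerated with a few lines of Python. This is a small sketch; the function name `sqrt_pq_table` is made up for the illustration.

```python
import math

def sqrt_pq_table():
    """Return (p, pq, sqrt(pq)) rows for p = 0.1, 0.2, ..., 0.9."""
    rows = []
    for i in range(1, 10):
        p = i / 10
        pq = p * (1 - p)
        rows.append((p, pq, math.sqrt(pq)))
    return rows

table = sqrt_pq_table()
for p, pq, root in table:
    print(f"{p:.1f}  {pq:.2f}  {root:.2f}")
```

Reading down the last column shows how flat $\sqrt{pq}$ is near $p=0.5$, which is why estimating $p$ with $\hat{p}$ changes the standard error so little.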

Example:   A marketing analyst wants to estimate the average amount spent by a dating site customer per year. A random sample of $n=50$ dating site customers was polled about the amount they spend each year on dating websites. The results of the poll produced a mean amount of $\$240$ with a standard deviation of $\$20$. Use this information to estimate the population mean amount spent by a dating website customer per year.
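One way to work this example is sketched below. It assumes that the sample standard deviation $s=\$20$ stands in for $\sigma$ (reasonable here because $n \geq 30$) and that the 95% margin of error with the factor 1.96 is wanted; the function name `margin_of_error_mean` is hypothetical.

```python
import math

def margin_of_error_mean(sd, n, z=1.96):
    """Margin of error E = z * sd / sqrt(n) for a sample mean."""
    return z * sd / math.sqrt(n)

xbar = 240.0  # point estimate of mu, in dollars
E = margin_of_error_mean(sd=20.0, n=50)
print(f"point estimate: ${xbar:.2f}, margin of error: ${E:.2f}")
```

The point estimate of $\mu$ is simply the sample mean, and the margin of error tells you how far off that estimate is likely to be.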

## Try This!!

Suppose that for a random sample of 120 TVs at a certain electronics store, the mean repair cost was $\$220$. Assume the population standard deviation was $\$16$. Construct a 95% confidence interval estimate for the population mean repair cost.

Solution:   The random variable, $x$, represents the cost, in dollars, of a TV repair. The point estimate of $\mu$ is $\bar{x}=\$220$. The margin of error is $$E=1.96\cdot SE = 1.96\sigma_{\bar{x}}= 1.96\frac{\sigma}{\sqrt{n}}=1.96\frac{16}{\sqrt{120}}\doteq\$2.86$$ The approximate 95% confidence interval is $$\eqalign{ \bar{x} & \pm \ E\cr \bar{x} & \pm \ z_{c}\biggr( \frac{\sigma}{\sqrt{n}} \biggr)\cr \$220 & \pm \ 1.96\biggr( \frac{\$16}{\sqrt{120}} \biggr)\cr \$220 & \pm \ \$2.86 }$$ The 95% confidence interval for $\mu$ is from $\$217.14$ to $\$222.86$.

# Interpreting Confidence Intervals

What does it mean to say you are "95% confident" that the true value of the population mean $\mu$ is within a given interval? If you were to construct 20 such intervals, each using different sample information, your intervals might look like those shown in the figure below. Of the 20 intervals, you might expect that 95% of them, 19 out of 20, will perform as planned and contain $\mu$ within their upper and lower bounds.

Remember that you cannot be absolutely sure that any one particular interval contains the mean $\mu$. You will never know whether your particular interval is one of the 19 that "worked," or whether it is the one interval that "missed." Your confidence in the estimated interval follows from the fact that when repeated intervals are calculated, 95% of these intervals will contain $\mu$.

A good confidence interval has two desirable characteristics:

• It is as narrow as possible. The narrower the interval, the more exactly you have located the estimated parameter.
• It has a large level of confidence, near 100%. The larger the confidence level, the more likely it is that the interval will contain the estimated parameter.

Example:   A researcher wants to estimate the average amount of time, in hours per day, that a U.S. teenager spends consuming media: watching TV, listening to music, surfing the Web, social networking, and playing video games.
A random sample of $n=1000$ U.S. teenagers was polled about the amount of time they spend daily consuming media. The results of the poll produced a mean amount of $7.6$ hours with a standard deviation of 2.2 hours. Use this information to estimate the population mean. Use a 90% level of confidence. (Source: The Washington Post)

Solution:   The random variable, $x$, represents the time, in hours per day, that a U.S. teenager spends consuming media. The point estimate of $\mu$ is $\bar{x}=7.6$. The critical value $z_{c}$ can be found in the table of common values. (Alternatively, we could use the calculator, $z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=1.645$ with $c=0.90$.) The margin of error is $$E=1.645\cdot SE = 1.645\cdot\sigma_{\bar{x}}= 1.645\cdot\frac{\sigma}{\sqrt{n}}$$ Since the sample size is large (greater than 30), the researcher can approximate the value of $\sigma$ with $s$. Therefore, the margin of error is approximately $$E=1.645\cdot SE \doteq 1.645\cdot\frac{s}{\sqrt{n}}=1.645\cdot\frac{2.2}{\sqrt{1000}}\approx0.114 \text{ hours}$$ The approximate 90% confidence interval is $$\eqalign{ \bar{x} & \pm \ E\cr 7.6 & \pm \ 0.114 }$$ The 90% confidence interval for $\mu$ is from 7.49 hours to 7.71 hours per day.

## Finding $z_{c}$ for the Margin of Error

Example:   Find the critical value $z_{c}$ that must be used in the margin of error formula for a 92% confidence interval estimate for $\mu$.
Solution:   Notice in the figure below that the area under the standard normal distribution, left of a vertical line at $z_c$, is $$\eqalign{ \dfrac{1}{2}(1-c)+c & = \dfrac{1}{2} -\dfrac{1}{2}\cdot c+1\cdot c \cr & =\dfrac{1}{2} +\Bigg(-\dfrac{1}{2}c\Bigg)+1c \cr & = \dfrac{1}{2} +\Bigg(-\dfrac{1}{2}+1\Bigg)\cdot c \cr & = \dfrac{1}{2} + \dfrac{1}{2}\cdot c \cr & = \dfrac{1}{2} (1+c) }$$ Use the "invnorm" command on the calculator: $$z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.92)\Bigg)\doteq1.75$$ Some of the TI calculators prompt you to enter values for $\mu$ and $\sigma$ when using the invnorm command. If that is the case, make sure you are using the values of $\mu$ and $\sigma$ that correspond to the standard normal distribution of z scores. That is, use $\mu=0$ and $\sigma=1$.

## Try This!!

Find the critical value $z_{c}$ that must be used in the margin of error formula for an 88% confidence interval estimate for $\mu$.

Solution:   Use the "invnorm" command on the calculator (with $\mu=0$ and $\sigma=1$ if the calculator prompts for them): $$z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.88)\Bigg)\doteq1.55$$

Example:   Find a 92% confidence interval estimate for $\mu$ assuming the sample size $n$ was 200. Suppose $\sigma$ was known to be 33.3 from past history and that the sample mean, $\bar{x}$, was found to be 250.5.

Solution:   The point estimate of $\mu$ is $\bar{x}=250.5$. The critical value of $z_{c}$ for the margin of error formula was found in the above example to be about 1.75.
The margin of error is then $$\eqalign{ E & =z_{c}\cdot SE \cr & = z_{c}\cdot\sigma_{\bar{x}} \cr & = z_{c}\cdot\frac{\sigma}{\sqrt{n}} \cr & = 1.75\cdot\frac{33.3}{\sqrt{200}} \cr & \doteq 4.1 }$$ The approximate 92% confidence interval is $$\eqalign{ \bar{x} & \pm \ E\cr 250.5 & \pm \ 4.1 }$$ The 92% confidence interval for $\mu$ is from 246.4 to 254.6.

## Finding Point Estimate and E from a Confidence Interval

When we use technology to estimate the confidence interval, the result is often expressed as an interval, such as (18.256, 23.744). The sample mean $\bar{x}$ is the value midway between those limits, and the margin of error E is one-half the difference between those limits (because the upper limit is $\bar{x}+E$ and the lower limit is $\bar{x}-E$, the distance separating them is 2E). $$\eqalign{ \text{Point estimate of } \mu : & \qquad \bar{x}=\dfrac{(\text{upper confidence limit})+(\text{lower confidence limit})}{2} \cr & \cr \text{Margin of Error: } & \qquad E=\dfrac{(\text{upper confidence limit})-(\text{lower confidence limit})}{2} }$$

Example:   The calculator screen below displays the results from counting the number of different types of donuts in a sample of 100. Use the given confidence interval to find the point estimate $\bar{x}$ and the margin of error E.

Solution:   $$\eqalign{ \bar{x} &=\dfrac{(\text{upper confidence limit})+(\text{lower confidence limit})}{2} \cr &=\dfrac{23.744+18.256}{2}\cr & = 21 }$$ and $$\eqalign{ E &=\text{upper confidence limit} - \text{point estimate} \cr &=(\bar{x}+E)-\bar{x}\cr &= 23.744-21\cr & = 2.744 }$$

## Using the Calculator to find the Confidence Interval

The TI-83/84 Plus calculator can be used to generate confidence intervals for original sample values stored in a list, or you can use the summary statistics $n$, $\bar{x}$, and $\sigma$. Either enter the data in list L1 or have the summary statistics available, then press the STAT key. Now select TESTS and choose ZInterval.
After making the required entries, the calculator display will include the confidence interval in the format $(\bar{x}-E, \ \ \bar{x}+E)$.

## Finding a z Confidence Interval for the Mean (Statistics)

Example:   A researcher wants to estimate the average amount of time, in hours per day, that a U.S. teenager spends consuming media: watching TV, listening to music, surfing the Web, social networking, and playing video games. A random sample of $n=1000$ U.S. teenagers was polled about the amount of time they spend daily consuming media. The results of the poll produced a mean amount of $7.6$ hours with a standard deviation of 2.2 hours. Use this information to estimate the population mean. Use a 90% level of confidence. (Source: The Washington Post)

1. Press STAT and move the cursor to TESTS.
2. Press 7 for ZInterval.
3. Move the cursor to Stats and press ENTER.
4. Type in the appropriate values.
5. Move the cursor to Calculate and press ENTER.

Conclusion:   The 90% confidence interval for $\mu$ is from 7.49 hours to 7.71 hours per day.

## Finding a z Confidence Interval for the Mean (Data)

Find the 95% confidence interval for the mean using this sample.

45 52 35 22 62 34 42 46 53 58 36 40 43 16 23 54 27 32 24 53 62 67 84 36 44 49 57 35 25 30

1. Enter the data into L1. (Press STAT $\to$ ENTER to access L1)
2. Press STAT and move the cursor to TESTS.
3. Press 7 for ZInterval.
4. Move the cursor to Data and press ENTER.
5. Type in the appropriate values.
6. Move the cursor to Calculate and press ENTER.

The population standard deviation $\sigma$ is unknown. Since the sample size is $n \geq30$, you can use the sample standard deviation $s$ as an approximation for $\sigma$. After the data values are entered in L1 (step 1 above), press STAT, move the cursor to CALC, press 1 for 1-Var Stats, then press ENTER. The sample standard deviation of 15.50246551 will be one of the statistics listed. Then continue with step 2.
At step 5 on the line for $\sigma$, press VARS for variables, press 5 for Statistics, press 3 for $S_x$.

Conclusion:   The 95% confidence interval for $\mu$ is from $37.3$ to $48.4$.

Notice that the output from the ZInterval command also gives the sample mean and the sample standard deviation.
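For readers without a TI calculator, the invnorm and ZInterval steps above can be approximated in Python. This is a sketch, not a replacement for the calculator procedure: it uses `statistics.NormalDist().inv_cdf` from the standard library in place of invnorm, and the function names `z_critical`, `z_interval`, and `estimate_and_margin` are hypothetical helpers introduced here.

```python
import math
from statistics import NormalDist, fmean, stdev

def z_critical(c):
    """Critical value z_c with area (1 + c) / 2 to its left."""
    return NormalDist().inv_cdf((1 + c) / 2)

def z_interval(xbar, sd, n, c):
    """Confidence interval xbar +/- z_c * sd / sqrt(n)."""
    margin = z_critical(c) * sd / math.sqrt(n)
    return xbar - margin, xbar + margin

def estimate_and_margin(lower, upper):
    """Recover the point estimate and E from interval limits."""
    return (upper + lower) / 2, (upper - lower) / 2

# Summary statistics: the media-consumption example (90% confidence).
print(z_interval(7.6, 2.2, 1000, 0.90))

# Raw data: the 30 values from the ZInterval (Data) example,
# with s used in place of sigma because n >= 30.
data = [45, 52, 35, 22, 62, 34, 42, 46, 53, 58, 36, 40, 43, 16, 23,
        54, 27, 32, 24, 53, 62, 67, 84, 36, 44, 49, 57, 35, 25, 30]
low, high = z_interval(fmean(data), stdev(data), len(data), 0.95)
print(round(low, 1), round(high, 1))

# Working backward from an interval to the point estimate and E.
print(estimate_and_margin(18.256, 23.744))
```

Because `inv_cdf` does not round the critical value to two decimals, results can differ in the last digit from hand calculations that use 1.75 or 1.645.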

# Sample Size Determination

As the level of confidence is increased, the confidence interval widens. As the confidence interval widens, the precision of the estimate decreases. One way to improve the precision of an estimate without decreasing the level of confidence is to increase the sample size. But how large a sample size is needed to guarantee a certain level of confidence for a given margin of error? By using the formula for margin of error $$E = z_c\cdot\frac{\sigma}{\sqrt{n}}$$ a formula can be derived (by solving this formula for $n$) as shown in the next definition.
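Spelled out, the algebra is short: multiply both sides of the margin of error formula by $\sqrt{n}$, divide by $E$, and then square both sides.

$$\eqalign{ E & = z_c\cdot\frac{\sigma}{\sqrt{n}} \cr \sqrt{n}\cdot E & = z_c\cdot\sigma \cr \sqrt{n} & = \frac{z_c\cdot\sigma}{E} \cr n & = \biggr(\frac{z_c\cdot\sigma}{E}\biggr)^2 }$$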

### Find a Minimum Sample Size to Estimate $\mu$

Given a $c$-confidence level and a margin of error E, the minimum sample size $n$ needed to estimate the population mean $\mu$ is $$n=\biggr(\frac{z_c\cdot\sigma}{E}\biggr)^2$$ Always round the value up to the next whole number. When $\sigma$ is unknown, you can estimate $n$ using $s$, provided you have a preliminary sample with at least 30 members.

Example:  A pizza shop owner wishes to find the 95% confidence interval estimate for the true mean cost of a large pepperoni pizza. How large should the sample be if she wishes to be accurate within $\$0.15$? A previous study showed that the standard deviation of the price was $\$0.26$.

Solution:   $$\eqalign{ n & =\biggr(\frac{z_c\cdot \sigma}{E}\biggr)^2 \cr & =\biggr(\frac{1.96\cdot\$0.26}{\$0.15}\biggr)^2 \cr & = 11.54187378 \color{red}{\text{ always round up!}}\cr & \doteq 12 }$$ The minimum sample size required is 12 prices.
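The same calculation can be sketched in Python. The helper name `min_sample_size_mean` is hypothetical; it uses `math.ceil` for the "always round up" rule and, unless a rounded table value is passed in as `z`, computes $z_c$ with the standard library's `NormalDist().inv_cdf`.

```python
import math
from statistics import NormalDist

def min_sample_size_mean(sigma, margin, c, z=None):
    """Smallest n with z_c * sigma / sqrt(n) <= margin."""
    if z is None:
        z = NormalDist().inv_cdf((1 + c) / 2)  # unrounded critical value
    return math.ceil((z * sigma / margin) ** 2)  # always round up

# Pizza example: sigma = 0.26, E = 0.15, 95% confidence.
print(min_sample_size_mean(0.26, 0.15, 0.95))
```

Passing `z=1.96` reproduces the hand calculation exactly; in borderline cases the rounded and unrounded critical values can give sample sizes that differ by one.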

## Try This!!

Question Determine the minimum sample size required when you want to be 99% confident that the sample mean is within two units of the population mean and $\sigma=1.4$. Assume the population is normally distributed.

Solution:   We are given $\sigma= 1.4$, $E=2$ and $c=0.99$. Then $$z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.99)\Bigg)\doteq2.58$$ and $$\eqalign{ n & =\biggr(\frac{z_c\cdot \sigma}{E}\biggr)^2 \cr & =\biggr(\frac{2.58\cdot1.4}{2}\biggr)^2 \cr & = 3.261636 \color{red}{\text{ always round up!}}\cr & \doteq 4 }$$ The minimum sample size required is 4 measurements.

## Try This!!

Question A beverage company uses a machine to fill one-liter bottles with water. Assume the population of volumes is normally distributed. The company wants to estimate the mean volume of water the machine is putting in the bottles to within one milliliter (mL). Determine the minimum sample size required to construct a 96% confidence interval estimate for $\mu$. Assume the population standard deviation is 3 milliliters.

Solution:   We are given $\sigma= 3 \text{ mL}$, $E=1 \text{ mL}$ and $c=0.96$. Then $$z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.96)\Bigg)\doteq2.05$$ and $$\eqalign{ n & =\biggr(\frac{z_c\cdot \sigma}{E}\biggr)^2 \cr & =\biggr(\frac{2.05\cdot3 \text{ mL}}{1 \text{ mL}}\biggr)^2 \cr & = 37.8225 \color{red}{\text{ always round up!}}\cr & \doteq 38 }$$ The minimum sample size required is 38 one-liter bottles.

# Confidence Interval for a Population Proportion $p$

Many research experiments or sample surveys have as their objective the estimation of the proportion of people or objects in a large group that possess a certain characteristic. Here are some examples:

• The proportion of Americans who have internet access
• The proportion of those who believe there is solid evidence that Earth is getting warmer
• The proportion of residents who closely follow the local news or who often discuss local crime

Each is a practical example of the binomial experiment, and the parameter to be estimated is the binomial proportion, $p$. When the sample size is large, $\hat{p}=\frac{x}{n}=\frac{\text{total number of successes}}{\text{total number of trials}}$ is the best point estimator for the population proportion $p$. Since its sampling distribution is approximately normal, with mean $p$ and standard error SE $= \sqrt{\frac{pq}{n}}$,   $\hat{p}$ can be used to construct a confidence interval according to the general approach given here.

### Formula for a $c$-percent Confidence Interval Estimate for a Population Proportion, $p$.

$\hat{p}-E$   <  $p$  <  $\hat{p}+E$

where $$E = z_{c}\cdot\sqrt{\frac{\hat{p}\hat{q}}{n}}$$ and $$z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)$$

Here $z_{c}$ is the $z$ value corresponding to an area $\dfrac{1}{2}(1-c)$ in the right tail of a standard normal z distribution. Since $p$ and $q$ are unknown, their values are estimated using the best point estimators: $\hat{p}$ and $\hat{q}$ (where $\hat{q}=1-\hat{p}$). The sample size is considered large when the normal approximation to the binomial distribution is adequate, that is, when both $$np>5 \quad \text{and} \quad nq>5$$ (but we don't have values for $p$ and $q$, so we use $\hat{p}$ and $\hat{q}$ and check that both $n\hat{p}>5$ and $n\hat{q}>5$).

Example:   A reporter wants to estimate the percentage of residents in her city who would rate their city as an excellent place to live. The reporter wants to estimate each percentage by race or ethnicity. It is found that residents in her city are predominantly white, non-Hispanic. A random sample of 120 people is selected from among those who categorized their race/ethnicity as "white, non-Hispanic," and 55% indicated they would rate their city as an excellent place to live. Another random sample of 120 "Hispanic" individuals is drawn, and suppose 40% indicated they would rate their city as an excellent place to live. Use this information to estimate the population proportion of Hispanic residents who would rate their city as an excellent place to live. Use a 99% level of confidence. (Source: journalism.org)

Solution:   The random variable, $x$, represents the number of Hispanic residents among the 120 sampled who would rate their city as an excellent place to live. The point estimate of $p$ is $\hat{p}=0.40$. (There were $x=n\cdot\hat{p}=120\cdot0.40=48$ individual successes.)  Also, notice that both $n\hat{p}>5$ and $n\hat{q}>5$, so that the sampling distribution of $\hat{p}$ is approximately normal. Then, the standard error (the standard deviation of the sampling distribution) can be approximated as $$\sqrt{\frac{pq}{n}}\approx \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{0.40\cdot0.60}{120}}\doteq0.0447213595$$ The value for $z_c$ in the margin of error formula is found with the calculator using $z_c= invnorm\Bigg( \frac{1}{2}(1+c)\Bigg)= invnorm\Bigg( \frac{1}{2}(1+0.99)\Bigg) \doteq 2.58$. The margin of error is then approximated as $$E= z_{c}\cdot\sqrt{\frac{\hat{p}\hat{q}}{n}}=2.58\cdot0.0447213595\approx0.115$$ The approximate 99% confidence interval is $$\eqalign{ \hat{p} & \pm \ E\cr 0.40 & \pm \ 0.115\cr }$$ Then $0.40-0.115=0.285=28.5\%$ and $0.40+0.115=0.515=51.5\%.$

The 99% confidence interval for $p$ is from 28.5% to 51.5%.
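The steps of this solution can be sketched as a small Python function. The name `proportion_interval` is hypothetical, the large-sample check uses the $n\hat{p}>5$ and $n\hat{q}>5$ rule stated above, and $z_c$ comes from the standard library's `NormalDist().inv_cdf` rather than a rounded table value.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, c):
    """Large-sample confidence interval for a binomial proportion p."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # Large-sample check from the text: n*p_hat > 5 and n*q_hat > 5.
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf((1 + c) / 2)
    margin = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

# Example above: 48 successes out of 120, 99% confidence.
low, high = proportion_interval(48, 120, 0.99)
print(f"{low:.3f} to {high:.3f}")
```

Raising an error when the check fails is a design choice for this sketch; the point is that the formula should not be applied when the normal approximation is doubtful.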

## Try This!!

A reporter wants to estimate the percentage of U.S. households that have internet. A random sample of 4000 U.S. households found 2880 households that had internet. Construct a 98% confidence interval estimate of the percentage of U.S. households that have internet.

Solution:   The random variable, $x$, represents the number of U.S. households among the 4000 sampled that had internet. The point estimate of $p$ is $\hat{p}=\frac{x}{n}=\frac{2880}{4000}=0.72$. The standard error of the sampling distribution is approximated as $$\sqrt{\frac{pq}{n}}\approx \sqrt{\frac{\hat{p}\hat{q}}{n}} = \sqrt{\frac{0.72\cdot0.28}{4000}}\doteq0.0070992957$$ The value for $z_c$ is $z_c= invnorm\Bigg( \frac{1}{2}(1+c)\Bigg)= invnorm\Bigg( \frac{1}{2}(1+0.98)\Bigg) \doteq 2.33$. The margin of error is then approximated as $$E= z_{c}\cdot\sqrt{\frac{\hat{p}\hat{q}}{n}}=2.33\cdot0.0070992957\approx0.017$$ The approximate 98% confidence interval is $$\eqalign{ \hat{p} & \pm \ E\cr 0.72 & \pm \ 0.017\cr }$$ Then $0.72-0.017=0.703=70.3\%$ and $0.72+0.017=0.737=73.7\%.$

The 98% confidence interval for $p$ is from 70.3% to 73.7%.

## Calculator Example

Question     There were 200 nursing applications in a sample, and 12% of the applicants were male. Find the 99% confidence interval for the true proportion of male applicants.

Solution     $x$ is the number of applications by males in $n=200$ applications. We are given the sample proportion, $\hat{p}=12\%$. To get the value of $x$ for the calculator, we use $x=n\cdot\hat{p}=200(0.12) =24$.

1. Press the STAT button on your calculator and move the cursor to TESTS.
2. Press A (ALPHA, MATH) for 1-PropZInt.
3. Type in the appropriate values.
4. Move the cursor to Calculate and press ENTER.

The 99% confidence interval estimate for the population percentage of male applicants is between 6.1% and 17.9%.

# Sample Size Determination

### Find a Minimum Sample Size to Estimate a Binomial Population Proportion, $p$

Given a $c$-confidence level and a margin of error E, the minimum sample size $n$ needed to estimate the population proportion, $p$, is $$n=\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2$$ where $\hat{q}=1-\hat{p}$. Always round the value up to the next whole number. This formula assumes you have preliminary or previous estimates of $\hat{p}$ and $\hat{q}$. If not, use $\hat{p}=0.50$ and $\hat{q}=0.50$, since the product $\hat{p}\cdot\hat{q}$, and therefore the formula, takes its maximum value at $\hat{p}=\hat{q}=0.50$. Also note that for this formula the margin of error is a percentage (as a decimal), a unitless quantity.

Example:  You wish to estimate, with 90% confidence, the population proportion of U.S. adults who are confident in the stability of the U.S. banking system. Your estimate must be accurate to within 3% of the population proportion.

(a) No preliminary or previous estimate is available. Find the minimum sample size needed.

(b) Find the minimum sample size needed, using a prior study that found 43% of U.S. adults are confident in the stability of the system.

(c) Compare the results from (a) and (b).

Solution (a):   No preliminary or previous estimate is available for $\hat{p}$ and $\hat{q}$, so we assume $\hat{p}=0.50$ and $\hat{q}=0.50$. We are given $E=3\%=0.03$ and $c=0.90$. Then $$z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.90)\Bigg)\doteq1.64$$ and $$\eqalign{ n & =\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2 \cr & ={0.5}\cdot{0.5}\cdot\biggr(\frac{1.64}{0.03}\biggr)^2 \cr & = 747.111111111 \color{red}{\text{ always round up!}}\cr & \doteq 748 }$$ The minimum sample size required is 748 U.S. adults.

Solution (b):   A preliminary estimate for $\hat{p}$ was given as 0.43, so the estimate for $\hat{q}$ is $\hat{q}=1-\hat{p}=1-0.43=0.57$. We are given $E=3\%=0.03$ and $c=0.90$. Then $$\eqalign{ n & =\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2 \cr & ={0.43}\cdot{0.57}\cdot\biggr(\frac{1.64}{0.03}\biggr)^2 \cr & = 732.4677333 \color{red}{\text{ always round up!}}\cr & \doteq 733 }$$ The minimum sample size required is 733 U.S. adults.

Solution (c):   Having an estimate of the population proportion reduces the minimum sample size needed.
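Parts (a) and (b) can be reproduced with a short Python sketch. The helper name `min_sample_size_proportion` is hypothetical; its `z` argument lets you pass the rounded critical value used in the hand calculation, since the unrounded value from `NormalDist().inv_cdf` is slightly larger for $c=0.90$ and leads to a somewhat larger $n$.

```python
import math
from statistics import NormalDist

def min_sample_size_proportion(margin, c, p_hat=0.5, z=None):
    """Smallest n for estimating p to within `margin` at confidence c.

    p_hat defaults to 0.5, the most conservative choice when no
    preliminary estimate is available.
    """
    if z is None:
        z = NormalDist().inv_cdf((1 + c) / 2)  # unrounded critical value
    q_hat = 1 - p_hat
    return math.ceil(p_hat * q_hat * (z / margin) ** 2)  # always round up

# (a) no prior estimate, (b) prior estimate 0.43; both use z = 1.64.
n_a = min_sample_size_proportion(0.03, 0.90, z=1.64)
n_b = min_sample_size_proportion(0.03, 0.90, p_hat=0.43, z=1.64)
print(n_a, n_b)

# Same question with the unrounded critical value.
print(min_sample_size_proportion(0.03, 0.90))
```

The comparison shows that the required sample size is sensitive to how $z_c$ is rounded, so it is worth stating which critical value was used.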

## Try This!!

Question Determine the minimum sample size required when you want to be 95% confident that the sample proportion is within two percentage points of the population proportion. Assume the population is normally distributed.

Solution:   No preliminary or previous estimate is available for $\hat{p}$ and $\hat{q}$, so we assume $\hat{p}=0.50$ and $\hat{q}=0.50$. We are given $E=2\%=0.02$ and $c=0.95$. Then $$z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.95)\Bigg)\doteq1.96$$ and $$\eqalign{ n & =\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2 \cr & ={0.50}\cdot{0.50}\cdot\biggr(\frac{1.96}{0.02}\biggr)^2 \cr & = 2401 }$$ The minimum sample size required is 2401 measurements.

## Try This!!

Question A recent report by Pew Research Center estimated that 68% of Americans have smartphones and 45% have tablet computers. How large a sample is needed to estimate the true proportion of Americans with smartphones to within 4% with 86% confidence?

Solution:   A preliminary estimate for $\hat{p}$ was given as 0.68, so the estimate for $\hat{q}$ is $\hat{q}=1-\hat{p}=1-0.68=0.32$. We are given $E=4\%=0.04$ and $c=0.86$. Then $$z_{c}=invnorm\Bigg(\dfrac{1}{2}(1+c)\Bigg)=invnorm\Bigg(\dfrac{1}{2}(1+0.86)\Bigg)\doteq1.48$$ and $$\eqalign{ n & =\hat{p}\cdot\hat{q}\cdot\biggr(\frac{z_c}{E}\biggr)^2 \cr & ={0.68}\cdot{0.32}\cdot\biggr(\frac{1.48}{0.04}\biggr)^2 \cr & = 297.8944 \color{red}{\text{ always round up!}}\cr & \doteq 298 }$$ The minimum sample size required is 298 Americans.

# Confidence Interval for a Population Mean $\mu$  (small samples)

Suppose you need to estimate the value of $\mu$ but it is impossible or impractical to collect a large sample. Then the estimation procedure outlined above is of no use. This section introduces the estimation procedure that can be used when the sample size is small. Small sample confidence intervals for binomial proportions will be omitted from our discussion.

# Student's $t$ Distribution

In discussing the sampling distribution of $\bar{x}$, we made these points:

• When the population distribution is normal, $\bar{x}$ and $z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$ both have normal distributions, for any sample size.

• When the original sampled population is not normal, $\bar{x}$,  $z = (\bar{x} - \mu)/(\sigma/\sqrt{n})$, and $z \approx (\bar{x} - \mu)/(s/\sqrt{n})$ all have approximately normal distributions, if the sample size is large $(n\geq30)$.

Unfortunately, when the sample size $n$ is small $(n \text{ less than 30})$, the statistic $(\bar{x} - \mu)/(s/\sqrt{n})$ does not have a normal distribution. Therefore, all the critical values of $z$ that you used before are no longer correct. For example, you cannot say that $\bar{x}$ will lie within 1.96 estimated standard errors of $\mu$ 95% of the time. This problem is not new; it was studied by statisticians and experimenters in the early 1900s. To find the sampling distribution of this statistic, there are two ways to proceed:

• Use an empirical approach. Draw repeated samples and compute $(\bar{x} - \mu)/(s/\sqrt{n})$ for each sample. The relative frequency distribution that you construct using these values will approximate the shape and location of the sampling distribution.
• Use a mathematical approach to derive the actual probability density function (the curve that describes the standardized sampling distribution of $\bar{x}$).

The second approach was used by an Englishman named W. S. Gosset in 1908. He derived a complicated formula for the density function of

$t=\frac{\bar{x} - \mu}{s/\sqrt{n}}$

for random samples of size $n$ from a normal population, and he published his results under the pen name "Student." Ever since, the statistic has been known as Student's $t$. It has the following characteristics:

• The mean, median and mode are all equal to zero.
• The total area under the $t$-distribution curve is 1.
• It is mound-shaped and symmetric about $t = 0$, just like the Standard Normal distribution of $z$.
• It is more variable than $z$ with "heavier tails"; that is, the $t$ curve does not approach the horizontal axis as quickly as $z$ does. This is because the $t$ statistic involves two random quantities, $\bar{x}$ and $s$, whereas the $z$ statistic involves only the sample mean, $\bar{x}$. You can see this phenomenon in the worksheet below.
• The shape of the $t$ distribution depends on the sample size $n$. As $n$ increases, the variability of $t$ decreases because the estimate $s$ of $\sigma$ is based on more and more information. Eventually, when $n$ is infinitely large, the $t$ and $z$ distributions are identical!
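The empirical approach described earlier, and the heavier tails listed above, can be explored with a small simulation. This is a minimal sketch under stated assumptions: the samples come from a standard normal population, and the names `simulate_t` and `share_beyond`, the number of repetitions, and the seed are arbitrary illustrative choices.

```python
import math
import random

def simulate_t(n, reps=20000, mu=0.0, sigma=1.0, seed=1):
    """Draw `reps` samples of size n from a normal population and
    return the statistic (xbar - mu) / (s / sqrt(n)) for each sample."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        values.append((xbar - mu) / (s / math.sqrt(n)))
    return values

def share_beyond(values, cutoff=1.96):
    """Fraction of simulated statistics farther than `cutoff` from 0."""
    return sum(abs(v) > cutoff for v in values) / len(values)

for n in (5, 15, 30):
    print(n, round(share_beyond(simulate_t(n)), 3))
```

If the statistic were standard normal, about 5% of the values would fall beyond $\pm 1.96$. For small $n$ the simulated share is noticeably larger, and it shrinks toward 5% as $n$ grows.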

Gosset was a Guinness Brewery employee who needed a distribution that could be used with small samples taken from a normally distributed population. The Irish brewery where he worked did not allow the publication of research results, so Gosset published under the pseudonym "Student."

The divisor $(n - 1)$ in the formula for the sample variance $s^2$ is called the number of degrees of freedom (df) associated with $s^2$.   It determines the shape of the $t$ distribution. The origin of the term degrees of freedom is theoretical and refers to the number of independent squared deviations in $s^2$ that are available for estimating $\sigma^2$.  These degrees of freedom may change for different applications, and since they specify the correct $t$ distribution to use, you need to remember to calculate the correct degrees of freedom for each application.

The table of normal probabilities for the standard normal $z$ distribution is no longer useful in calculating critical values for the margin of error in your confidence interval formula. Instead, you will use the $t$-table (below). The table body lists critical values of $t_{c}$. The first column of the table is a particular number of degrees of freedom. The top row has a percentage area to the left of a vertical line at $t_c$.


### The percentage at the top of the table is equal to the AREA to the LEFT of the critical value, $t_{cv}$, found in the table body. $$\begin{array} {r|@{\quad}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r@{\,}r} \ \\ df&60.0\%&66.7\%&75.0\%&80.0\%&87.5\%&90.0\%&95.0\%&97.5\%&99.0\%&99.5\% &99.9\% \\ \hline \ \\ 1&0.325&0.577&1.000&1.376&2.414&3.078&6.314&12.706&31.821&63.657&318.31 \\ 2&0.289&0.500&0.816&1.061&1.604&1.886&2.920&4.303&6.965&9.925&22.327 \\ 3&0.277&0.476&0.765&0.978&1.423&1.638&2.353&3.182&4.541&5.841&10.215 \\ 4&0.271&{0.464}&0.741&0.941&1.344&1.533&2.132&2.776&3.747&\color{red}{4.604}&7.173 \\ 5&0.267&0.457&0.727&0.920&1.301&1.476&2.015&2.571&3.365&4.032&5.893 \\ 6&0.265&0.453&0.718&0.906&1.273&1.440&1.943&2.447&3.143&3.707&5.208 \\ 7&0.263&0.449&\color{red}{0.711}&0.896&1.254&1.415&1.895&2.365&2.998&3.499&4.785 \\ 8&0.262&0.447&0.706&0.889&1.240&1.397&1.860&2.306&2.896&3.355&4.501 \\ 9&0.261&0.445&0.703&0.883&1.230&1.383&1.833&2.262&2.821&3.250&4.297 \\ 10&0.260&0.444&0.700&0.879&1.221&\color{red}{1.372}&1.812&2.228&2.764&3.169&4.144 \\ 11&0.260&0.443&0.697&0.876&1.214&1.363&1.796&2.201&2.718&3.106&4.025 \\ 12&0.259&0.442&0.695&0.873&1.209&1.356&1.782&2.179&2.681&3.055&3.930 \\ 13&0.259&0.441&0.694&0.870&1.204&1.350&1.771&2.160&2.650&3.012&3.852 \\ 14&0.258&0.440&0.692&0.868&1.200&1.345&1.761&\color{red}{2.145}&2.624&2.977&3.787 \\ 15&0.258&0.439&0.691&0.866&1.197&1.341&1.753&2.131&2.602&2.947&3.733 \\ 16&0.258&0.439&0.690&0.865&1.194&1.337&1.746&2.120&2.583&2.921&3.686 \\ 17&0.257&0.438&0.689&0.863&1.191&1.333&1.740&2.110&2.567&2.898&3.646 \\ 18&0.257&0.438&0.688&0.862&1.189&1.330&1.734&2.101&2.552&2.878&3.610 \\ 19&0.257&0.438&0.688&0.861&1.187&1.328&1.729&2.093&2.539&2.861&3.579 \\ 20&0.257&0.437&0.687&0.860&1.185&1.325&1.725&2.086&2.528&2.845&3.552 \\ 21&0.257&0.437&0.686&0.859&1.183&1.323&1.721&2.080&2.518&2.831&3.527 \\ 22&0.256&0.437&0.686&0.858&1.182&1.321&\color{red}{1.717}&2.074&2.508&2.819&3.505 \\ 
23&0.256&0.436&0.685&0.858&1.180&1.319&1.714&2.069&2.500&2.807&3.485 \\ 24&0.256&0.436&0.685&0.857&1.179&1.318&1.711&2.064&2.492&2.797&3.467 \\ 25&0.256&0.436&0.684&0.856&1.178&1.316&1.708&2.060&2.485&2.787&3.450 \\ 26&0.256&0.436&0.684&0.856&1.177&1.315&1.706&2.056&2.479&2.779&3.435 \\ 27&0.256&0.435&0.684&0.855&1.176&1.314&1.703&2.052&2.473&2.771&3.421 \\ 28&0.256&0.435&0.683&0.855&1.175&1.313&1.701&2.048&2.467&2.763&3.408 \\ 29&0.256&0.435&0.683&0.854&1.174&1.311&1.699&2.045&2.462&2.756&3.396 \\ 30&0.256&0.435&0.683&0.854&1.173&1.310&1.697&2.042&2.457&2.750&3.385 \\ 35&0.255&0.434&0.682&0.852&1.170&1.306&1.690&2.030&2.438&2.724&3.340 \\ 40&0.255&0.434&0.681&0.851&1.167&1.303&1.684&2.021&2.423&2.704&3.307 \\ 45&0.255&0.434&0.680&0.850&1.165&1.301&1.679&2.014&2.412&2.690&3.281 \\ 50&0.255&0.433&0.679&0.849&1.164&1.299&1.676&2.009&2.403&2.678&3.261 \\ 55&0.255&0.433&0.679&0.848&1.163&1.297&1.673&2.004&2.396&2.668&3.245 \\ 60&0.254&0.433&0.679&0.848&1.162&1.296&1.671&2.000&2.390&2.660&3.232 \\ \infty &0.253&0.431&0.674&0.842&1.150&1.282&1.645&1.960&2.326&2.576&3.090 \end{array}$$

Example 1:   For a $t$ distribution with 10 degrees of freedom, the value of $t$ that has an area 0.90 to its left is found in row 10 in the column marked "90%." You should verify that this is $\color{red}{t=1.372}$

Alternatively, you can find this critical value of $t$ using the $\color{red}{invT}$ (the t inverse) function on the TI83/84+ calculator:

1. Press 2nd then vars.
2. Select the invT function by pressing 4
3. Insert values for area and degrees of freedom, $df$ (where $df=n-1$).
Note that the calculator expects you to input an area left of the unknown value of $t$.
4. Highlight the word "paste" then press enter. This pastes the command "invT(area, df)" over to the home screen.
5. After the command is pasted, press enter again.

Example 2:   Find the $t$ value that represents the $t$-score in the 50th percentile.

Solution:   That value of $t$ will be 0 for any sample size, since every $t$ distribution curve is centered at the origin.

Example 3:   Find the $t$ value that represents the first quartile of $t$-scores. Assume $n=8.$

Solution:   The degrees of freedom that specify the correct $t$ distribution are $df = n - 1 = 7$, so the critical value is in the 7th row of the table. The $t$ value that represents the first quartile of $t$-scores is the value separating the lower 25% of $t$-scores from the upper 75%. The $t$-value we are looking for must be in the lower portion of the distribution, with area 25% to its left, as shown in the figure below. The table does not list negative values, so we use symmetry instead: since the $t$ distribution is symmetric about 0, the value we need is the negative (opposite) of the $t$-value that has an area of $1-0.25 = 0.75$ to its left, or $\color{red}{-t = -0.711}$.

Alternatively, you can find this critical value of $t$ using the $\color{red}{invT}$ (the t inverse) function on the TI83/84+ calculator:

1. Press 2nd then vars.
2. Select the invT function by pressing 4
3. Insert values for area and degrees of freedom, $df$ (where $df=n-1$).
Note that the calculator expects you to input an area left of the unknown value of $t$
4. Highlight the word "paste" then press enter. This pastes the command "invT(area, df)" over to the home screen.
5. After the command is pasted, press enter again.

Example 4:   Suppose you have a sample of size 15 from a normal distribution. Find a value of $t$ such that only 2.5% of all values of $t$ will be smaller.

Solution:   The degrees of freedom that specify the correct $t$ distribution are $df = n - 1 = 14$, and the necessary $t$-value must be in the lower portion of the distribution, with area 2.5% to its left, as shown in the figure below. Since the $t$ distribution is symmetric about 0, this value is simply the negative of the value on the right-hand side with area 0.975 to its left, or $-t = \color{red}{-2.145}$.

Values of $t$ given in the $t$-table are output values from Gosset's complicated formula that produces the standardized sampling distribution curve. These values have been rounded to three decimal places (to the thousandths).
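The table lookups in Examples 1, 3, and 4 can be reproduced numerically. The sketch below is one possible approach, not the calculator's actual algorithm: it evaluates the standard $t$ density, integrates it with Simpson's rule to get the area to the left, and inverts that area by bisection. The names `t_pdf`, `t_cdf`, and `inv_t` are illustrative, and the search is limited to values between $-100$ and $100$, which covers the examples here but not the most extreme table entries for $df = 1$.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log(1 + t * t / df))

def t_cdf(t, df, steps=2000):
    """Area to the left of t: Simpson's rule on [0, |t|] plus symmetry."""
    x = abs(t)
    h = x / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3                # area between 0 and |t|
    return 0.5 + half if t >= 0 else 0.5 - half

def inv_t(area, df, lo=-100.0, hi=100.0):
    """Value of t with the given area to its left, found by bisection."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(inv_t(0.90, 10), 3))    # Example 1: area 0.90 to the left, df = 10
print(round(inv_t(0.25, 7), 3))     # Example 3: first quartile, df = 7
print(round(inv_t(0.025, 14), 3))   # Example 4: area 0.025 to the left, df = 14
```

Like invT, `inv_t` takes the area to the left of the unknown $t$ value, so negative values come out directly without the symmetry step that the table requires.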

Example 6:   Find the value of $t_c$ needed to set up a 90% confidence interval estimate for $\mu$. Assume the sample size is 23.

Solution:   The degrees of freedom that specify the correct $t$ distribution are $df = n - 1 = 23 - 1 =22.$ The area under the standardized sampling distribution of $t$-scores just left of a vertical line at $t_c$ is $\frac{1}{2}(1-c)+c$; or equivalently, $\frac{1}{2}(1+c) = \frac{1}{2}(1+0.90) = 0.95$. Therefore, the value of $t$ we need can be found in row 22 of the table, in the column labeled "95%." This value is $\color{red}{1.717}$.

Alternatively, you can find this critical value of $t$ using the $\color{red}{invT}$ (the t inverse) function on the TI83/84+ calculator:

1. Press 2nd then vars.
2. Select the invT function by pressing 4
3. Insert values for area and degrees of freedom, $df$ (where $df=n-1$).
Note that the calculator expects you to input the value given by the formula $\frac{1}{2}(1+c)$ for area.
4. Highlight the word "paste" then press enter. This pastes the command "invT(area, df)" over to the home screen.
5. After the command is pasted, press enter again.

A common mistake when using the invT function on the calculator is to input the confidence level, $c$, for the area instead of $\frac{1}{2}(1+c)$. Try to avoid this mistake.

### Formula for the $c$-percent Confidence Interval Estimate for a Population Mean $\mu$  ($\sigma$ unknown).

$\bar{x}-E$   <  $\mu$  <  $\bar{x}+E$

where $E= t_{c}\cdot\frac{s}{\sqrt{n}}$ with \eqalign{ n &= \text{sample size,}\cr s &= \text{sample standard deviation,}\cr c&= \text{the confidence level,}\cr df&= n-1 } and $t_{c}$ is the $t$ value in the row of the $t$-table for $df=n-1$ that has an area of $\frac{1}{2}(1-c)+c=\frac{1}{2}(1+c)$ to the left of a vertical line at $t_c$.
For the TI84+ calculator, $t_{c}=invT\Bigg( area, \ df \Bigg)$, where \eqalign{ area &= \frac{1}{2}(1+c)\cr df&= n-1 } Sadly, there is no invT function on the TI83/83+, and some TI84+ don't have the invT command.
When the sample size is large $(n\geq30)$, the critical values in the $t$-table approach the corresponding critical values of $z$ from the standard normal distribution table.
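The interval formula can be written as a short Python sketch. The function names `left_area` and `t_interval` are illustrative, and the sample mean and standard deviation in the usage lines are hypothetical numbers chosen only to show the calculation. The critical value $t_c = 1.717$ is the table entry for $df = 22$ in the 95% column, as found in Example 6.

```python
import math

def left_area(conf):
    """Area to the left of t_c for confidence level conf: (1 + c) / 2."""
    return (1 + conf) / 2

def t_interval(xbar, s, n, t_c):
    """Return (lower, upper) = xbar -/+ E, where E = t_c * s / sqrt(n).

    t_c must be looked up separately, using df = n - 1 and the area
    given by left_area(conf).
    """
    margin = t_c * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical sample: n = 23, xbar = 50.0, s = 6.0, 90% confidence.
print(left_area(0.90))                               # table column to use
low, high = t_interval(xbar=50.0, s=6.0, n=23, t_c=1.717)
print(round(low, 3), round(high, 3))
```

Keeping the table lookup outside the function makes the two inputs that are easiest to get wrong, the degrees of freedom and the left-hand area, explicit at the call site.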

Example:  Suppose that for a random sample of 5 computers at a certain electronics store, the mean repair cost was $\$178$. The sample standard deviation was $\$32$. Assume the population is normally distributed. Construct a 99% confidence interval estimate for the population mean repair cost.