One- and Two-Sample Means and Proportions
The Basic Idea
Two classes of statistics have known distributions.
- Means have a t distribution.
- Proportions have a normal distribution if the expected number in each category exceeds 5.
The Mean: t
The t distribution

- is entirely defined by degrees of freedom.
- has as its metric the standard error [in this case, of the mean].
The equations: \[ \Large t = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}} \] and \[ \Large \mu = \overline{x} + t(\frac{s}{\sqrt{n}}) \] The confidence interval for the true mean is symmetric about the sample mean, with t defining the number of standard errors of the mean above and below.
The t distribution (df=24)
# 99 percent, then 95, 90, and 80
qt(c(0.005,0.995), df=24)
## [1] -2.79694 2.79694
qt(c(0.025,0.975), df=24)
## [1] -2.063899 2.063899
qt(c(0.05,0.95), df=24)
## [1] -1.710882 1.710882
qt(c(0.1,0.9), df=24)
## [1] -1.317836 1.317836
LED Lifetimes (df=24)
# 99 percent, then 95, 90, and 80
mean(LEDLifetimes$lifetime)+sd(LEDLifetimes$lifetime)/sqrt(25)*qt(c(0.005,0.995), df=24)
## [1] 36881.2 39118.8
mean(LEDLifetimes$lifetime)+sd(LEDLifetimes$lifetime)/sqrt(25)*qt(c(0.025,0.975), df=24)
## [1] 37174.43 38825.57
mean(LEDLifetimes$lifetime)+sd(LEDLifetimes$lifetime)/sqrt(25)*qt(c(0.05,0.95), df=24)
## [1] 37315.64 38684.36
mean(LEDLifetimes$lifetime)+sd(LEDLifetimes$lifetime)/sqrt(25)*qt(c(0.1,0.9), df=24)
## [1] 37472.86 38527.14
The Complete Picture
with(LEDLifetimes, (t.test(lifetime, alternative='two.sided', mu=0.0,
conf.level=.95)))
##
## One Sample t-test
##
## data: lifetime
## t = 94.998, df = 24, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 37174.43 38825.57
## sample estimates:
## mean of x
## 38000
Hypothesis Tests
Assume a true mean \(\mu\). How likely, if \(\mu\) is true, is the sample mean that we obtained? Now \(\mu\), \(\overline{x}\), \(s\), and \(n\) are known. We need only figure out the question and the associated probability. Do we wish to know…

- whether \(\mu\) is equal to some value (two-sided alternative)?
- whether \(\mu\) is greater than some value?
- whether \(\mu\) is less than some value?
We then examine the probability of the complement. How likely is a sample mean of 36800 given \(\mu = 38000\), \(s = 2000\), and a sample size of 16? The t statistic is -2.4. The probabilities are

- 0.0149 with an inequality [one tail],
- 0.0298 with not-equal-to [two tails].
\[ \Large t = \frac{36800 - 38000}{\frac{2000}{\sqrt{16}}} = -2.4 \]
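These probabilities come straight from the t distribution with df = 15; a quick check in R:

t.stat <- (36800 - 38000)/(2000/sqrt(16))   # -2.4
pt(t.stat, df=15)                           # one tail: about 0.0149
2*pt(t.stat, df=15)                         # two tails: about 0.0298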
N=25 [the data]
with(LEDLifetimes, (t.test(lifetime, alternative='two.sided', mu=36800,
conf.level=.95)))
##
## One Sample t-test
##
## data: lifetime
## t = 2.9999, df = 24, p-value = 0.006207
## alternative hypothesis: true mean is not equal to 36800
## 95 percent confidence interval:
## 37174.43 38825.57
## sample estimates:
## mean of x
## 38000
Solving for n
The equations: \[ \Large t = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}} \] and \[ \Large n = \left[\frac{st}{\overline{x} - \mu}\right]^2 \] We multiply the assumed standard deviation by the required t quantile, divide by the acceptable margin of error \((\overline{x} - \mu)\), and square the result. It's not perfect; the realized margin of error depends on the sample, as does \(s\). And it is hard to solve exactly because t depends on \(n\).
We can approximate with a normal; the approximation becomes less wrong as the required sample gets big.
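Because t depends on \(n\), one can start from the normal approximation and iterate until \(n\) stabilizes. A minimal sketch, assuming an illustrative \(s = 2000\) and a desired margin of error of 1000 with 95 percent confidence:

s <- 2000; moe <- 1000; conf <- 0.95          # illustrative values
n <- ceiling((s*qnorm(1 - (1-conf)/2)/moe)^2) # normal starting point
repeat {
  n.new <- ceiling((s*qt(1 - (1-conf)/2, df=n-1)/moe)^2)
  if (n.new == n) break                       # stop when n stabilizes
  n <- n.new
}
n                                             # 18 with these values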
Proportions (with a caveat)
There are restrictions on when this works (np and n(1-p) greater than 5).
It allows us, given data, to ask for a reasonable range of the true probability \(\pi\) or \(p_0\) in a binomial. It is heavily deployed in survey sampling [especially in the polling silly season].
Equations for Proportions
The equations: \[ \Large z = \frac{\hat{p} - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} \] and \[ \Large \pi = \hat{p} + z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
- \(\pi\) – the true probability of a positive response.
- \(\hat{p}\) – the estimated probability/proportion of positive responses.
- \(z\) – quantiles from the standard normal distribution.
- \(n\) – the number of respondents.
Food Satisfaction
- 100 respondents
- 46 satisfied
- What confidence?
- 95 percent lower bound? \(z=-1.645\)
- 95 percent upper bound? \(z=1.645\)
- 95 percent central interval? \(z=(-1.96,1.96)\)
- \(\pi\) is unknown.
\[ \Large \pi = 0.46 + z\sqrt{\frac{0.46(1-0.46)}{100}} \] \[ \Large \pi = 0.46 + z \times 0.05 \]

- What confidence?
- 95 percent lower bound? \(z=-1.645\): 0.378
- 95 percent upper bound? \(z=1.645\): 0.542
- 95 percent central interval? \(z=(-1.96,1.96)\): 0.362, 0.558
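A quick check of these numbers in R; se is just the standard error from the equation above:

p.hat <- 0.46; se <- sqrt(p.hat*(1-p.hat)/100)  # se is about 0.05
p.hat + qnorm(0.05)*se                          # 95 percent lower bound: 0.378
p.hat + qnorm(0.95)*se                          # 95 percent upper bound: 0.542
p.hat + qnorm(c(0.025,0.975))*se                # central interval: 0.362, 0.558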
The Distribution
I reordered the outcomes so that Yes is the first level.
##
## 1-sample proportions test without continuity correction
##
## data: rbind(.Table), null probability 0.5
## X-squared = 0.64, df = 1, p-value = 0.4237
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3656081 0.5573514
## sample estimates:
## p
## 0.46
p.hat <- 0.46
p.hat+qnorm(c(0.025,0.975))*sqrt(p.hat*(1-p.hat)/100)
## [1] 0.3623159 0.5576841
The probability of 0.5
##
## 1-sample proportions test without continuity correction
##
## data: rbind(.Table), null probability 0.5
## X-squared = 0.64, df = 1, p-value = 0.2119
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
## 0.0000000 0.5419527
## sample estimates:
## p
## 0.46
p <- 0.5; p.hat <- 0.46; (p.hat - p)/sqrt(p*(1-p)/100)
## [1] -0.8
pnorm((p.hat - p)/sqrt(p*(1-p)/100))
## [1] 0.2118554
2*pnorm((p.hat - p)/sqrt(p*(1-p)/100))
## [1] 0.4237108
p.hat+qnorm(0.95)*sqrt(p*(1-p)/100)   # the null standard error; p*(1-p) equals p^2 here because p = 0.5
## [1] 0.5422427
The Exact Binomial
The exact binomial test solves the problem this way: what would \(p\) have to be to generate this set of binomial outcomes with the given probability? In this instance, it searches the possible values of p so that

- 46 or more yeses occur with probability 2.5 percent [the lower limit], and then
- 46 or fewer yeses occur with probability 2.5 percent [the upper limit].
pbinom(45, size=100, prob=0.3598434)
## [1] 0.9749999
pbinom(46, size=100, prob=0.5625884)
## [1] 0.02500002
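We could recover these limits numerically, for example with uniroot; a sketch, where the search brackets are just reasonable guesses:

# Solve for the p that puts the observed count at each 2.5 percent tail.
lower <- uniroot(function(p) (1 - pbinom(45, 100, p)) - 0.025, c(0.2, 0.46))$root
upper <- uniroot(function(p) pbinom(46, 100, p) - 0.025, c(0.46, 0.8))$root
c(lower, upper)   # approximately 0.3598 and 0.5626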
##
## Frequency counts (test is for first level):
## Satisfied.
## Yes No
## 46 54
##
## Exact binomial test
##
## data: rbind(.Table)
## number of successes = 46, number of trials = 100, p-value = 0.4841
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3598434 0.5625884
## sample estimates:
## probability of success
## 0.46
The Famous Survey Margin of Error
Find one…
\[ \Large n = \frac{z^2 \, p(1-p)}{MOE^2} \qquad MOE = z\sqrt{\frac{p(1-p)}{n}} \]
sqrt(1.96*0.5/517)
## [1] 0.04353793
sqrt(1.96*0.5/701)
## [1] 0.03738988
sqrt(1.96*0.5/657)
## [1] 0.03862161
sqrt(1.96*0.5/902)
## [1] 0.03296171
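These lines compute \(\sqrt{zp/N}\), a shortcut that nearly matches the exact \(z\sqrt{p(1-p)/N}\) because \(\sqrt{0.98} \approx 0.99\). The exact version for the first poll:

1.96*sqrt(0.5*0.5/517)   # exact z*sqrt(p(1-p)/n): about 0.0431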
If we design an Oregon poll for a 3 percent margin of error with 95 percent confidence:
- The needed z with 95 percent confidence is 1.96
- The MOE is 0.03
- Assume a tie [p=0.5].
- We need 1068. Round up; you cannot sample a fraction of a person.
1.96^2*0.5^2/0.03^2
## [1] 1067.111
From One-Sample to Two
The ideas of confidence intervals and hypothesis tests also extend to comparisons among samples. Before developing these ideas, we need to introduce the key idea of covariance.
Sample covariance is the shared variation in two observed variables, measured in the product metric (xy); populations would substitute \(\mu_{x}, \mu_{y}\) for \(\overline{x}, \overline{y}\):
\[ \Large Cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_{i} - \overline{x})(y_{i} - \overline{y}) \]
The idea of correlation is related; we divide by the standard deviations of the two variables to render it metricless [and bounded between -1 and 1].
\[ \Large Cor(x,y) = \frac{Cov(x,y)}{s_{x}s_{y}} = \frac{\sum_{i=1}^{n} (x_{i} - \overline{x})(y_{i} - \overline{y})}{(n-1)s_{x}s_{y}} \]
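A quick by-hand check against R's built-ins, with made-up illustrative vectors:

x <- c(1, 2, 3, 4, 5); y <- c(2, 1, 4, 3, 5)    # illustrative data
sum((x - mean(x))*(y - mean(y)))/(length(x)-1)  # matches cov(x, y)
cov(x, y)/(sd(x)*sd(y))                         # matches cor(x, y)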
Covariance Matters
We can measure covariance. But the measured covariance depends on the mean.
- If we want to ask if means are different, we must assume something about covariance. Either:
- Independent (Independent sampling)
- Dependent (Paired/Matched Sampling)
Are the units sampled in independent or dependent fashion?
The Equations
They are given in your text on pages 217–221. The same duality exists here between estimating \(\hat{\pi}\) and testing hypotheses about \(\pi\). The handling when one claims no difference differs from the handling when one claims a difference [see section 6.2.4, p. 222].
Simulation renders much of this silly.
Statistics and Simulation
The big idea is that we simulate things because we can. In old-school statistics, the limits were imposed by the difficulty of obtaining analytical solutions. We can use the computer and resampling to sidestep these arcane mathematical troubles.
Resampling a Proportion
This idea applies to proportions based on binary data [and next week, to means]. Take the example of CreditProducts. Let me first show a table of the data and embed the function.
##
## No Yes
## 56 144
The command resample.prop requires a vector of data with two outcomes of whatever form: Yes and No, True and False, Up and Down, 0 and 1, or even 0 and 10000. As long as it is binary, the code will work. It also has a key option, tab.col, which selects whether you want the first (1) or second (2) column of the table. Here, I will use 2 because I want the probability of Yes on credit products.
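The function itself is not shown; here is a minimal sketch of what a bootstrap version might look like (the course's actual resample.prop may differ in its details):

# A sketch: bootstrap the sample proportion from a binary vector x.
resample.prop <- function(x, tab.col=1, reps=10000) {
  lev <- sort(unique(x))                      # fix the category order, as table() would
  replicate(reps, {
    s <- sample(x, length(x), replace=TRUE)   # resample with replacement
    sum(s == lev[tab.col])/length(s)          # proportion in the chosen column
  })
}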
Cred.Prod.Res <- resample.prop(CreditProducts$Credit, tab.col=2)
binom.test(144,200)
##
## Exact binomial test
##
## data: 144 and 200
## number of successes = 144, number of trials = 200, p-value = 4.015e-10
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.6523113 0.7810388
## sample estimates:
## probability of success
## 0.72
prop.test(144,200)
##
## 1-sample proportions test with continuity correction
##
## data: 144 out of 200, null probability 0.5
## X-squared = 37.845, df = 1, p-value = 7.659e-10
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.6514606 0.7799182
## sample estimates:
## p
## 0.72
quantile(Cred.Prod.Res, c(0.025,0.975))
## 2.5% 97.5%
## 0.655000 0.780125
The three intervals are all very similar. They all attempt to capture the same idea.
Two Independent Sample Comparisons
Embed the same logic. First, I want a random \(\pi\) from sample 1. Then I want a random \(\pi\) from sample 2. I want to measure the difference between these two random proportions. Why resample them separately? Because I do not know who matches with whom and, oftentimes, we have samples of different sizes and do not wish to discard relevant information from which to sample these proportions.
Two Independent Proportions
We have a proportions test to examine whether the probability of yes [in a binomial] is the same or different in two samples. First, let me illustrate the workflow.
Now I take two samples of binary data and take a random sample of each, calculate the sample proportions, and subtract one from the other. For this example, let me use data on Defaults. 0 is no default; 1 is a default. I will again need the second column.
table(Defaults$Commercial)
##
## 0 1
## 169 31
table(Defaults$Consumer)
##
## 0 1
## 86 19
The command is resample.ind.prop, which requires two binary vectors as inputs.
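Again the function is not shown; a minimal sketch under the same bootstrap logic (the actual resample.ind.prop may differ), followed by the test call reconstructed from the output's data line:

# A sketch: difference of two independently resampled proportions.
resample.ind.prop <- function(x, y, tab.col=2, reps=10000) {
  lev.x <- sort(unique(x)); lev.y <- sort(unique(y))
  replicate(reps, {
    p.x <- mean(sample(x, length(x), replace=TRUE) == lev.x[tab.col])
    p.y <- mean(sample(y, length(y), replace=TRUE) == lev.y[tab.col])
    p.x - p.y                                 # difference in resampled proportions
  })
}
Def.Res <- resample.ind.prop(Defaults$Commercial, Defaults$Consumer)
prop.test(c(31,19), c(200,105))               # defaults out of trials, per the output below
quantile(Def.Res, c(0.025,0.975))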
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(31, 19) out of c(200, 105)
## X-squared = 0.17549, df = 1, p-value = 0.6753
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.12230938 0.07040462
## sample estimates:
## prop 1 prop 2
## 0.1550000 0.1809524
## 2.5% 97.5%
## -0.11859524 0.06119048
My simulated interval is a bit tighter; the z is an approximation.
Why It Matters: Independence
We want to know if two groups are the same or different in terms of the underlying probability \(\pi\) that describes the binomial. Why? If they are the same, then whatever it is that determines the two groups can be thought of as independent of \(\pi\). Knowing the group does not matter.
An Example: Berkeley
data("UCBAdmissions")
mosaicplot(apply(UCBAdmissions, c(1, 2), sum),
main = "Student admissions at UC Berkeley")
Or, department by department:
opar <- par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))
for(i in 1:6)
mosaicplot(UCBAdmissions[,,i],
xlab = "Admit", ylab = "Sex",
main = paste("Department", LETTERS[i]))
mtext(expression(bold("Student admissions at UC Berkeley")),
outer = TRUE, cex = 1.5)
par(opar)
The Statistics
Now to the statistics. But first, \(\chi^2\). It is a probability distribution, entirely defined by degrees of freedom, and derived from a squared standard normal. If I have to calculate two proportions, I consume two degrees of freedom; if only one proportion need be calculated [meaning they are the same], then only one. The difference in degrees of freedom is one. That is our \(\chi^2\) parameter: df.
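The squared-normal connection is visible in the earlier one-sample output, where z = -0.8 and X-squared = 0.64:

# chi-square with 1 df is a squared standard normal: (-0.8)^2 = 0.64
1 - pchisq(0.64, df=1)   # 0.4237108
2*pnorm(-0.8)            # 0.4237108, the same two-sided p-value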
Let’s Analyse This
table(UCB.Admit$M.F,UCB.Admit$Admit)
##
## No Yes
## Female 1278 557
## Male 1493 1198
prop.table(table(UCB.Admit$M.F,UCB.Admit$Admit), 1)
##
## No Yes
## Female 0.6964578 0.3035422
## Male 0.5548123 0.4451877
# Tests work across the rows.
prop.test(table(UCB.Admit$M.F,UCB.Admit$Admit))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(UCB.Admit$M.F, UCB.Admit$Admit)
## X-squared = 91.61, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.1129887 0.1703022
## sample estimates:
## prop 1 prop 2
## 0.6964578 0.5548123
So women are seen, with 95% confidence, to be rejected by Berkeley 0.113 to 0.170 more often than men, expressed in a difference-in-probability metric. Moreover, the probability of Admission is not independent of Male and Female.
What About by Department?
UCBS.Admit = subset(UCB.Admit, subset=Dept=="A")
table(UCBS.Admit$M.F,UCBS.Admit$Admit)
##
## No Yes
## Female 19 89
## Male 313 512
prop.table(table(UCBS.Admit$M.F,UCBS.Admit$Admit), 1)
##
## No Yes
## Female 0.1759259 0.8240741
## Male 0.3793939 0.6206061
# Tests work across the rows.
prop.test(table(UCBS.Admit$M.F,UCBS.Admit$Admit))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(UCBS.Admit$M.F, UCBS.Admit$Admit)
## X-squared = 16.372, df = 1, p-value = 5.205e-05
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.2877797 -0.1191564
## sample estimates:
## prop 1 prop 2
## 0.1759259 0.3793939
UCBS.Admit = subset(UCB.Admit, subset=Dept=="B")
table(UCBS.Admit$M.F,UCBS.Admit$Admit)
##
## No Yes
## Female 8 17
## Male 207 353
prop.table(table(UCBS.Admit$M.F,UCBS.Admit$Admit), 1)
##
## No Yes
## Female 0.3200000 0.6800000
## Male 0.3696429 0.6303571
# Tests work across the rows.
prop.test(table(UCBS.Admit$M.F,UCBS.Admit$Admit))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(UCBS.Admit$M.F, UCBS.Admit$Admit)
## X-squared = 0.085098, df = 1, p-value = 0.7705
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.2577106 0.1584249
## sample estimates:
## prop 1 prop 2
## 0.3200000 0.3696429
UCBS.Admit = subset(UCB.Admit, subset=Dept=="C")
table(UCBS.Admit$M.F,UCBS.Admit$Admit)
##
## No Yes
## Female 391 202
## Male 205 120
prop.table(table(UCBS.Admit$M.F,UCBS.Admit$Admit), 1)
##
## No Yes
## Female 0.6593592 0.3406408
## Male 0.6307692 0.3692308
# Tests work across the rows.
prop.test(table(UCBS.Admit$M.F,UCBS.Admit$Admit))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(UCBS.Admit$M.F, UCBS.Admit$Admit)
## X-squared = 0.63322, df = 1, p-value = 0.4262
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.03865948 0.09583940
## sample estimates:
## prop 1 prop 2
## 0.6593592 0.6307692
UCBS.Admit = subset(UCB.Admit, subset=Dept=="D")
table(UCBS.Admit$M.F,UCBS.Admit$Admit)
##
## No Yes
## Female 244 131
## Male 279 138
prop.table(table(UCBS.Admit$M.F,UCBS.Admit$Admit), 1)
##
## No Yes
## Female 0.6506667 0.3493333
## Male 0.6690647 0.3309353
# Tests work across the rows.
prop.test(table(UCBS.Admit$M.F,UCBS.Admit$Admit))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(UCBS.Admit$M.F, UCBS.Admit$Admit)
## X-squared = 0.22159, df = 1, p-value = 0.6378
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.08702248 0.05022631
## sample estimates:
## prop 1 prop 2
## 0.6506667 0.6690647
UCBS.Admit = subset(UCB.Admit, subset=Dept=="E")
table(UCBS.Admit$M.F,UCBS.Admit$Admit)
##
## No Yes
## Female 299 94
## Male 138 53
prop.table(table(UCBS.Admit$M.F,UCBS.Admit$Admit), 1)
##
## No Yes
## Female 0.7608142 0.2391858
## Male 0.7225131 0.2774869
# Tests work across the rows.
prop.test(table(UCBS.Admit$M.F,UCBS.Admit$Admit))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(UCBS.Admit$M.F, UCBS.Admit$Admit)
## X-squared = 0.80805, df = 1, p-value = 0.3687
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04181911 0.11842143
## sample estimates:
## prop 1 prop 2
## 0.7608142 0.7225131
UCBS.Admit = subset(UCB.Admit, subset=Dept=="F")
table(UCBS.Admit$M.F,UCBS.Admit$Admit)
##
## No Yes
## Female 317 24
## Male 351 22
prop.table(table(UCBS.Admit$M.F,UCBS.Admit$Admit), 1)
##
## No Yes
## Female 0.92961877 0.07038123
## Male 0.94101877 0.05898123
# Tests work across the rows.
prop.test(table(UCBS.Admit$M.F,UCBS.Admit$Admit))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: table(UCBS.Admit$M.F, UCBS.Admit$Admit)
## X-squared = 0.21824, df = 1, p-value = 0.6404
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.05038231 0.02758231
## sample estimates:
## prop 1 prop 2
## 0.9296188 0.9410188
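The six blocks above can be compressed into a loop; a sketch, assuming the same UCB.Admit data frame used throughout:

# Run the same two-sample proportions test for each department.
for (d in LETTERS[1:6]) {
  tab <- with(subset(UCB.Admit, Dept==d), table(M.F, Admit))
  cat("Department", d, "p-value:", prop.test(tab)$p.value, "\n")
}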
One Case Shows a Difference
And it goes in the opposite direction: women are less likely to be rejected in Department A. The rest suggest Admission is independent of M.F.