A Quick and Dirty Introduction to R

Last updated on Jan 27, 2022 R, tidyverse

Some Data

I will start with some inline data.

# tidyverse supplies the pipe, pivot_longer(), drop_na(), and ggplot2; skimr supplies skim()
library(tidyverse)
library(skimr)
Support.Times <- structure(list(Screened = c(26.9, 28.4, 23.9, 21.8, 22.4, 25.9, 
26.5, 20, 23.7, 23.7, 22.6, 19.4, 27.3, 25.3, 27.7, 25.3, 28.4, 
24.2, 20.4, 29.6, 27, 23.6, 18.3, 28.1, 20.5, 24.1, 27.2, 26.4, 
24.5, 25.6, 17.9, 23.5, 25.3, 20.2, 26.3, 27.9), Not.Screened = c(24.7, 
19.1, 21, 17.8, 22.8, 24.4, 17.9, 20.5, 20, 26.2, 14.5, 22.4, 
21.1, 24.3, 22, 24.3, 23.9, 19.6, 23.8, 29.2, 19.7, 20.9, 25.2, 
22.5, 23.1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -36L))

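The data are in wide format, one column per group, so the next step is to stack them into long format. Base R does this directly with stack(Support.Times); drop_na() from the tidyverse then removes the NA padding in the shorter column.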

stack(Support.Times) %>% drop_na()

The tidyverse does the same job with pivot_longer(). The new data frame SSTimes holds two variables: Self.Screen, which records the group, and Call.Time, which stores the stacked values. The final command, drop_na(), drops the missing data. Then I will group the data by Self.Screen and skim them.

SSTimes <- Support.Times %>%
  pivot_longer(c(Screened, Not.Screened), names_to = "Self.Screen", values_to = "Call.Time") %>%
  drop_na()
SSTimes %>% group_by(Self.Screen) %>% skim()

Table 1: Data summary
Name	Piped data
Number of rows	61
Number of columns	2
_______________________
Column type frequency:
numeric	1
________________________
Group variables	Self.Screen

Variable type: numeric

skim_variable	Self.Screen	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Call.Time	Not.Screened	0	1	22.04	3.11	14.5	20.00	22.4	24.30	29.2	▁▅▇▇▁
Call.Time	Screened	0	1	24.44	3.08	17.9	22.55	24.9	26.92	29.6	▃▃▆▇▅
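The skim output reports missingness and summary statistics but not the number of observations in each group. A minimal base R sketch of that check, counting the non-missing values in each column of the wide data (the name n_per_group is introduced here for illustration only):

```r
# Wide data copied from Support.Times above; the shorter column is padded with 11 NAs.
Support.Times <- data.frame(
  Screened = c(26.9, 28.4, 23.9, 21.8, 22.4, 25.9, 26.5, 20, 23.7, 23.7, 22.6, 19.4,
               27.3, 25.3, 27.7, 25.3, 28.4, 24.2, 20.4, 29.6, 27, 23.6, 18.3, 28.1,
               20.5, 24.1, 27.2, 26.4, 24.5, 25.6, 17.9, 23.5, 25.3, 20.2, 26.3, 27.9),
  Not.Screened = c(24.7, 19.1, 21, 17.8, 22.8, 24.4, 17.9, 20.5, 20, 26.2, 14.5, 22.4,
                   21.1, 24.3, 22, 24.3, 23.9, 19.6, 23.8, 29.2, 19.7, 20.9, 25.2,
                   22.5, 23.1, rep(NA, 11))
)

# !is.na() marks the observed cells; summing each column counts them.
n_per_group <- colSums(!is.na(Support.Times))
n_per_group
```

With the long data, SSTimes %>% count(Self.Screen) should report the same counts.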

So I have 25 observations that are not screened and 36 that are screened. What does it look like?

ggplot(SSTimes, aes(x = Self.Screen, y = Call.Time, fill = Self.Screen)) +
  geom_violin(alpha = 0.2) +
  scale_fill_discrete(guide = "none") +  # guide = FALSE is deprecated
  labs(title = "Self Screen and Non-Self Screen Call Times")

Here is a plot of the two group means with their 95 percent confidence intervals.

gplots::plotmeans(Call.Time~Self.Screen, data=SSTimes, n.label=FALSE, ci.label=TRUE, ylim=c(20,27))

What does the t-test look like?

t.test(Support.Times$Not.Screened, Support.Times$Screened)

## 
##  Welch Two Sample t-test
## 
## data:  Support.Times$Not.Screened and Support.Times$Screened
## t = -2.9793, df = 51.512, p-value = 0.004399
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.0216630 -0.7841148
## sample estimates:
## mean of x mean of y 
##  22.03600  24.43889

t.test(Call.Time~Self.Screen, data=SSTimes)

## 
##  Welch Two Sample t-test
## 
## data:  Call.Time by Self.Screen
## t = -2.9793, df = 51.512, p-value = 0.004399
## alternative hypothesis: true difference in means between group Not.Screened and group Screened is not equal to 0
## 95 percent confidence interval:
##  -4.0216630 -0.7841148
## sample estimates:
## mean in group Not.Screened     mean in group Screened 
##                   22.03600                   24.43889

It is worth noting that R stores more than the printed output in the object that t.test() returns. For example, it stores the standard error of the difference in means, and that is worth looking at in this case: it is 0.8065242. A resampling estimate of the same quantity, computed with ResampleDiffMeans, comes out nearly the same.
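As a rough sketch of where that number comes from, the Welch standard error can be computed by hand as sqrt(s1^2/n1 + s2^2/n2) and compared with the value stored in the test object. This assumes R 3.6.0 or later, where the object returned by t.test() has a stderr element; the names screened, not_screened, se_by_hand, se_stored, and t_by_hand are introduced here for illustration only.

```r
# Data copied from Support.Times above, with the NA padding dropped.
screened <- c(26.9, 28.4, 23.9, 21.8, 22.4, 25.9, 26.5, 20, 23.7, 23.7, 22.6, 19.4,
              27.3, 25.3, 27.7, 25.3, 28.4, 24.2, 20.4, 29.6, 27, 23.6, 18.3, 28.1,
              20.5, 24.1, 27.2, 26.4, 24.5, 25.6, 17.9, 23.5, 25.3, 20.2, 26.3, 27.9)
not_screened <- c(24.7, 19.1, 21, 17.8, 22.8, 24.4, 17.9, 20.5, 20, 26.2, 14.5, 22.4,
                  21.1, 24.3, 22, 24.3, 23.9, 19.6, 23.8, 29.2, 19.7, 20.9, 25.2,
                  22.5, 23.1)

# Welch standard error by hand: each group contributes its variance over its size.
se_by_hand <- sqrt(var(not_screened) / length(not_screened) +
                   var(screened) / length(screened))

# The same quantity as stored in the test object (stderr assumes R >= 3.6.0).
welch <- t.test(not_screened, screened)
names(welch)  # lists every component the object stores
se_stored <- welch$stderr

# The t statistic is the difference in means divided by that standard error.
t_by_hand <- (mean(not_screened) - mean(screened)) / se_by_hand

c(by_hand = se_by_hand, stored = se_stored)
```

The two standard errors should agree, and t_by_hand should reproduce the t statistic printed above.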

Resample.Times <- ResampleProps::ResampleDiffMeans(Support.Times$Screened,Support.Times$Not.Screened)
sd(Resample.Times)

## [1] 0.8073286
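For readers without the ResampleProps package installed, a similar check can be sketched in base R. This is a plain bootstrap, resampling each group with replacement and recording the difference in means; it is an assumption about the general idea and may not match what ResampleDiffMeans does internally. The seed, the number of replicates, and the names boot_diffs and boot_se are arbitrary choices made here.

```r
# Data copied from Support.Times above, with the NA padding dropped.
screened <- c(26.9, 28.4, 23.9, 21.8, 22.4, 25.9, 26.5, 20, 23.7, 23.7, 22.6, 19.4,
              27.3, 25.3, 27.7, 25.3, 28.4, 24.2, 20.4, 29.6, 27, 23.6, 18.3, 28.1,
              20.5, 24.1, 27.2, 26.4, 24.5, 25.6, 17.9, 23.5, 25.3, 20.2, 26.3, 27.9)
not_screened <- c(24.7, 19.1, 21, 17.8, 22.8, 24.4, 17.9, 20.5, 20, 26.2, 14.5, 22.4,
                  21.1, 24.3, 22, 24.3, 23.9, 19.6, 23.8, 29.2, 19.7, 20.9, 25.2,
                  22.5, 23.1)

set.seed(2022)   # arbitrary seed, only so the run is reproducible
n_reps <- 10000  # arbitrary number of bootstrap replicates

# Each replicate resamples both groups with replacement and
# records the difference between the resampled means.
boot_diffs <- replicate(n_reps, {
  mean(sample(screened, replace = TRUE)) -
    mean(sample(not_screened, replace = TRUE))
})

# The standard deviation of the resampled differences estimates the standard error.
boot_se <- sd(boot_diffs)
boot_se
```

The result should land close to the analytic standard error from the t-test, though a plain bootstrap tends to run slightly below it in samples of this size.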

Robert W. Walker

Associate Professor of Quantitative Methods

My research interests include causal inference, statistical computation and data visualization.