Datasaurus Dozen

Last updated on Apr 17, 2021 3 min read tidyTuesday, dataviz

The `datasaurus dozen`

The datasaurus sozen is a fantastic teaching resource for examining the importance of data visualization. Let’s have a look.

datasaurus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-13/datasaurus.csv')

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   dataset = col_character(),
##   x = col_double(),
##   y = col_double()
## )

Two libraries to make our work easy.

library(tidyverse)
library(skimr)

First, the summary statistics.

datasaurus %>% group_by(dataset) %>% skim()

Table 1: Data summary
Name	Piped data
Number of rows	1846
Number of columns	3
_______________________
Column type frequency:
numeric	2
________________________
Group variables	dataset

Variable type: numeric

skim_variable	dataset	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
x	away	1	54.27	16.77	15.56	39.72	53.34	69.15	91.64	▁▇▃▇▁
x	bullseye	1	54.27	16.77	19.29	41.63	53.84	64.80	91.74	▂▆▇▅▂
x	circle	1	54.27	16.76	21.86	43.38	54.02	64.97	85.66	▅▃▇▅▃
x	dino	1	54.26	16.77	22.31	44.10	53.33	64.74	98.21	▅▇▇▅▂
x	dots	1	54.26	16.77	25.44	50.36	50.98	75.20	77.95	▂▁▇▁▅
x	h_lines	1	54.26	16.77	22.00	42.29	53.07	66.77	98.29	▅▇▇▅▁
x	high_lines	1	54.27	16.77	17.89	41.54	54.17	63.95	96.08	▂▅▇▃▁
x	slant_down	1	54.27	16.77	18.11	42.89	53.14	64.47	95.59	▂▅▇▃▁
x	slant_up	1	54.27	16.77	20.21	42.81	54.26	64.49	95.26	▃▆▇▃▂
x	star	1	54.27	16.77	27.02	41.03	56.53	68.71	86.44	▅▇▇▃▆
x	v_lines	1	54.27	16.77	30.45	49.96	50.36	69.50	89.50	▃▇▁▅▁
x	wide_lines	1	54.27	16.77	27.44	35.52	64.55	67.45	77.92	▇▂▁▇▅
x	x_shape	1	54.26	16.77	31.11	40.09	47.14	71.86	85.45	▇▆▁▃▅
y	away	1	47.83	26.94	0.02	24.63	47.54	71.80	97.48	▅▆▃▇▃
y	bullseye	1	47.83	26.94	9.69	26.24	47.38	72.53	85.88	▇▆▃▅▇
y	circle	1	47.84	26.93	16.33	18.35	51.03	77.78	85.58	▇▁▁▂▆
y	dino	1	47.83	26.94	2.95	25.29	46.03	68.53	99.49	▇▇▇▅▆
y	dots	1	47.84	26.93	15.77	17.11	51.30	82.88	94.25	▇▁▇▁▆
y	h_lines	1	47.83	26.94	10.46	30.48	50.47	70.35	90.46	▆▇▇▅▅
y	high_lines	1	47.84	26.94	14.91	22.92	32.50	75.94	87.15	▇▁▁▃▅
y	slant_down	1	47.84	26.94	0.30	27.84	46.40	68.44	99.64	▆▇▇▅▆
y	slant_up	1	47.83	26.94	5.65	24.76	45.29	70.86	99.58	▇▇▇▅▅
y	star	1	47.84	26.93	14.37	20.37	50.11	63.55	92.21	▇▂▂▅▅
y	v_lines	1	47.84	26.94	2.73	22.75	47.11	65.85	99.69	▇▆▇▃▅
y	wide_lines	1	47.83	26.94	0.22	24.35	46.28	67.57	99.28	▇▇▇▅▆
y	x_shape	1	47.84	26.93	4.58	23.47	39.88	73.61	97.84	▇▇▂▆▅

Notice that all of the datasets are nearly identical. But have a look at them.

datasaurus %>% ggplot() + aes(x=x, y=y, color=dataset) + geom_point() + facet_wrap(vars(dataset))

tidyTuesday tidyverse

Robert W. Walker

Associate Professor of Quantitative Methods

My research interests include causal inference, statistical computation and data visualization.

Datasaurus Dozen

The datasaurus dozen

Robert W. Walker

Associate Professor of Quantitative Methods

Related

The `datasaurus dozen`