tidyTuesday beyonce_lyrics
Load the data.
beyonce_lyrics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-29/beyonce_lyrics.csv')
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   line = col_character(),
##   song_id = col_double(),
##   song_name = col_character(),
##   artist_id = col_double(),
##   artist_name = col_character(),
##   song_line = col_double()
## )
str(beyonce_lyrics)
## spec_tbl_df [22,616 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ line : chr [1:22616] "If I ain't got nothing, I got you" "If I ain't got something, I don't give a damn" "'Cause I got it with you" "I don't know much about algebra, but I know 1+1 equals 2" .
The datasaurus dozen
The Datasaurus Dozen is a fantastic teaching resource for showing why data visualization matters. Let’s have a look.
datasaurus <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-10-13/datasaurus.csv')
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   dataset = col_character(),
##   x = col_double(),
##   y = col_double()
## )
Two libraries to make our work easy.
library(tidyverse)
library(skimr)
First, the summary statistics.
datasaurus %>% group_by(dataset) %>% skim()
Table 1: Data summary
Name: Piped data
Number of rows: 1846
Number of columns: 3
Column type frequency: numeric (2)
Group variables: dataset
Variable type: numeric
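The point of comparing grouped summaries can also be sketched offline. The example below is a stand-in, not the datasaurus data: it uses Anscombe's quartet, which ships with base R as `anscombe`, and computes the mean, standard deviation, and correlation for each of its four x/y pairs. The helper name `pair_stats` is my own.

```r
# Stand-in illustration with base R's built-in `anscombe` data
# (columns x1..x4 and y1..y4), not the datasaurus file itself.
pair_stats <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = i,
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    cor_xy = cor(x, y)
  )
}

# One row of summary statistics per x/y pair
stats <- do.call(rbind, lapply(1:4, pair_stats))
print(round(stats, 2))
```

The four rows come out nearly identical even though the scatterplots look nothing alike, which is the same lesson the grouped `skim()` call is meant to set up.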
Spending on Kids
First, let me import the data.
kids <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
# kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Now let me summarise it and show a table of the variables.
summary(kids)
## state variable year raw
## Length:23460 Length:23460 Min. :1997 Min. : -60139
## Class :character Class :character 1st Qu.:2002 1st Qu.: 71985
## Mode :character Mode :character Median :2006 Median : 252002
## Mean :2006 Mean : 1181359
## 3rd Qu.
Beer Distribution
The #tidyTuesday for March 31, 2020 is on beer. The essential elements and a method for pulling the data are shown:
[Image hosted on Imgur]
A Comment on Scraping .pdf
The Tweet
The details on how the data were obtained are a nice overview of scraping .pdf files. The code for doing it is at the bottom of the page. @thomasmock has done a great job commenting his way through it.
The Office
library(tidyverse)
office_ratings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-17/office_ratings.csv')
A First Plot
The number of episodes of The Office by season.
library(janitor)
TableS <- office_ratings %>% tabyl(season)
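# Bar chart of episode counts; guides(fill = "none") replaces the deprecated fill = FALSE
p1 <- TableS %>%
  ggplot(aes(x = as.factor(season), y = n, fill = as.factor(season))) +
  geom_col() +
  labs(x = "Season", y = "Episodes", title = "The Office: Episodes") +
  guides(fill = "none")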
p1
Ratings
How are the various seasons and episodes rated?
# Violin plot of ratings by season with the individual episodes overlaid as points
p2 <- office_ratings %>%
  ggplot(aes(x = as.factor(season), y = imdb_rating, fill = as.factor(season), color = as.factor(season))) +
  geom_violin(alpha = 0.3) +
  geom_point() +
  guides(fill = "none", color = "none") +
  labs(x = "Season", y = "IMDB Rating")
p2
Patchwork
Using patchwork, we can combine multiple plots.
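A minimal sketch of how that combination might look. To keep it self-contained, the two plots here are stand-ins built from the built-in `mtcars` data; the names `left`, `right`, `side_by_side`, and `stacked` are my own. With patchwork loaded, `+` places plots next to each other and `/` stacks them.

```r
library(ggplot2)
library(patchwork)

# Two stand-in plots from mtcars (not the Office data)
left <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Cars")

right <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin(alpha = 0.3) +
  geom_point() +
  labs(x = "Cylinders", y = "Miles per gallon")

side_by_side <- left + right  # one row, two panels
stacked      <- left / right  # one column, two panels

side_by_side
```

The same operators should work on any pair of ggplot objects, so the episode-count and ratings plots could be combined in the same way.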
tidyTuesday on the Carbon Footprint of Feeding the Planet
The tidyTuesday for this week relies on data scraped from the Food and Agricultural Organization of the United Nations. The blog post for obtaining the data can be found on r-tastic. The scraping exercise is nice and easy to follow and explored a case of cleaning up a very messy data structure. I took this exercise as practice for using pivot_wider and pivot_longer.
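As a small sketch of the two reshaping verbs, assuming a toy table rather than the scraped FAO data: the countries, column names, and numbers below are invented purely for illustration. `pivot_longer()` gathers the food columns into name/value pairs, and `pivot_wider()` spreads them back out.

```r
library(tidyr)
library(tibble)

# Toy data: every value here is made up for illustration
wide <- tribble(
  ~country,    ~beef, ~poultry,
  "Country A",    10,       20,
  "Country B",    30,       40
)

# Wide to long: one row per country and food category
long <- pivot_longer(
  wide,
  cols      = -country,
  names_to  = "food_category",
  values_to = "consumption"
)

# Long to wide: recovers the original layout
back <- pivot_wider(
  long,
  names_from  = food_category,
  values_from = consumption
)

print(long)
print(back)
```

The long form is usually the one that ggplot2 and grouped summaries want, while the wide form is easier to read as a table.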
Trees in San Francisco
This week’s data cover trees in San Francisco.
sf_trees <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-28/sf_trees.csv')
library(tidyverse); library(ggmap); library(skimr)
skim(sf_trees)
Table 1: Data summary
Name: sf_trees
Number of rows: 192987
Number of columns: 12
Column type frequency: character (6), Date (1), numeric (5)
Group variables: None
Variable type: character
skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
legal_status   54         1.
First, I wanted to acquire the distribution of letters and then play with that. I embedded the result here. The second step is to import the tidyTuesday data.
library(tidyverse)
Letter.Freq <- data.frame(stringsAsFactors=FALSE,
Letter = c("E", "T", "A", "O", "I", "N", "S", "R", "H", "D", "L", "U",
"C", "M", "F", "Y", "W", "G", "P", "B", "V",
"K", "X", "Q", "J", "Z"),
Frequency = c(12.02, 9.1, 8.12, 7.68, 7.31, 6.95, 6.28, 6.