Socrata is amazingly handy for open data

Last updated on Apr 17, 2021 3 min read R

The Socrata package makes it easy to access API calls built around SODA for open data access. If you try to skip the Socrata part, you usually only get a fraction of the available data.

install.packages("RSocrata")
library(RSocrata)
SchoolSpend <- read.socrata("https://data.oregon.gov/resource/c7av-ntdz.csv")

The first bit of data that I found details various bits about spending and students in Oregon school districts. I want to look at a few basics of this. There is a lot more to plot but this is enough for now.

The Data

I found this on Oregon’s open data portal. What do I have?

library(skimr)
skim(SchoolSpend)

Table 1: Data summary
Name	SchoolSpend
Number of rows	2205
Number of columns	10
_______________________
Column type frequency:
character	3
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
county_name	1	4	10	36
district_number	1	9	33	375
school_year	1	10	10	13

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
district_id	1	2094.94	189.48	1894.0	1999.00	2085.00	2190.00	4131.00	▇▁▁▁▁
operating_cost_per_student	1	10337.76	7762.80	0.0	7252.19	8596.27	10797.39	170210.60	▇▁▁▁▁
student_count	1	2900.14	6105.83	0.0	234.00	872.00	2934.00	55321.00	▇▁▁▁▁
state_student_count	1	552475.19	5910.95	539105.0	549169.00	552161.00	558366.00	561354.00	▂▂▇▃▆
operating_cost	1	22756620.87	49527514.39	9881.0	2458166.78	7089257.97	22617257.11	513891919.14	▇▁▁▁▁
state_operating_cost	1	4337006498.57	697307559.84	948447366.8	3889552066.13	4198534676.94	5144636555.37	5248233458.10	▁▁▁▇▆
state_operating_cost_per_student	1	7839.08	1195.04	1759.3	7044.24	7636.57	9164.69	9384.06	▁▁▁▇▆

How many school districts per county?

library(magrittr); library(hrbrthemes)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

SchoolSpend %>% group_by(county_name, school_year) %>% tally() %>% mutate(school_year = as.Date(school_year, format = "%m/%d/%Y")) %>% filter(school_year == max(school_year)) %>% ggplot() + aes(x=fct_reorder(county_name, n), y=n, fill=county_name) + geom_col() + coord_flip() + guides(fill=FALSE) + labs(x= "County", y="Number of School Districts") + theme_minimal()

By Students?

SchoolSpend %>% group_by(county_name) %>% mutate(school_year = as.Date(school_year, format = "%m/%d/%Y")) %>% filter(school_year == max(school_year)) %>% summarise(Students = sum(student_count), Year = mean(school_year), County = as.factor(county_name)) %>% unique() -> Dat

## `summarise()` has grouped output by 'county_name'. You can override using the `.groups` argument.

ggplot(Dat) + aes(x=fct_reorder(County, -Students), y=Students, fill=county_name) + geom_col() + coord_flip() + guides(fill=FALSE) + labs(x= "County", y="Students") + theme_minimal()

There are a number of other bits of data organized by year and district. There is certainly more to examine, but then I found this.

Voter Registration Data

The database of Voter Registrations in Oregon is also available and easily accessible.

VoterReg <- read.socrata("https://data.oregon.gov/resource/6a4f-ecbi.csv")
VoterReg %>% filter(sysdate == "2020-11-03") %>% group_by(county) %>% summarise(Voters = sum(count_v_id)) %>% ggplot(., aes(x=fct_reorder(county, Voters), y=Voters, label=Voters)) + geom_col(fill="white", color="skyblue") + geom_text(size=2.2) + coord_flip() + labs(x="County", y="Registered Voters") + theme_minimal() -> Plot1
Plot1

library(plotly); library(widgetframe)
ggp1 <- ggplotly(Plot1)
frameWidget(ggp1)

The Balance of Registrations

CurrVR <- VoterReg %>% filter(sysdate == "2020-11-03")
CurrVR$DRE <- "Other"
CurrVR$DRE[CurrVR$party=="Democrat"] <- "Democrat"
CurrVR$DRE[CurrVR$party=="Republican"] <- "Republican"
CurrVR %>% group_by(county) %>% mutate(Voters = sum(count_v_id)) %>% ggplot(., aes(x=fct_reorder(county, Voters), y=Voters, label=Voters)) + geom_col() + geom_text(size=2.2) + coord_flip() + labs(x="County", y="Registered Voters") + theme_minimal()

The Plot by Party

Now let me split these up by grouping and plot them.

CurrVR %>% group_by(county, DRE) %>% summarise(Voters = sum(count_v_id)) %>%
ggplot(.) +
 aes(x = fct_reorder(county, Voters), y=Voters, fill = DRE) +
 geom_col() + scale_fill_viridis_d() +
 coord_flip() +
 theme_minimal() + labs(x="County")

## `summarise()` has grouped output by 'county'. You can override using the `.groups` argument.

Robert W. Walker

Associate Professor of Quantitative Methods

My research interests include causal inference, statistical computation and data visualization.