So this Robert Mueller guy wrote a report
I may as well analyse it a bit.
First, let me see if I can get a hold of the data. I grabbed the report directly from the Department of Justice website. You can follow this link.
library(tidyverse)
library(pdftools)
# Download report from link above
mueller_report_txt <- pdf_text("../data/report.pdf")
# Create a tibble of the text with line numbers and pages
mueller_report <- tibble(
  page = seq_along(mueller_report_txt),
  text = mueller_report_txt
) %>%
  separate_rows(text, sep = "\n") %>%
  group_by(page) %>%
  mutate(line = row_number()) %>%
  ungroup() %>%
  select(page, line, text)
write_csv(mueller_report, "data/mueller_report.
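To see what that pipeline does without downloading the PDF, here is a minimal sketch on a made-up two-page character vector. The name toy_pages and its contents are my own stand-ins for what pdf_text() returns (one string per page), not data from the report.

```r
library(tidyverse)

# Hypothetical stand-in for pdf_text() output: one string per page
toy_pages <- c("Line one\nLine two", "Only line on page two")

toy_report <- tibble(
  page = seq_along(toy_pages),
  text = toy_pages
) %>%
  separate_rows(text, sep = "\n") %>%  # one row per line of text
  group_by(page) %>%
  mutate(line = row_number()) %>%      # restart line numbering on each page
  ungroup() %>%
  select(page, line, text)

toy_report
```

Each page string is split on newlines into one row per line, and the line counter restarts within each page.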
Trump’s Tone
There is a cool post on sentiment analysis here. Now I want to look at the time series characteristics of his tweets and at their sentiment.
I start by loading the tmls object that I created in the previous post.
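As a rough sketch of the idea, and not the analysis from the post itself, the snippet below scores a few made-up tweets with syuzhet and summarises them by day. The toy_tweets tibble is invented for illustration; I only assume it has created_at and text columns like the real timeline does.

```r
library(tidyverse)
library(syuzhet)

# Made-up tweets standing in for the real timeline (assumption: same column names)
toy_tweets <- tibble(
  created_at = as.POSIXct(
    c("2019-01-01 08:00:00", "2019-01-01 21:30:00", "2019-01-02 07:15:00"),
    tz = "UTC"
  ),
  text = c(
    "What a wonderful, happy day",
    "This is a terrible disaster",
    "Meeting at noon"
  )
)

daily_tone <- toy_tweets %>%
  mutate(
    day = as.Date(created_at),                           # collapse timestamps to days
    sentiment = get_sentiment(text, method = "syuzhet")  # one score per tweet
  ) %>%
  group_by(day) %>%
  summarise(tweets = n(), mean_sentiment = mean(sentiment))

daily_tone
```

The result has one row per day, with the number of tweets and their average sentiment score, which is the kind of series one could then plot over time.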
Trump’s Overall Tweeting
What does it look like?
library(tidyverse)
library(tidytext)
library(SnowballC)
library(tm)
library(syuzhet)
library(rtweet)
load(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/TMLS.RData"))
names(tml.djt)
##  [1] "user_id"              "status_id"
##  [3] "created_at"           "screen_name"
##  [5] "text"                 "source"
##  [7] "display_text_width"   "reply_to_status_id"
##  [9] "reply_to_user_id"     "reply_to_screen_name"
## [11] "is_quote"             "is_retweet"
## [13] "favorite_count"       "retweet_count"
## [15] "hashtags"             "symbols"
## [17] "urls_url"             "urls_t.
Mining Twitter Data
Is rather easy. You have to arrange a developer account with Twitter and set up an app. After that, Twitter gives you access to a consumer key and secret and an access token and access secret. My tool of choice for this is rtweet because it automagically processes tweet elements and makes them easy to slice and dice. I also played with twitteR but it was harder to work with for what I wanted.
EPL Scraping
In a previous post, I scraped some NFL data and learned the structure of Spotrac. Now, I want to scrape the available data on the EPL. The EPL data is organized in a few distinct but potentially linked tables. The basic structure is organized around team folders. Let me begin by isolating those URLs.
library(rvest)
library(tidyverse)
base_url <- "http://www.spotrac.com/epl/"
read.base <- read_html(base_url)
team.URL <- read.base %>% html_nodes(".team-name") %>% html_attr('href')
team.
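The link-extraction step can be tried offline on a tiny hand-written page. This is only a sketch: the HTML, the team names, and the example.com URLs below are invented, and I am assuming the real page marks team links with the same .team-name class used above.

```r
library(rvest)
library(tidyverse)

# Invented HTML standing in for the real page (assumption: links carry class "team-name")
toy_page <- read_html('<html><body>
  <a class="team-name" href="https://example.com/epl/team-a/">Team A</a>
  <a class="team-name" href="https://example.com/epl/team-b/">Team B</a>
  <a class="other" href="https://example.com/about/">About</a>
</body></html>')

toy_links <- toy_page %>% html_nodes(".team-name")  # keep only the team links

toy_teams <- tibble(
  team = toy_links %>% html_text(trim = TRUE),  # visible link text
  url  = toy_links %>% html_attr("href")        # target of each link
)

toy_teams
```

Only the two links with the team-name class survive the selector, so the tibble pairs each team with the URL of its folder.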
Presidential Press
The language of presidential communications is interesting and I know very little about text as data. I have a number of applications in mind for these tools but I have to learn how to use them. What does the website look like?
White House News
The site is split into four parts: all news, articles, presidential actions, and briefings and statements. The first is a catch-all and the second is news links.