tidytext | robertwwalker.work

Some Basic Text on the Mueller Report

So this Robert Mueller guy wrote a report; I may as well analyse it a bit. First, let me see if I can get hold of the data. I grabbed the report directly from the Department of Justice website. You can follow this link.

    library(tidyverse)
    library(pdftools)
    # Download report from link above
    mueller_report_txt <- pdf_text("../data/report.pdf")
    # Create a tibble of the text with line numbers and pages
    mueller_report <- tibble(
      page = 1:length(mueller_report_txt),
      text = mueller_report_txt) %>%
      separate_rows(text, sep = "\n") %>%
      group_by(page) %>%
      mutate(line = row_number()) %>%
      ungroup() %>%
      select(page, line, text)
    write_csv(mueller_report, "data/mueller_report.
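With a one-row-per-line tibble like the one built in the excerpt, a natural next step with tidytext is to split each line into words and count them. What follows is only a minimal sketch under stated assumptions: `toy_report` is made-up stand-in text, not the actual report, and its columns simply mirror `page`, `line`, and `text` from the snippet.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for mueller_report: same columns, made-up text
toy_report <- tibble(
  page = c(1L, 1L, 2L),
  line = c(1L, 2L, 1L),
  text = c("The report describes the investigation",
           "The investigation examined many documents",
           "The documents were reviewed")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
toy_words <- toy_report %>%
  unnest_tokens(word, text)

# Drop common stop words, then count what is left
toy_counts <- toy_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

toy_counts
```

Because `page` and `line` are carried along by `unnest_tokens()`, the same word table could also be summarised per page rather than for the whole document.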

Trump's Tweets, Part II

Trump’s Tone

A cool post on sentiment analysis can be found here. I will now get at the time series characteristics of his tweets and the sentiment stuff. I start by loading the tmls object that I created in the previous post.

Trump’s Overall Tweeting

What does it look like?

    library(tidyverse)
    library(tidytext)
    library(SnowballC)
    library(tm)
    library(syuzhet)
    library(rtweet)
    load(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/TMLS.RData"))
    names(tml.djt)
    ##  [1] "user_id"              "status_id"
    ##  [3] "created_at"           "screen_name"
    ##  [5] "text"                 "source"
    ##  [7] "display_text_width"   "reply_to_status_id"
    ##  [9] "reply_to_user_id"     "reply_to_screen_name"
    ## [11] "is_quote"             "is_retweet"
    ## [13] "favorite_count"       "retweet_count"
    ## [15] "hashtags"             "symbols"
    ## [17] "urls_url"             "urls_t.
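To make the "time series plus sentiment" idea concrete, here is a rough sketch of one way it could go with syuzhet: score each tweet, then average by day. It is an illustration under assumptions rather than the post's actual method. `toy_tweets` is invented data that only borrows the `created_at` and `text` column names from the listing.

```r
library(dplyr)
library(syuzhet)

# Hypothetical stand-in for tml.djt: only the two columns used here, made-up text
toy_tweets <- tibble(
  created_at = as.POSIXct(c("2018-01-01 08:00:00", "2018-01-01 21:30:00",
                            "2018-01-02 09:15:00"), tz = "UTC"),
  text = c("What a great and wonderful day",
           "This is a terrible disaster",
           "Meeting at noon")
)

# Score each tweet; method = "syuzhet" uses the package's default lexicon
toy_scored <- toy_tweets %>%
  mutate(sentiment = get_sentiment(text, method = "syuzhet"))

# Collapse to one row per calendar day to get a simple time series
toy_daily <- toy_scored %>%
  mutate(day = as.Date(created_at)) %>%
  group_by(day) %>%
  summarise(tweets = n(), mean_sentiment = mean(sentiment))

toy_daily
```

Averaging by day is just one choice; summing, or grouping by week, would give a differently shaped series from the same scores.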

Trump Tweet Word Clouds

Mining Twitter data is rather easy. You have to arrange a developer account with Twitter and set up an app. After that, Twitter gives you a consumer key and secret and an access token and access secret. My tool of choice for this is rtweet because it automagically processes tweet elements and makes them easy to slice and dice. I also played with twitteR, but it was harder to work with for what I wanted.

Scraping EPL Salary Data

EPL Scraping

In a previous post, I scraped some NFL data and learned the structure of Spotrac. Now, I want to scrape the available data on the EPL. The EPL data is organized in a few distinct but potentially linked tables. The basic structure is organized around team folders. Let me begin by isolating those URLs.

    library(rvest)
    library(tidyverse)
    base_url <- "http://www.spotrac.com/epl/"
    read.base <- read_html(base_url)
    team.URL <- read.base %>% html_nodes(".team-name") %>% html_attr('href')
    team.
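The selector step in the excerpt can be tried offline. The sketch below runs the same `html_nodes()` and `html_attr()` pipeline on a small inline HTML fragment instead of the live site. The fragment, the `example.com` URLs, and the `toy_` names are all invented for illustration, and the real page's markup may well differ.

```r
library(rvest)

# Hypothetical HTML fragment; the real page's markup may differ
toy_page <- read_html('
<html><body>
  <a class="team-name" href="http://example.com/epl/team-a/">Team A</a>
  <a class="team-name" href="http://example.com/epl/team-b/">Team B</a>
  <a class="other" href="http://example.com/about/">About</a>
</body></html>')

# Select every element with class "team-name", then pull its href attribute
toy_urls <- toy_page %>%
  html_nodes(".team-name") %>%
  html_attr("href")

toy_urls
```

Only the two links carrying the `team-name` class come back; the third link is ignored because the CSS selector does not match it.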

tidytext is neat! White House Communications

Presidential Press

The language of presidential communications is interesting, and I know very little about text as data. I have a number of applications in mind for these tools, but I have to learn how to use them. What does the website look like?

White House News

The site is split into four parts: all news, articles, presidential actions, and briefings and statements. The first one is a catch-all and the second is news links.