Trump Tweet Word Clouds

Mining Twitter Data

Is rather easy. You have to arrange a developer account with Twitter and set up an app. After that, Twitter gives you access to a consumer key and secret and an access token and access secret. My tool of choice for this is rtweet because it automagically processes tweet elements and makes them easy to slice and dice. I also played with twitteR but it was harder to work with for what I wanted. The first section involves setting up a token for `rtweet.

# Change the next four lines based on your own consumer_key, consume_secret, access_token, and access_secret. 
token <- create_token(
  app = "MyAppName",
  consumer_key <- "CK",
  consumer_secret <- "CS",
  access_token <- "AT",
  access_secret <- "AS")

Now I want to collect some tweets from a particular user’s timeline and look into them. For this example, I will use @realDonaldTrump.

Who does Trump tweet about?

A cool post on sentiment analysis can be found here. The first step is to grab his timeline. rtweet makes this quite easy. I will grab it and then save it in the code below so that I do not spam the API. I will get at the time series characteristics of his tweets and the sentiment stuff in a further analysis. For now, let me just show some wordclouds.

tml.djt <- get_timeline("realDonaldTrump", n = 3200)
save(tml.djt, file="../data/TMLS.RData")

I start by loading the tmls object that I created above. What does it look like?

library(wordcloud2)
library(tidyverse)
library(tidytext)
library(rtweet)
load(url("https://github.com/robertwwalker/academic-mymod/raw/master/data/TMLS.RData"))
names(tml.djt)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "hashtags"                "symbols"                
## [17] "urls_url"                "urls_t.co"              
## [19] "urls_expanded_url"       "media_url"              
## [21] "media_t.co"              "media_expanded_url"     
## [23] "media_type"              "ext_media_url"          
## [25] "ext_media_t.co"          "ext_media_expanded_url" 
## [27] "ext_media_type"          "mentions_user_id"       
## [29] "mentions_screen_name"    "lang"                   
## [31] "quoted_status_id"        "quoted_text"            
## [33] "quoted_created_at"       "quoted_source"          
## [35] "quoted_favorite_count"   "quoted_retweet_count"   
## [37] "quoted_user_id"          "quoted_screen_name"     
## [39] "quoted_name"             "quoted_followers_count" 
## [41] "quoted_friends_count"    "quoted_statuses_count"  
## [43] "quoted_location"         "quoted_description"     
## [45] "quoted_verified"         "retweet_status_id"      
## [47] "retweet_text"            "retweet_created_at"     
## [49] "retweet_source"          "retweet_favorite_count" 
## [51] "retweet_retweet_count"   "retweet_user_id"        
## [53] "retweet_screen_name"     "retweet_name"           
## [55] "retweet_followers_count" "retweet_friends_count"  
## [57] "retweet_statuses_count"  "retweet_location"       
## [59] "retweet_description"     "retweet_verified"       
## [61] "place_url"               "place_name"             
## [63] "place_full_name"         "place_type"             
## [65] "country"                 "country_code"           
## [67] "geo_coords"              "coords_coords"          
## [69] "bbox_coords"             "status_url"             
## [71] "name"                    "location"               
## [73] "description"             "url"                    
## [75] "protected"               "followers_count"        
## [77] "friends_count"           "listed_count"           
## [79] "statuses_count"          "favourites_count"       
## [81] "account_created_at"      "verified"               
## [83] "profile_url"             "profile_expanded_url"   
## [85] "account_lang"            "profile_banner_url"     
## [87] "profile_background_url"  "profile_image_url"

I want to first get rid of retweets to render President Trump in his own voice.

DJTDF <- tml.djt %>% filter(is_retweet==FALSE)

With just his tweets, a few things can be easily accomplished. Who does he mention?

library(wordcloud)
## Loading required package: RColorBrewer
MNTDJT <- DJTDF %>% filter(!is.na(mentions_screen_name)) %>% select(mentions_screen_name)
Ments <- as.character(unlist(MNTDJT))
TMents <- data.frame(table(Ments))
pal <- brewer.pal(8,"Spectral")
wordcloud(TMents$Ments,TMents$Freq, colors=pal)

That’s interesting. But that is twitter accounts. That is far less interesting that his actual text. I want to look at words and bigrams for this segment.

What does Trump tweet about?

Some more stuff from stack overflow. There is quite a bit of code in here. I simply wrote a function that takes an input character string and cleans it up. Uncomment the various components and pipe them. The sequencing is important and I found this to get everything that I wanted.

library(RColorBrewer)
TDF <- DJTDF %>% select(text)
# TDF contains the text of tweets.
library(stringr)
tweet_cleaner <- function(text) {
  temp1 <- str_replace_all(text, "&amp", "") %>% 
    str_replace_all(., "https://t+", "") %>%
    str_replace_all(.,"@[a-z,A-Z]*","")
#    str_replace_all(., "[[:punct:]]", "")  
#    str_replace_all(., "[[:digit:]]", "") %>%
#    str_replace_all(., "[ \t]{2,}", "") %>%
#    str_replace_all(., "^\\s+|\\s+$", "")  %>%
#    str_replace_all(., " "," ") %>%
#    str_replace_all(., "http://t.co/[a-z,A-Z,0-9]*{8}","")
#    str_replace_all(.,"RT @[a-z,A-Z]*: ","") %>% 
#    str_replace_all(.,"#[a-z,A-Z]*","")
  return(temp1)
}
clean_tweets <- data.frame(text=sapply(1:dim(TDF)[[1]], function(x) {tweet_cleaner(TDF[x,"text"])}))
clean_tweets$text <- as.character(clean_tweets$text)
Trumps.Words <- clean_tweets %>% unnest_tokens(., word, text) %>% anti_join(stop_words, "word")
TTW <- table(Trumps.Words)
TTW <- TTW[order(TTW, decreasing = T)]
TTW <- data.frame(TTW)
names(TTW) <- c("word","freq")
wordcloud(TTW$word, TTW$freq)

Well, that is kinda cool. Now, I want to do a bit more with it using more complicated word combinations.

The Wonders of tidytext

The tidytext section on n-grams is great. I will start with a tweet identifier – something I should have deployed long ago – before parsing these; I will not need this now but it will be encessary when the sentiment stuff comes around.

library(tidyr)
CT <- clean_tweets %>% mutate(tweetno= row_number())
DJT2G <- clean_tweets %>% unnest_tokens(bigram, text, token = "ngrams", n=2)

bigrams_separated <- DJT2G %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts
## # A tibble: 10,514 x 3
##    word1   word2           n
##    <chr>   <chr>       <int>
##  1 fake    news          160
##  2 witch   hunt          128
##  3 north   korea          84
##  4 white   house          71
##  5 news    media          56
##  6 total   endorsement    49
##  7 law     enforcement    47
##  8 crooked hillary        43
##  9 supreme court          39
## 10 border  security       38
## # … with 10,504 more rows
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

my.df <- data.frame(table(bigrams_united))
my.df <- my.df[order(my.df$Freq, decreasing=TRUE),]
my.df <- my.df[c(1:500),]
head(my.df)
##          bigrams_united Freq
## 3599          fake news  160
## 10268        witch hunt  128
## 6583        north korea   84
## 10220       white house   71
## 6517         news media   56
## 9375  total endorsement   49

With that, we have the data for the bigram cloud.

library(wordcloud2)
wordcloud2(my.df, color="random-light", backgroundColor = "black")

After seeing a few competing renditions, I prefer wordcloud2. One thing to be careful about is scaling. In this case, the most frequent bigram is missing because the ratio makes it too large to fit. With size smaller, it can be made to show. It appears that embedding multiple of these in one post does not render. I will stick with the one correct one.

library(wordcloud2)
hhww <- wordcloud2(my.df, color="random-light", backgroundColor = "black", size = 0.5)
library(widgetframe)
## Loading required package: htmlwidgets
frameWidget(hhww, width=600)

Getting this to work with frame widgets is tricky. I started something below but cannot seem to make it work so I am constrained to one wordcloud2 per document because they rely on underlying html rendering.

library(htmlwidgets)
library(webshot)
library(widgetframe)
hw1 <- wordcloud2(my.df, color="random-light", backgroundColor = "black", size = 0.5)
frameWidget(hw1, width=600)

I think that works quite nicely. The use of jpg for shapes has not worked for me. Nor has letterCloud. I found some code on github that will supposedly solve this but it does not seem to work either. It is supposed to render as an htmlwidget but something about that seems not to work properly.

library(htmlwidgets)
library(webshot)
library(widgetframe)
Ments.Tab <- data.frame(table(Ments))
Ments.Tab <- Ments.Tab[order(Ments.Tab$Freq, decreasing=TRUE),]
my.df.short <- my.df[c(1:40),]
hw1 <- letterCloud(Ments.Tab, "@", size=4, color='random-light')
frameWidget(hw1, width=600)
Avatar
Robert W. Walker
Associate Professor of Quantitative Methods

My research interests include causal inference, statistical computation and data visualization.

Next
Previous

Related