Scraping EPL Salary Data

EPL Scraping

In a previous post, I scraped some NFL data and learned the structure of Sportrac. Now, I want to scrape the available data on the EPL. The EPL data is organized in a few distinct but potentially linked tables. The basic structure is organized around team folders. Let me begin by isolating those URLs.

library(rvest)
library(tidyverse)
base_url <- "http://www.spotrac.com/epl/"
read.base <- read_html(base_url)
team.URL <- read.base %>% html_nodes(".team-name") %>% html_attr('href')
team.URL
##  [1] "https://www.spotrac.com/epl/afc-bournemouth/"            
##  [2] "https://www.spotrac.com/epl/arsenal-f.c/"                
##  [3] "https://www.spotrac.com/epl/aston-villa-f.c/"            
##  [4] "https://www.spotrac.com/epl/brighton-hove-albion/"       
##  [5] "https://www.spotrac.com/epl/burnley-f.c/"                
##  [6] "https://www.spotrac.com/epl/chelsea-f.c/"                
##  [7] "https://www.spotrac.com/epl/crystal-palace/"             
##  [8] "https://www.spotrac.com/epl/everton-f.c/"                
##  [9] "https://www.spotrac.com/epl/leicester-city/"             
## [10] "https://www.spotrac.com/epl/liverpool-f.c/"              
## [11] "https://www.spotrac.com/epl/manchester-city-f.c/"        
## [12] "https://www.spotrac.com/epl/manchester-united-f.c/"      
## [13] "https://www.spotrac.com/epl/newcastle-united-f.c/"       
## [14] "https://www.spotrac.com/epl/norwich-city-f.c/"           
## [15] "https://www.spotrac.com/epl/sheffield-united-f.c/"       
## [16] "https://www.spotrac.com/epl/southampton-f.c/"            
## [17] "https://www.spotrac.com/epl/tottenham-hotspur-f.c/"      
## [18] "https://www.spotrac.com/epl/watford/"                    
## [19] "https://www.spotrac.com/epl/west-ham-united-f.c/"        
## [20] "https://www.spotrac.com/epl/wolverhampton-wanderers-f.c/"
# Clean up the URLs to get the team names by themselves.
team.names <- gsub(base_url, "", team.URL)
team.names <- gsub("-f.c", " FC", team.names)
team.names <- gsub("afc", "AFC", team.names)
team.names <- gsub("a.f.c", "AFC", team.names)
# Dashes and slashes need to  removed.
team.names <- gsub("-", " ", team.names)
team.names <- gsub("/", "", team.names)
# Fix FC and AFC for Bournemouth
simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1,1)), substring(s, 2), sep="", collapse=" ")
  }
# Capitalise and trim white space
team.names <- sapply(team.names, simpleCap)
#team.names <- sapply(team.names, trimws)
names(team.names) <- NULL
# Now I have a vector of 20 names.
short.names <- gsub(" FC","", team.names)
short.names <- gsub(" AFC","", short.names)
EPL.names <- data.frame(team.names,short.names,team.URL)
EPL.names
##                                            team.names
## 1             Https:www.spotrac.comeplAFC Bournemouth
## 2                  Https:www.spotrac.comeplarsenal FC
## 3              Https:www.spotrac.comeplaston Villa FC
## 4        Https:www.spotrac.comeplbrighton Hove Albion
## 5                  Https:www.spotrac.comeplburnley FC
## 6                  Https:www.spotrac.comeplchelsea FC
## 7              Https:www.spotrac.comeplcrystal Palace
## 8                  Https:www.spotrac.comepleverton FC
## 9              Https:www.spotrac.comeplleicester City
## 10               Https:www.spotrac.comeplliverpool FC
## 11         Https:www.spotrac.comeplmanchester City FC
## 12       Https:www.spotrac.comeplmanchester United FC
## 13        Https:www.spotrac.comeplnewcastle United FC
## 14            Https:www.spotrac.comeplnorwich City FC
## 15        Https:www.spotrac.comeplsheffield United FC
## 16             Https:www.spotrac.comeplsouthampton FC
## 17       Https:www.spotrac.comepltottenham Hotspur FC
## 18                    Https:www.spotrac.comeplwatford
## 19         Https:www.spotrac.comeplwest Ham United FC
## 20 Https:www.spotrac.comeplwolverhampton Wanderers FC
##                                        short.names
## 1          Https:www.spotrac.comeplAFC Bournemouth
## 2                  Https:www.spotrac.comeplarsenal
## 3              Https:www.spotrac.comeplaston Villa
## 4     Https:www.spotrac.comeplbrighton Hove Albion
## 5                  Https:www.spotrac.comeplburnley
## 6                  Https:www.spotrac.comeplchelsea
## 7           Https:www.spotrac.comeplcrystal Palace
## 8                  Https:www.spotrac.comepleverton
## 9           Https:www.spotrac.comeplleicester City
## 10               Https:www.spotrac.comeplliverpool
## 11         Https:www.spotrac.comeplmanchester City
## 12       Https:www.spotrac.comeplmanchester United
## 13        Https:www.spotrac.comeplnewcastle United
## 14            Https:www.spotrac.comeplnorwich City
## 15        Https:www.spotrac.comeplsheffield United
## 16             Https:www.spotrac.comeplsouthampton
## 17       Https:www.spotrac.comepltottenham Hotspur
## 18                 Https:www.spotrac.comeplwatford
## 19         Https:www.spotrac.comeplwest Ham United
## 20 Https:www.spotrac.comeplwolverhampton Wanderers
##                                                    team.URL
## 1              https://www.spotrac.com/epl/afc-bournemouth/
## 2                  https://www.spotrac.com/epl/arsenal-f.c/
## 3              https://www.spotrac.com/epl/aston-villa-f.c/
## 4         https://www.spotrac.com/epl/brighton-hove-albion/
## 5                  https://www.spotrac.com/epl/burnley-f.c/
## 6                  https://www.spotrac.com/epl/chelsea-f.c/
## 7               https://www.spotrac.com/epl/crystal-palace/
## 8                  https://www.spotrac.com/epl/everton-f.c/
## 9               https://www.spotrac.com/epl/leicester-city/
## 10               https://www.spotrac.com/epl/liverpool-f.c/
## 11         https://www.spotrac.com/epl/manchester-city-f.c/
## 12       https://www.spotrac.com/epl/manchester-united-f.c/
## 13        https://www.spotrac.com/epl/newcastle-united-f.c/
## 14            https://www.spotrac.com/epl/norwich-city-f.c/
## 15        https://www.spotrac.com/epl/sheffield-united-f.c/
## 16             https://www.spotrac.com/epl/southampton-f.c/
## 17       https://www.spotrac.com/epl/tottenham-hotspur-f.c/
## 18                     https://www.spotrac.com/epl/watford/
## 19         https://www.spotrac.com/epl/west-ham-united-f.c/
## 20 https://www.spotrac.com/epl/wolverhampton-wanderers-f.c/

With clean names, I can take each of the scraping tasks in order.

Payroll Data

The teams have payroll information that is broken down into active players, reserves, and loanees. The workflow is first to create the relevant URLs to scrape the payroll data.

team_links <- paste0(team.URL,"payroll/",sep="")

With URLs, I am going to set forth on the task. First, the SelectorGadget and a glimpse of the documents suggests an easy solution. I want to isolate the table nodes and keep the tables. First, a function for the URLs.

data.creator <- function(link) {
read_html(link) %>% html_nodes("table") %>% html_table(header=TRUE, fill=TRUE)
}

Now I want to apply data scraping function to the URLs. Then, I want to name the list items, assess the size of the active roster, and then clean up the relevant data.

EPL.salary <- sapply(team_links, function(x) {data.creator(x)})
names(EPL.salary) <- EPL.names$short.names
team.len <- sapply(seq(1,20), function(x) { dim(EPL.salary[[x]][[1]])[[1]]})
Team <- rep(EPL.names$short.names, team.len)
Players <- sapply(seq(1,20), function(x) { str_split(EPL.salary[[x]][[1]][,1], "\t", simplify=TRUE)[,31]})
Position <- sapply(seq(1,20), function(x) { EPL.salary[[x]][[1]][,2]})
Base.Salary <- sapply(seq(1,20), function(x) { Res <- gsub("£", "", EPL.salary[[x]][[1]][,3]); gsub(",","",Res)})
EPL.Result <- data.frame(Players=unlist(Players), Team=Team, Position=unlist(Position), Base.Salary=unlist(Base.Salary))
EPL.Result$Base.Salary <- str_replace(EPL.Result$Base.Salary, "-", NA_character_)
EPL.Result$Base.Num <- as.numeric(EPL.Result$Base.Salary)
EPL.Result %>% group_by(Position) %>% summarise(Mean.Base.Salary=mean(Base.Num, na.rm=TRUE),sdBS=sd(Base.Num, na.rm = TRUE))
## # A tibble: 4 x 3
##   Position Mean.Base.Salary  sdBS
##   <fct>               <dbl> <dbl>
## 1 D                    25.8  3.77
## 2 F                    24.6  3.80
## 3 GK                   28.9  4.63
## 4 M                    24.8  5.21
EPL.Result %>% group_by(Position,Team) %>% summarise(Mean.Base.Salary=mean(Base.Num, na.rm=TRUE),sdBS=sd(Base.Num, na.rm = TRUE))
## # A tibble: 80 x 4
## # Groups:   Position [4]
##    Position Team                                         Mean.Base.Salary  sdBS
##    <fct>    <fct>                                                   <dbl> <dbl>
##  1 D        Https:www.spotrac.comeplAFC Bournemouth                  25.1  4.30
##  2 D        Https:www.spotrac.comeplarsenal                          24.9  3.59
##  3 D        Https:www.spotrac.comeplaston Villa                      24.9  3.09
##  4 D        Https:www.spotrac.comeplbrighton Hove Albion             26.5  3.96
##  5 D        Https:www.spotrac.comeplburnley                          28.2  2.60
##  6 D        Https:www.spotrac.comeplchelsea                          25.3  2.66
##  7 D        Https:www.spotrac.comeplcrystal Palace                   27.8  3.19
##  8 D        Https:www.spotrac.comepleverton                          27    3.82
##  9 D        Https:www.spotrac.comeplleicester City                   26    5.81
## 10 D        Https:www.spotrac.comeplliverpool                        25    3.42
## # … with 70 more rows

Finally, a little picture to describe spending on the active roster.

fplot <- ggplot(EPL.Result, aes(Base.Num,Team))
gpl <- fplot + geom_jitter(height=0.25, width=0) + facet_wrap(~Position) + labs(x="Base Salary")
gpl

Contracts

The contracts are stored in a different URL structure that is accessible via contracts in the html tree by tean. Firstm I want to paste the names together with links to explore.

team_links <- paste0(team.URL,"contracts/",sep="")

Now I have all the links that I need and can turn to processing the data. This is something of a mess. Let me first grab some data to showcase the problem. In what follows, first I will grab the HTML files.

Base.Contracts <- lapply(team_links, read_html)

Processing them is a bit more difficult. What does the basic table look like?

Base.Contracts[[1]] %>% html_nodes("table") %>% html_table(header=TRUE, fill=TRUE)
## [[1]]
##                  Player (31) Pos. Age
## 1         DefoeJermain Defoe    F  37
## 2       BegovicAsmir Begovic   GK  32
## 3       LermaJefferson Lerma    M  25
## 4        WilsonCallum Wilson    F  27
## 5            KingJoshua King    F  28
## 6              AkeNathan Ake    D  24
## 7             CookSteve Cook    D  28
## 8        SurmanAndrew Surman    M  33
## 9           ArterHarry Arter    M  30
## 10             IbeJordon Ibe    F  24
## 11            CookLewis Cook    M  23
## 12        GoslingDan Gosling    M  30
## 13           SmithAdam Smith    D  28
## 14         FraserRyan Fraser    F  25
## 15    DanielsCharlie Daniels    D  33
## 16 StanislasJunior Stanislas    F  30
## 17           SmithBrad Smith    D  25
## 18        BrooksDavid Brooks    M  22
## 19            RicoDiego Rico    D  26
## 20    SolankeDominic Solanke    F  22
## 21     BillingPhilip Billing    M  23
## 22      FrancisSimon Francis    D  34
## 23        MephamChris Mepham    D  22
## 24          BorucArtur Boruc   GK  39
## 25    RamsdaleAaron Ramsdale   GK  21
## 26       SimpsonJack Simpson    D  23
## 27      SurridgeSam Surridge    F  22
## 28          KellyLloyd Kelly    D  21
## 29         StaceyJack Stacey    D  23
## 30       TraversMark Travers   GK  20
## 31     DanjumaArnaut Danjuma    F  23
##                                                             Contract Terms
## 1  16140000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£16,140,000
## 2  14560000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£14,560,000
## 3  13000000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£13,000,000
## 4  12480000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£12,480,000
## 5  11700000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£11,700,000
## 6  10400000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£10,400,000
## 7    9360000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£9,360,000
## 8    9100000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£9,100,000
## 9    8320000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£8,320,000
## 10   7904000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£7,904,000
## 11   7280000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£7,280,000
## 12   7280000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£7,280,000
## 13   7280000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£7,280,000
## 14   5616000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£5,616,000
## 15   5460000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£5,460,000
## 16   5460000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£5,460,000
## 17   5200000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£5,200,000
## 18   4160000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£4,160,000
## 19   4004000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£4,004,000
## 20   3900000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£3,900,000
## 21   3900000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£3,900,000
## 22   3120000\n\t\t\t\t\t\t\t\t\t\t\t2 yr\n\t\t\t\t\t\t\t\t\t\t\t£3,120,000
## 23   3035000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£3,035,000
## 24   2080000\n\t\t\t\t\t\t\t\t\t\t\t1 yr\n\t\t\t\t\t\t\t\t\t\t\t£2,080,000
## 25      780000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£780,000
## 26      468000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£468,000
## 27                  0\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t-
## 28                  0\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t-
## 29                     0\n\t\t\t\t\t\t\t\t\t\t\t-\n\t\t\t\t\t\t\t\t\t\t\t-
## 30                     0\n\t\t\t\t\t\t\t\t\t\t\t-\n\t\t\t\t\t\t\t\t\t\t\t-
## 31                  0\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t-
##    Avg. Salary Transfer Fee Expires
## 1   £5,380,000            -    2020
## 2   £3,640,000   £9,780,000    2021
## 3   £2,600,000  £25,125,000    2023
## 4   £3,120,000   £3,000,000    2023
## 5   £2,340,000            -    2021
## 6   £2,080,000  £19,380,000    2022
## 7   £2,340,000     £150,000    2021
## 8   £1,820,000     £536,000    2020
## 9   £2,080,000            -    2021
## 10  £1,976,000  £15,000,000    2020
## 11  £1,820,000   £5,950,000    2022
## 12  £1,820,000            -    2021
## 13  £1,820,000            -    2021
## 14  £1,404,000     £400,000    2020
## 15  £1,820,000     £191,000    2020
## 16  £1,820,000            -    2021
## 17  £1,300,000   £3,060,000    2020
## 18  £1,040,000  £10,125,000    2022
## 19  £1,001,000  £10,750,000    2022
## 20  £1,300,000  £24,170,000    2022
## 21    £780,000  £18,810,000    2024
## 22  £1,560,000            -    2020
## 23    £607,000  £15,500,000    2023
## 24  £2,080,000  £10,000,000    2020
## 25    £156,000     £799,000    2021
## 26    £156,000            -    2020
## 27           -            -    2021
## 28           -  £16,870,000       -
## 29           -   £5,070,000       -
## 30           -            -       -
## 31           -  £20,520,000    2024
## 
## [[2]]
##    Player (31) Pos. Age                Contract Terms Avg. Salary Transfer Fee
## 1 Harry Wilson    M  21 1 yr\n\t\t\t\t\t\t\t\t\t\t\t-           -            -
##   Expires
## 1       -

The names and the contract year and terms are going to require parsing. I have chosen the first html that corresponds to Bournemouth; other teams are worse because loan players are in a second table. That impacts the wage bill, perhaps, depending on the arrangement in the loan, but the contract details from the player do not have that team as signatory. This has to be fixed. That is easy enough to fix, there are two embedded tables and I can select the first one. When it comes to the names, there is no easy separation for the first column; I will grab them from nodes in the html.

data.creator <- function(data) { 
  data %>% html_nodes("table") %>% html_table(header=TRUE, fill=TRUE) -> ret.tab
  nrowsm <- dim(ret.tab[[1]])[[1]]
  split.me <- ret.tab[[1]][,4]
  tempdf <- data.frame(matrix(data=gsub("\t|-","",unlist(strsplit(split.me, "\\n"))), nrow=nrowsm, byrow=TRUE))
  names(tempdf) <- c("value","years","value.pds")
  data %>% html_nodes(".player") %>% html_nodes("a") %>% html_text() -> Player.Names
  Player.Names <- Player.Names[c(1:nrowsm)]
  data %>% html_nodes(".player") %>% html_nodes("a") %>% html_attr("href") -> Player.Links
  Player.links <- Player.Links[c(1:nrowsm)]
  data %>% html_nodes(".player") %>% html_nodes("span") %>% html_text() -> Last.Name
  Last.Name <- Last.Name[c(1:nrowsm)]
  names(ret.tab[1][[1]])[c(1:2)] <- c("Player","Position")
#  data.frame(ret.tab[,c(5,6,7)]) 
  return(data.frame(ret.tab[1][[1]],tempdf,Player.Names,Player.links,Last.Name))
}
EPL.Contracts <- lapply(Base.Contracts, data.creator)
names(EPL.Contracts) <- EPL.names$short.names
EPL.Contracts[[1]]
##                       Player Position Age
## 1         DefoeJermain Defoe        F  37
## 2       BegovicAsmir Begovic       GK  32
## 3       LermaJefferson Lerma        M  25
## 4        WilsonCallum Wilson        F  27
## 5            KingJoshua King        F  28
## 6              AkeNathan Ake        D  24
## 7             CookSteve Cook        D  28
## 8        SurmanAndrew Surman        M  33
## 9           ArterHarry Arter        M  30
## 10             IbeJordon Ibe        F  24
## 11            CookLewis Cook        M  23
## 12        GoslingDan Gosling        M  30
## 13           SmithAdam Smith        D  28
## 14         FraserRyan Fraser        F  25
## 15    DanielsCharlie Daniels        D  33
## 16 StanislasJunior Stanislas        F  30
## 17           SmithBrad Smith        D  25
## 18        BrooksDavid Brooks        M  22
## 19            RicoDiego Rico        D  26
## 20    SolankeDominic Solanke        F  22
## 21     BillingPhilip Billing        M  23
## 22      FrancisSimon Francis        D  34
## 23        MephamChris Mepham        D  22
## 24          BorucArtur Boruc       GK  39
## 25    RamsdaleAaron Ramsdale       GK  21
## 26       SimpsonJack Simpson        D  23
## 27      SurridgeSam Surridge        F  22
## 28          KellyLloyd Kelly        D  21
## 29         StaceyJack Stacey        D  23
## 30       TraversMark Travers       GK  20
## 31     DanjumaArnaut Danjuma        F  23
##                                                             Contract.Terms
## 1  16140000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£16,140,000
## 2  14560000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£14,560,000
## 3  13000000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£13,000,000
## 4  12480000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£12,480,000
## 5  11700000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£11,700,000
## 6  10400000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£10,400,000
## 7    9360000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£9,360,000
## 8    9100000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£9,100,000
## 9    8320000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£8,320,000
## 10   7904000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£7,904,000
## 11   7280000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£7,280,000
## 12   7280000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£7,280,000
## 13   7280000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£7,280,000
## 14   5616000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£5,616,000
## 15   5460000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£5,460,000
## 16   5460000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£5,460,000
## 17   5200000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£5,200,000
## 18   4160000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£4,160,000
## 19   4004000\n\t\t\t\t\t\t\t\t\t\t\t4 yr\n\t\t\t\t\t\t\t\t\t\t\t£4,004,000
## 20   3900000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£3,900,000
## 21   3900000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£3,900,000
## 22   3120000\n\t\t\t\t\t\t\t\t\t\t\t2 yr\n\t\t\t\t\t\t\t\t\t\t\t£3,120,000
## 23   3035000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£3,035,000
## 24   2080000\n\t\t\t\t\t\t\t\t\t\t\t1 yr\n\t\t\t\t\t\t\t\t\t\t\t£2,080,000
## 25      780000\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t£780,000
## 26      468000\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t£468,000
## 27                  0\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t-
## 28                  0\n\t\t\t\t\t\t\t\t\t\t\t3 yr\n\t\t\t\t\t\t\t\t\t\t\t-
## 29                     0\n\t\t\t\t\t\t\t\t\t\t\t-\n\t\t\t\t\t\t\t\t\t\t\t-
## 30                     0\n\t\t\t\t\t\t\t\t\t\t\t-\n\t\t\t\t\t\t\t\t\t\t\t-
## 31                  0\n\t\t\t\t\t\t\t\t\t\t\t5 yr\n\t\t\t\t\t\t\t\t\t\t\t-
##    Avg..Salary Transfer.Fee Expires    value years   value.pds     Player.Names
## 1   £5,380,000            -    2020 16140000  3 yr £16,140,000    Jermain Defoe
## 2   £3,640,000   £9,780,000    2021 14560000  4 yr £14,560,000    Asmir Begovic
## 3   £2,600,000  £25,125,000    2023 13000000  5 yr £13,000,000  Jefferson Lerma
## 4   £3,120,000   £3,000,000    2023 12480000  4 yr £12,480,000    Callum Wilson
## 5   £2,340,000            -    2021 11700000  5 yr £11,700,000      Joshua King
## 6   £2,080,000  £19,380,000    2022 10400000  5 yr £10,400,000       Nathan Ake
## 7   £2,340,000     £150,000    2021  9360000  4 yr  £9,360,000       Steve Cook
## 8   £1,820,000     £536,000    2020  9100000  5 yr  £9,100,000    Andrew Surman
## 9   £2,080,000            -    2021  8320000  4 yr  £8,320,000      Harry Arter
## 10  £1,976,000  £15,000,000    2020  7904000  4 yr  £7,904,000       Jordon Ibe
## 11  £1,820,000   £5,950,000    2022  7280000  4 yr  £7,280,000       Lewis Cook
## 12  £1,820,000            -    2021  7280000  4 yr  £7,280,000      Dan Gosling
## 13  £1,820,000            -    2021  7280000  4 yr  £7,280,000       Adam Smith
## 14  £1,404,000     £400,000    2020  5616000  4 yr  £5,616,000      Ryan Fraser
## 15  £1,820,000     £191,000    2020  5460000  3 yr  £5,460,000  Charlie Daniels
## 16  £1,820,000            -    2021  5460000  3 yr  £5,460,000 Junior Stanislas
## 17  £1,300,000   £3,060,000    2020  5200000  4 yr  £5,200,000       Brad Smith
## 18  £1,040,000  £10,125,000    2022  4160000  4 yr  £4,160,000     David Brooks
## 19  £1,001,000  £10,750,000    2022  4004000  4 yr  £4,004,000       Diego Rico
## 20  £1,300,000  £24,170,000    2022  3900000  3 yr  £3,900,000  Dominic Solanke
## 21    £780,000  £18,810,000    2024  3900000  5 yr  £3,900,000   Philip Billing
## 22  £1,560,000            -    2020  3120000  2 yr  £3,120,000    Simon Francis
## 23    £607,000  £15,500,000    2023  3035000  5 yr  £3,035,000     Chris Mepham
## 24  £2,080,000  £10,000,000    2020  2080000  1 yr  £2,080,000      Artur Boruc
## 25    £156,000     £799,000    2021   780000  5 yr    £780,000   Aaron Ramsdale
## 26    £156,000            -    2020   468000  3 yr    £468,000     Jack Simpson
## 27           -            -    2021        0  3 yr                 Sam Surridge
## 28           -  £16,870,000       -        0  3 yr                  Lloyd Kelly
## 29           -   £5,070,000       -        0                        Jack Stacey
## 30           -            -       -        0                       Mark Travers
## 31           -  £20,520,000    2024        0  5 yr               Arnaut Danjuma
##                                      Player.links Last.Name
## 1  https://www.spotrac.com/redirect/player/23836/     Defoe
## 2  https://www.spotrac.com/redirect/player/22625/   Begovic
## 3  https://www.spotrac.com/redirect/player/27878/     Lerma
## 4  https://www.spotrac.com/redirect/player/22694/    Wilson
## 5  https://www.spotrac.com/redirect/player/22685/      King
## 6  https://www.spotrac.com/redirect/player/15521/       Ake
## 7  https://www.spotrac.com/redirect/player/22677/      Cook
## 8  https://www.spotrac.com/redirect/player/22692/    Surman
## 9  https://www.spotrac.com/redirect/player/22674/     Arter
## 10 https://www.spotrac.com/redirect/player/22684/       Ibe
## 11 https://www.spotrac.com/redirect/player/22676/      Cook
## 12 https://www.spotrac.com/redirect/player/22682/   Gosling
## 13 https://www.spotrac.com/redirect/player/22689/     Smith
## 14 https://www.spotrac.com/redirect/player/22680/    Fraser
## 15 https://www.spotrac.com/redirect/player/22678/   Daniels
## 16 https://www.spotrac.com/redirect/player/22691/ Stanislas
## 17 https://www.spotrac.com/redirect/player/22690/     Smith
## 18 https://www.spotrac.com/redirect/player/27877/    Brooks
## 19 https://www.spotrac.com/redirect/player/27879/      Rico
## 20 https://www.spotrac.com/redirect/player/22637/   Solanke
## 21 https://www.spotrac.com/redirect/player/23192/   Billing
## 22 https://www.spotrac.com/redirect/player/22679/   Francis
## 23 https://www.spotrac.com/redirect/player/28460/    Mepham
## 24 https://www.spotrac.com/redirect/player/22660/     Boruc
## 25 https://www.spotrac.com/redirect/player/22661/  Ramsdale
## 26 https://www.spotrac.com/redirect/player/24102/   Simpson
## 27 https://www.spotrac.com/redirect/player/22693/  Surridge
## 28 https://www.spotrac.com/redirect/player/30108/     Kelly
## 29 https://www.spotrac.com/redirect/player/32060/    Stacey
## 30 https://www.spotrac.com/redirect/player/32306/   Travers
## 31 https://www.spotrac.com/redirect/player/32702/   Danjuma

The data now have some junk alongside workable versions of the variables of interest. It is worth noting that the header of the contracts data allows us to verify the size of the table as we picked it up [though I do rename them to allow the rbind to work]. This also suggests a strategy for picking up the rownames that is different than the above method that uses the dimension of the html table. Perhaps I should just gsub the header to recover the integer number of players. To tidy the data, they need to be stacked. A simple do.call and row bind will probably work.

Team.Base <- sapply(EPL.Contracts, dim)[1,]
Team <- rep(as.character(names(Team.Base)),Team.Base)
EPL.Contracts.df <- do.call("rbind",EPL.Contracts)
rownames(EPL.Contracts.df) <- NULL
EPL.Contracts.df$Team <- Team
EPL.Contracts.df$value <- as.numeric(as.character(EPL.Contracts.df$value))
EPL.Contracts.df %>% group_by(Team) %>% summarise(Team.Mean=mean(value, na.rm=TRUE)/1e3, Team.SD=sd(value, na.rm=TRUE)) -> Team.mean
pp <- Team.mean %>% arrange(Team.Mean)
pp$Team <- factor(pp$Team, levels = pp$Team)
pp %>% ggplot(aes(Team.Mean,Team, size=Team.SD)) + geom_point() + labs(x="Avg. Contract (1000s)") -> cplot
cplot

EPL.Contracts.df %>% group_by(Team) %>% summarise(Age.Mean=mean(Age, na.rm=TRUE), Age.SD=sd(Age, na.rm=TRUE)) -> Team.mean
Team.mean %>% ungroup() %>% arrange(., Age.Mean) -> pp
pp$Team <- factor(pp$Team, levels = pp$Team)
pp %>% ggplot(aes(Age.Mean,Team,size=Age.SD)) + geom_point() + labs(x="Age") -> cplot
cplot

Avatar
Robert W. Walker
Associate Professor of Quantitative Methods

My research interests include causal inference, statistical computation and data visualization.

Next
Previous

Related