Published: May 16, 2018
The packages needed:
library(tidyverse)
library(rvest)
library(stringr)
Sports data is all over the internet. Some of it is in a nice downloadable format. For example, it's possible to download almost all of the information from pro-football-reference (or one of their other sites like <baseball-reference.com>) in CSV format. Other times the data is in a nice format but there's no easy way to export or download it. It's this situation that we'll deal with.
A note about scraping: Scraping is a gray area. How legal is it? On the spectrum of bad things you can do on the internet it's probably worse than creating fake logins but better than piracy. However, this isn't the first tutorial that details how to scrape data from a site. With that said, don't pound servers that don't belong to you. If you're going to scrape a lot of data, at least put a reasonable delay between your requests.
For the example we're going to use NFL data gathered by <teamrankings.com>. The stats are nicely formatted in tables, which makes importing the data into R easy.
The plan is to scrape NFL stats, more precisely the team stats. Luckily, <teamrankings.com> has a minimal design. We can go to the NFL tab in the nav bar and copy the link for "All Stats".
We're going to assign the url to tr_url and use the read_html
function from the rvest package to read the web page into R.
tr_url <- "https://www.teamrankings.com/nfl/team-stats"
tr <- read_html(tr_url)
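Before going further, a quick sanity check that the page actually downloaded can't hurt. A minimal sketch (the exact title text isn't important, only that real text comes back):

```r
library(rvest)

tr_url <- "https://www.teamrankings.com/nfl/team-stats"
tr <- read_html(tr_url)

# The page <title> should come back as real text if the request succeeded
tr %>% html_node("title") %>% html_text()
```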
In my experience most tutorials on web scraping make it out to be easier than it really is. Most focus on one page but don't explain how to scale up: how to deal with data spread across more than one page, how to select the proper pages from many, how to piece together the data from the various sources (or pages), etc. I don't know if this explainer will be much better but I'll try.
The first thing we need to do is look at how the data is laid out on the site. If you go to the link you'll see that the stats are split between many pages. The individual pages are grouped by type, for example, "Total Offense", "Rushing Offense", "Scoring Defense." Expand one of the groups and the links to the individual pages are displayed.
Follow the link under "Scoring Offense" to Points Per Game; the data is in an HTML table.
The table contains the columns Rank, Team, 2017 (the current season), Last 3, Last 1, Home, Away, and 2016 (last season).
Pick another stat from the list. I chose Two Point Conversion Percentage. The columns are the same. Do this a couple more times and it becomes clear that all the team stat pages for the NFL follow this format. So, if we can download the individual tables we shouldn't have much trouble binding them together.
Now we're ready to make a plan to scrape the data.
We're going to take the team stats page we read in earlier, tr, and pass it to html_nodes to select all the anchor ("a") tags. The output is piped into html_attr with argument "href" so we only get the hyperlinks' URLs.
tr_links <- tr %>%
html_nodes("a") %>%
html_attr("href")
head(tr_links,10)
## [1] "/" "/login/"
## [3] "/register/" "/"
## [5] "#" "#"
## [7] "#" "#"
## [9] "#" "/football-pool-picks/"
You can see there's a lot of stuff in there we don't need, like the login link or the links for other sports. One solution is to filter the list for links containing "nfl". But looking at the first results shows that not all "nfl" links point to pages containing stats.
head(tr_links[str_detect(tr_links,"nfl")])
## [1] "/nfl-survivor-pool-picks/" "/nfl-win-picks/"
## [3] "/nfl-ats-picks/" "/nfl-over-under-picks/"
## [5] "/nfl-money-line-picks/" "/nfl-betting-picks/"
Looking at the URLs of the individual stat pages on
https://www.teamrankings.com/nfl/team-stats, we notice the pattern
"nfl/stat" in all team stat URLs. Check how many URLs are in the
tr_links vector and how many of those are NFL stat pages.
print(length(tr_links))
## [1] 694
print(length(tr_links[str_detect(tr_links,"nfl/stat")]))
## [1] 216
About a third of the links are stat related.
Let's create a new vector nfl_links to keep things clear.
nfl_links <- tr_links[str_detect(tr_links,"nfl/stat")]
head(nfl_links)
## [1] "/nfl/stats/" "/nfl/stats/"
## [3] "/nfl/stat/points-per-game" "/nfl/stat/average-scoring-margin"
## [5] "/nfl/stat/yards-per-point" "/nfl/stat/yards-per-point-margin"
Looking at the results we notice another issue. This time the overview
page (the one we're on currently) is listed in the results. There are no
stats on this page, so we need to remove those entries from the vector, but
we're not sure exactly how many times "/nfl/stats/" appears in the
nfl_links vector. We'll use the same method we used above with
str_detect, but this time we'll keep the entries that come back
FALSE.
nfl_links <- nfl_links[!str_detect(nfl_links, "/nfl/stats/")]
# We could do it this way as well (note the negation):
# nfl_links <- nfl_links[!(nfl_links %in% "/nfl/stats/")]
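To see the difference between the two approaches, here's a toy sketch (the vector below is made up for illustration): str_detect matches the pattern anywhere in the string, while %in% requires an exact, whole-string match, so it would miss any variants of the overview URL.

```r
library(stringr)

# Made-up vector for illustration
links <- c("/nfl/stats/",
           "/nfl/stat/points-per-game",
           "/nfl/stats/?season=2016")

# Pattern match: drops anything containing "/nfl/stats/"
links[!str_detect(links, "/nfl/stats/")]

# Exact match: only drops the literal "/nfl/stats/" entry
links[!(links %in% "/nfl/stats/")]
```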
We want to keep the entire process tidy, so let's start by putting the stat links in a dataframe (or in this case a tibble). It might not be clear now, but this will make things a lot easier a few steps from now.
df <- tibble(stat_links = nfl_links)
We could pull the tables from every page but that would be 212 different pages to scrape. IMO that's overkill for the example. Plus, there wouldn't be much structure to the final table. Instead, we'll focus on pulling all the per-play stats. Our final table will contain the per-play stats for each team in every per-play category on the teamrankings site.
The way we accomplish this is by using the same str_detect method as
above. Create a column called is_per_play using mutate; the column
will contain the logical result of str_detect: if the stat_links
column contains "per-play", is_per_play is TRUE. Next, filter to only
the rows where is_per_play is TRUE.
The dataframe is down to 11 rows. These 11 urls are the pages we're going to scrape.
df <- df %>%
mutate(is_per_play = str_detect(stat_links, "per-play")) %>%
filter(is_per_play == TRUE)
glimpse(df)
## Observations: 11
## Variables: 2
## $ stat_links <chr> "/nfl/stat/points-per-play", "/nfl/stat/points-per...
## $ is_per_play <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR...
You've probably noticed that the URLs aren't absolute but relative.
We're going to need to add "https://www.teamrankings.com" to the front
of the URLs in the stat_links column.
Now that we have the correct base URL we paste it to the front of the
stat_links column. Notice that we're still keeping all the results in
the same dataframe.
df <- df %>%
mutate(url = paste0('https://www.teamrankings.com', stat_links))
df %>%
head() %>%
knitr::kable()
| stat_links | is_per_play | url |
|---|---|---|
| /nfl/stat/points-per-play | TRUE | https://www.teamrankings.com/nfl/stat/points-per-play |
| /nfl/stat/points-per-play-margin | TRUE | https://www.teamrankings.com/nfl/stat/points-per-play-margin |
| /nfl/stat/yards-per-play | TRUE | https://www.teamrankings.com/nfl/stat/yards-per-play |
| /nfl/stat/first-downs-per-play | TRUE | https://www.teamrankings.com/nfl/stat/first-downs-per-play |
| /nfl/stat/punts-per-play | TRUE | https://www.teamrankings.com/nfl/stat/punts-per-play |
| /nfl/stat/opponent-points-per-play | TRUE | https://www.teamrankings.com/nfl/stat/opponent-points-per-play |
Now, the next piece of code isn't the best way of doing things. I tried
to come up with ways using list columns in dataframes. I came up with
solutions using map and walk from the purrr package but they were
"too cute" to be practical. IMO the code below is readable and works for
the task at hand, so we'll leave it as is. I got the idea for using
sample with Sys.sleep() from a Bob Rudis blog post.
get_page <- function(url){
  page <- read_html(url)
  # Sleep a random 0.25-2.5 seconds so we don't pound the server
  Sys.sleep(sample(seq(.25, 2.5, .25), 1))
  page
}
page_data <- map(df$url, get_page)
Now we have a list with 11 items where every item is a web page
downloaded via read_html. We can easily extract the html tables from
the pages using html_table. Instead of a loop this time we'll use
map.
tr_data <- map(page_data, html_table)
The structure of tr_data is a little nasty. It's an 11 item list where
every item is a list of length 1.
print(length(tr_data))
## [1] 11
print(map_dbl(tr_data, length))
## [1] 1 1 1 1 1 1 1 1 1 1 1
print(map_chr(tr_data, class))
## [1] "list" "list" "list" "list" "list" "list" "list" "list" "list" "list"
## [11] "list"
Let's convert the list into one tibble. The code below looks ugly but it's relatively easy to follow:

- Use pluck to "pull out" the main table from each item of tr_data.
- We downloaded the eleven pages with get_page using the URLs in df. Pass those same eleven links as the second input to iterate over in map2_df.
- Each item of the tr_data list is converted to a tibble, and a new column is added recording the link each tibble was created from (since tr_data was created from the links in df$stat_links, they line up perfectly).
- Because we used map2_df instead of map2, the eleven tibbles are bound into one.
- Pass the tibble to set_names to make the column names more R friendly.

tr_data <- map(tr_data, pluck, 1) %>%
map2_df(df$stat_links,
~as_tibble(.x) %>%
mutate(stat = .y)) %>%
set_names(c(
'rank',
'team',
'current_seas',
'last_3',
'last_1',
'home',
'away',
'last_seas',
'stat'
))
Let's look at the result. We have 9 columns. current_seas and
last_seas are the stats for 2017 and 2016, respectively. stat is the
type of football statistic referenced. Because the season is only four
weeks in (at the time of writing) let's drop the last_3 and
last_1 columns.
tr_data <- tr_data %>%
select(-last_3, -last_1)
Even after the cleanup this isn't the most efficient format for
analyzing the data. We want the stats to be the columns and the teams to
be repeated four times, for the years and the locations. The text format
of the stat column isn't very R friendly for column names.
glimpse(tr_data)
## Observations: 352
## Variables: 7
## $ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ team <chr> "LA Rams", "New Orleans", "Philadelphia", "New En...
## $ current_seas <dbl> 0.461, 0.448, 0.434, 0.429, 0.420, 0.418, 0.397, ...
## $ home <dbl> 0.396, 0.486, 0.436, 0.416, 0.388, 0.457, 0.421, ...
## $ away <dbl> 0.518, 0.410, 0.431, 0.444, 0.457, 0.383, 0.366, ...
## $ last_seas <dbl> 0.233, 0.424, 0.340, 0.423, 0.390, 0.341, 0.383, ...
## $ stat <chr> "/nfl/stat/points-per-play", "/nfl/stat/points-pe...
Next, clean up the stat names. There's more than one way to accomplish
most things in R. This might not be the best way, but I find it cleaner
than using a function like recode. We're going to create a new
dataframe with the unique values of the stat column as one column and
the new names as the other. Then we'll join that dataframe to tr_data
and remove the original stat column.
First let's look at the stat names.
| raw_stat_names |
|---|
| /nfl/stat/points-per-play |
| /nfl/stat/points-per-play-margin |
| /nfl/stat/yards-per-play |
| /nfl/stat/first-downs-per-play |
| /nfl/stat/punts-per-play |
| /nfl/stat/opponent-points-per-play |
| /nfl/stat/opponent-yards-per-play |
| /nfl/stat/opponent-first-downs-per-play |
| /nfl/stat/opponent-punts-per-play |
| /nfl/stat/penalties-per-play |
| /nfl/stat/opponent-penalties-per-play |
These clearly aren't R friendly column names. Some cleanup is needed.
raw_stat_names <- unique(tr_data$stat)
new_col_names <- raw_stat_names %>%
str_replace_all('/nfl/stat/','') %>%
str_replace_all('-','_') %>%
str_replace_all('points','pts') %>%
str_replace_all('yards','yds') %>%
str_replace_all('opponent','opp') %>%
str_replace_all('per_play','pp')
df_col_names <-
tibble(stat = unique(tr_data$stat),
new_cols = new_col_names)
df_col_names %>%
knitr::kable()
| stat | new_cols |
|---|---|
| /nfl/stat/points-per-play | pts_pp |
| /nfl/stat/points-per-play-margin | pts_pp_margin |
| /nfl/stat/yards-per-play | yds_pp |
| /nfl/stat/first-downs-per-play | first_downs_pp |
| /nfl/stat/punts-per-play | punts_pp |
| /nfl/stat/opponent-points-per-play | opp_pts_pp |
| /nfl/stat/opponent-yards-per-play | opp_yds_pp |
| /nfl/stat/opponent-first-downs-per-play | opp_first_downs_pp |
| /nfl/stat/opponent-punts-per-play | opp_punts_pp |
| /nfl/stat/penalties-per-play | penalties_pp |
| /nfl/stat/opponent-penalties-per-play | opp_penalties_pp |
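To make the replacement chain concrete, here's a minimal sketch running a single raw name through it (magrittr is loaded just for the pipe):

```r
library(stringr)
library(magrittr)  # for %>%

raw <- "/nfl/stat/opponent-points-per-play"

cleaned <- raw %>%
  str_replace_all('/nfl/stat/', '') %>%  # drop the path prefix
  str_replace_all('-', '_') %>%          # hyphens to underscores
  str_replace_all('points', 'pts') %>%
  str_replace_all('yards', 'yds') %>%
  str_replace_all('opponent', 'opp') %>%
  str_replace_all('per_play', 'pp')

cleaned  # "opp_pts_pp"
```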
Note: for the example below we're going to remove the home and away
columns, but you don't need to.
Now we remove the stat column and rename the new column we added to
stat. Next we gather the dataframe by season, creating two new
columns, year and val. This leaves us with a dataframe with four
columns: team, stat, year, val. We want the values in the stat
column to be the new column names, so call spread on stat and use the
values in val. Our new dataframe is 64 rows long (32 teams x 2
seasons) and 13 columns wide (the 11 stats + year + team). Fix the
year column by recoding "current_seas" and "last_seas" to 2017 and 2016.
tr_data <- tr_data %>%
left_join(df_col_names, by = "stat") %>%
select(-stat, -home, -away,-rank) %>%
rename(stat = new_cols) %>%
gather(year, val, current_seas:last_seas) %>%
spread(stat, val) %>%
mutate(year = recode(year,
"last_seas" = 2016,
"current_seas" = 2017)) %>%
arrange(year, team)
glimpse(tr_data)
## Observations: 64
## Variables: 13
## $ team <chr> "Arizona", "Atlanta", "Baltimore", "Buffalo...
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2...
## $ first_downs_pp <dbl> 0.339, 0.385, 0.293, 0.324, 0.302, 0.335, 0...
## $ opp_first_downs_pp <dbl> 0.288, 0.343, 0.296, 0.332, 0.318, 0.318, 0...
## $ opp_penalties_pp <dbl> 0.06, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0...
## $ opp_pts_pp <dbl> 0.355, 0.379, 0.323, 0.371, 0.388, 0.394, 0...
## $ opp_punts_pp <dbl> 0.08, 0.06, 0.09, 0.07, 0.07, 0.07, 0.07, 0...
## $ opp_yds_pp <dbl> 4.8, 5.6, 5.2, 5.6, 5.6, 5.5, 5.4, 5.9, 5.5...
## $ penalties_pp <dbl> 0.05, 0.05, 0.06, 0.06, 0.05, 0.06, 0.04, 0...
## $ pts_pp <dbl> 0.385, 0.550, 0.318, 0.394, 0.351, 0.289, 0...
## $ pts_pp_margin <dbl> 0.030, 0.171, -0.005, 0.023, -0.038, -0.106...
## $ punts_pp <dbl> 0.07, 0.05, 0.07, 0.07, 0.07, 0.07, 0.07, 0...
## $ yds_pp <dbl> 5.4, 6.7, 5.2, 5.6, 5.2, 5.9, 5.4, 5.1, 6.0...
Now our data is in a tidy format and ready for analysis.
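As a quick example of what the tidy format makes easy (a sketch, assuming the pipeline above ran), here's how you might rank the 2017 teams by points-per-play margin:

```r
library(dplyr)

# Top five 2017 teams by points-per-play margin
tr_data %>%
  filter(year == 2017) %>%
  select(team, pts_pp, opp_pts_pp, pts_pp_margin) %>%
  arrange(desc(pts_pp_margin)) %>%
  head(5)
```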
Here's the dataset teamrankings-dataset.csv