How to download html tables with rvest and xml2

Published June 13, 2020

Mark Berman recently tweeted his interview with Ginny Finch, 2016 USA Olympic Boxing team member:

According to insidehook, Fuch tested positive for Letrozole and GW501516, two drugs listed as “Hormone and Metabolic Modulators” on the World Anti-Doping Agency (WADA) Prohibited List.

How many postive tests without negative consequences?

In a post on their website, USADA determined Fuch was not at fault, and would not face any time of ineligibility:

USADA determined that Fuchs’ male partner was using therapeutic doses of letrozole and GW1516 and the low amounts of letrozole metabolite and GW1516 metabolites detected in her sample were consistent with recent exposure to the substances via sexual transmission. Additionally, a WADA-accredited laboratory confirmed that products possessed by Fuchs’ partner contained therapeutic amounts of letrozole and GW1516.

Fuch’s case got me wondering how many athletes have had positive test results and had their athletic careers continue uninterrupted. I decided I would download the USADA data see for myself.

The Data

The USADA sanction data is available on the USADA website.

It’s a searchable database in the browser, but I want to download this into R so I can wrangle and visualize it.

In order to extract data from the website, I’ll be using the rvest package written by Hadley Wickham to scrape the html code. The code chunk below loads the packages needed to reproduce the graphics.

library(tidyverse)
library(rvest)
library(xml2)
library(janitor)
library(ggrepel)

Scraping the USADA website

The website for these data is available here. The rvest::read_html() and rvest::html_nodes() functions extract the content from the table in the Web page and translates it from HTML into a data frame.

USADA_url <- "https://www.usada.org/testing/results/sanctions/"
USADA_extraction <- USADA_url %>%
     xml2::read_html() %>%
     rvest::html_nodes("table")

Store and explore* refers to storing an output of a call to an object, then checking the contents of that new object. This is a great way to get to know R, objet-oriented programming, and how functions work together in packages.

Check the structure of the extraction

Look at the structure of USADA_extraction.

# check the structure of the new USADA_extraction object
USADA_extraction %>% utils::str()
## List of 1
##  $ :List of 2
##   ..$ node:<externalptr> 
##   ..$ doc :<externalptr> 
##   ..- attr(*, "class")= chr "xml_node"
##  - attr(*, "class")= chr "xml_nodeset"

This contains a node and a doc in a List.

Check the class of the extraction

If I check the class of the list we extracted, we find…

USADA_extraction %>% class()
## [1] "xml_nodeset"

…this is an xml_nodeset with 2 lists (stored within the 1 list). After a little bit of exploration, I find the table id in the html_node.

USADA_extraction[[1]]
## {html_node}
## <table id="tablepress-1" class="tablepress tablepress-id-1">
## [1] <thead><tr class="row-1 odd">\n<th class="column-1">Athlete</th>\n<th cla ...
## [2] <tbody class="row-hover">\n<tr class="row-2 even">\n<td class="column-1"> ...

Now I know the data I want from this object is in position [[1]] of this list.

Harvest the html table

I can subset the USADA_extraction list with the rvest::html_table() function and store the table contents in the UsadaRaw object. I check my work using the dplyr::glimpse(70).

UsadaRaw <- rvest::html_table(USADA_extraction[[1]])
UsadaRaw %>% dplyr::glimpse(70)
## Rows: 786
## Columns: 5
## $ Athlete              <chr> "Martin, Jeff", "Millican, Morgan", "H…
## $ Sport                <chr> "Cycling", "Weightlifting", "Track and…
## $ `Substance/Reason`   <chr> "Androgenic Anabolic Steroid", "Clomip…
## $ `Sanction Terms`     <chr> "2-Year Suspension; Loss of Results", …
## $ `Sanction Announced` <chr> "10/01/2020", "09/29/2020", "09/24/202…

why dplyr::glimpse(70)? It prints less to the screen and keeps the col width to <80, which is nice for working in plain text.

This reveals a data frame with 786 observations. The contents from the HTML list (USADA_extraction) has been converted to a data frame (UsadaRaw). I’m going to store this data frame as a .csv in a data/raw folder (so I don’t have to scrape it every time I run this script).

# create raw data path
fs::dir_create("data/raw/")
# export the .csv file
write_csv(as.data.frame(UsadaRaw),
          base::paste0("data/raw/",
                       base::noquote(lubridate::today()),
                           "-UsadaRaw.csv"))

Wrangle the sanctions and substance/reason

The clean_names() function from the janitor package lets me quickly format all of the column names in the UsadaRaw data frame. This function has a case argument that allows me to select how I want to format the column names.

The purrr package is excellent for iteration and functional programming. Both of these topics are too large to cover in this post, but I will demonstrate an example with the purrr::map_df() function:

purrr::map_df() takes a data frame (.x) and in input, and applies a function (.f) across it.

Usada <- UsadaRaw %>% janitor::clean_names(case = "snake")
# lowercase text -----
Usada <- purrr::map_df(.x = Usada, .f = stringr::str_to_lower)
utils::head(Usada)
## # A tibble: 6 x 5
##   athlete     sport     substance_reason    sanction_terms      sanction_announ…
##   <chr>       <chr>     <chr>               <chr>               <chr>           
## 1 martin, je… cycling   androgenic anaboli… 2-year suspension;… 10/01/2020      
## 2 millican, … weightli… clomiphene          2-year suspension;… 09/29/2020      
## 3 harroufi, … track an… androgenic anaboli… 8-year suspension;… 09/24/2020      
## 4 fritzen pr… mixed ma… ostarine            6-month suspension  09/21/2020      
## 5 mccullough… weightli… higenamine          20-month suspensio… 09/17/2020      
## 6 gonzalez v… mixed ma… stanozolol; drosta… 2-year suspension   09/11/2020

Filter out the “no fault or negligence” sanction terms

Now I need to identify the athletes with the no fault or negligence outcome in the sanction_terms column. I can do this with dplyr::filter() and stringr::str_detect().

I also need to clean up the track and field responses in the sport column with some help from dplyr::mutate() and stringr::str_replace_all().

I will get a quick count and visualize these with

UsadaNoFaults <- Usada %>%
  dplyr::filter(stringr::str_detect(string = sanction_terms,
                                    pattern = "^no fault or negligence$")) %>%
                                      # replace & with and
  dplyr::mutate(sport = stringr::str_replace_all(string = sport,
                                         # replace this
                                       pattern = "&",
                                         # with this
                                       replacement = "and"))

Graph sanction terms and substances

Now we can create a quick graph of the most common no fault or negligence sanction terms, and label these by the type of substance.

label_data <- UsadaNoFaults %>%
  # count up the sport and sanction_terms
  dplyr::count(sport, sanction_terms, sort = TRUE) %>%
  # ungroup them
  dplyr::ungroup() %>%
  # join back to data
  dplyr::inner_join(x = ., y = UsadaNoFaults,
                    by = c("sport", "sanction_terms")) %>%
  # sort by n
  dplyr::arrange(desc(n)) %>%
  # get the top 15 rows
  dplyr::slice(1:15)

UsadaNoFaults %>%
  # count up the sport and sanction_terms
  dplyr::count(sport, sanction_terms, sort = TRUE) %>%
  # ungroup them
  dplyr::ungroup() %>%
  # top 5
  utils::head(5) %>%
  # plot these by count and sport
  ggplot2::ggplot(aes(x = stats::reorder(sport, n),
                      y = n,
                      group = sanction_terms)) +
  ggplot2::geom_point(aes(color = sanction_terms,
                          size = n),
                      show.legend = FALSE) +
  ggrepel::geom_text_repel(data = label_data,
                           aes(label = substance_reason, size = 2),
                           show.legend = FALSE) +
  ggplot2::labs(y = "Number of 'no fault or negligence' sanctions",
                x = NULL,
                title = "Positive tests without consequences",
                subtitle = "Top 5 sports, athletes, and substances with 'no fault or negligence' terms",
                caption = "source: https://www.usada.org/news/sanctions/") +
  ggplot2::coord_flip() +
  ggplot2::theme_classic(base_size = 9,
                         base_family = "Ubuntu")

This looks like MMA and Track and Field lead the way in term of no fault or negligence, and most of these are meldonium and clenbuterol..