Search code examples
rweb-scrapinghtml-parsingrvest

Not sure how to separate a column of data that I scraped


I have scraped data from the schedule of Albany Women's Basketball team from an espn website and the win/loss column is formatted like this: W 77-70, which means that Albany won 77-70. I want to separate this so that one column shows how many points Albany scored, and how many points the opponent scored.

Here is my code, not sure what to do next:

library(rvest)
library(stringr)
library(tidyr)

w.url <- "http://www.espn.com/womens-college-basketball/team/schedule/_/id/399"
webpage <- read_html(w.url)

w_table <- html_nodes(webpage, 'table')
w <- html_table(w_table)[[1]]
head(w)

w <- w[-(1:2), ]
names(w) <- c("Date", "Opponent", "Score", "Record")
head(w)

Solution

  • You can firstly trim out those rows that are not offering real results by using grepl function and then use regex for getting specific information:

    w <- w[grepl("-", w$Score),]
    
    gsub("^([A-Z])([0-9]+)-([0-9]+).*", "\\1,\\2,\\3", w$Score) %>% 
      strsplit(., split = ",") %>%
      lapply(function(x){
        data.frame( 
          result  = x[1],
          oponent = ifelse(x[1] == "L", x[2], x[3]),
          albany  = ifelse(x[1] == "W", x[2], x[3])
        )
      }) %>%
      do.call('rbind',.) %>% 
      cbind(w,.) -> w2
    
    head(w2)
    #         Date           Opponent  Score    Record result oponent albany
    #3  Fri, Nov 9 @#22 South Florida L74-37 0-1 (0-0)      L      74     37
    #4 Mon, Nov 12           @Cornell L48-34 0-2 (0-0)      L      48     34
    #5 Wed, Nov 14        vsManhattan W60-54 1-2 (0-0)      W      54     60
    #6 Sun, Nov 18           @Rutgers L65-39 1-3 (0-0)      L      65     39
    #7 Wed, Nov 21          @Monmouth L64-56 1-4 (0-0)      L      64     56
    #8 Sun, Nov 25       vsHoly Cross L56-50 1-5 (0-0)      L      56     50