I have scraped data from the schedule of Albany Women's Basketball team from an espn website and the win/loss column is formatted like this: W 77-70, which means that Albany won 77-70. I want to separate this so that one column shows how many points Albany scored, and how many points the opponent scored.
Here is my code, not sure what to do next:
library(rvest)
library(stringr)
library(tidyr)
w.url <- "http://www.espn.com/womens-college-basketball/team/schedule/_/id/399"
webpage <- read_html(w.url)
w_table <- html_nodes(webpage, 'table')
w <- html_table(w_table)[[1]]
head(w)
w <- w[-(1:2), ]
names(w) <- c("Date", "Opponent", "Score", "Record")
head(w)
You can firstly trim out those rows that are not offering real results by using grepl
function and then use regex for getting specific information:
w <- w[grepl("-", w$Score),]
gsub("^([A-Z])([0-9]+)-([0-9]+).*", "\\1,\\2,\\3", w$Score) %>%
strsplit(., split = ",") %>%
lapply(function(x){
data.frame(
result = x[1],
oponent = ifelse(x[1] == "L", x[2], x[3]),
albany = ifelse(x[1] == "W", x[2], x[3])
)
}) %>%
do.call('rbind',.) %>%
cbind(w,.) -> w2
head(w2)
# Date Opponent Score Record result oponent albany
#3 Fri, Nov 9 @#22 South Florida L74-37 0-1 (0-0) L 74 37
#4 Mon, Nov 12 @Cornell L48-34 0-2 (0-0) L 48 34
#5 Wed, Nov 14 vsManhattan W60-54 1-2 (0-0) W 54 60
#6 Sun, Nov 18 @Rutgers L65-39 1-3 (0-0) L 65 39
#7 Wed, Nov 21 @Monmouth L64-56 1-4 (0-0) L 64 56
#8 Sun, Nov 25 vsHoly Cross L56-50 1-5 (0-0) L 56 50