I have a list of schools, but some of them are ranked. I want to remove the rank of the schools (at the beginning of the string). When the school is ranked, it looks like this:
(3) Trinity
However, there are some schools that have parentheses at the end of their names, like this:
Concordia (Minn.)
So I don't want to remove parentheses when they are at the end of a string.
I'm not quite sure how to do this, but I'm assuming I'll need regex.
To get my data:
library(dplyr)
library(rvest)
library(purrr)
page_num <- seq(4, 16, by = 1) %>%
  paste("/", sep = "") %>%
  {.[-10]}  # drop the 10th entry ("13/"); `.` is the piped value, since page_num does not exist yet
site <- paste("http://www.uscho.com/scoreboard/division-iii-men/20172018/list-", page_num, sep = "")
get_opponent <- function(x) {
  read_html(site[x]) %>%
    html_nodes("td:nth-child(2)") %>%
    html_text()
}
opponents <- map(seq_along(page_num), get_opponent) %>%
  unlist() %>%
  tibble()
opponents
We can use sub here, with the following pattern:
^\s*\(\d+\)\s*(.*)
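This matches a leading rank, with possible whitespace before and after it, then matches and captures the remainder of the string. sub then replaces the entire match with the captured group (\1), leaving only the school name. Because the pattern is anchored to the start of the string with ^, parentheses at the end of a name, as in Concordia (Minn.), are left untouched, and strings without a leading rank are returned unchanged.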
x <- "(3) Trinity"
result <- sub("^\\s*\\(\\d+\\)\\s*(.*)", "\\1", x)
result
[1] "Trinity"
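Since sub is vectorized, the same idea should carry over to a whole character vector. Below is a minimal sketch on a made-up vector (schools and its values are only illustrative, not taken from the scraped data). It uses a slightly shorter variant of the pattern that drops the capture group and simply replaces the leading rank with an empty string:

```r
# Hypothetical sample: ranked names, a name with trailing parentheses,
# a ranked name with leading whitespace, and a plain name.
schools <- c("(3) Trinity", "Concordia (Minn.)", " (12) Adrian", "Hamilton")

# Remove only a rank at the very start of each string; ^ anchors the match,
# so parentheses later in the name are never touched.
cleaned <- sub("^\\s*\\(\\d+\\)\\s*", "", schools)
cleaned
```

Names that do not start with a rank come back unchanged, so it should be safe to run this over the full column of opponents.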