Tags: r, regex, string, data-cleaning

Remove Pattern at Beginning of String but not at End


I have a list of schools, but some of them are ranked. I want to remove the rank of the schools (at the beginning of the string). When the school is ranked, it looks like this:

(3) Trinity

However, there are some schools that have parentheses at the end of their names, like this:

Concordia (Minn.)

So I don't want to remove the parentheses if they are at the end of a string.

I'm not quite sure how to do this, but I'm assuming I'll need regex.

To get my data:

library(dplyr)
library(rvest)
library(purrr)

page_num <- seq(4, 16, by = 1) %>%
  paste("/", sep = "") %>%
  # drop the 10th element; `.` is the piped value (page_num does not exist yet here)
  {.[-10]}

site <- paste("http://www.uscho.com/scoreboard/division-iii-men/20172018/list-",
              page_num, sep = "")

get_opponent <- function(x) {

  read_html(site[x]) %>%
    html_nodes("td:nth-child(2)") %>%
    html_text()

}

opponents <- map(seq(1, length(page_num)), get_opponent) %>%
  unlist() %>%
  tibble()

opponents

Solution

  • We can use sub here, with the following pattern:

    ^\s*\(\d+\)\s*(.*)
    

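    This matches a leading rank in parentheses, with optional whitespace before and after it, and captures the remainder of the string. The replacement (\1, written "\\1" in an R string) then swaps the whole match for that captured remainder. Because the pattern is anchored with ^, parentheses at the end of a name, as in Concordia (Minn.), are left alone.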

    x <- "(3) Trinity"
    result <- sub("^\\s*\\(\\d+\\)\\s*(.*)", "\\1", x)
    result
    
    [1] "Trinity"
    

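Since sub is vectorized, the same call should also work on a whole character vector at once. The sketch below uses a small made-up vector (`schools` is a hypothetical name, not the scraped data) to check that leading ranks are stripped while trailing parentheses survive:

```r
# Hypothetical sample input; the names mirror the cases described in the question.
schools <- c("(3) Trinity", "Concordia (Minn.)", " (12) Concordia (Minn.)", "Trinity")

# Same pattern as in the answer: ^ restricts the match to a "(digits)" rank
# at the start of the string, so a trailing "(Minn.)" never matches.
cleaned <- sub("^\\s*\\(\\d+\\)\\s*(.*)", "\\1", schools)

# Elements without a leading rank are returned unchanged.
# Expected values: "Trinity", "Concordia (Minn.)", "Concordia (Minn.)", "Trinity"
cleaned
```

Strings that do not match the pattern pass through untouched, so it should be safe to run this over a column that mixes ranked and unranked schools.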