
Need to separate strings into multiple variables based on numeric versus non-numeric


I have a data frame with one variable. It looks something like this:

df <- data.frame(c("25 Edgemont 52 Sioux County", "57 Burke 88 Papillion-LaVista South"))

For context, each row is a basketball game score. I would like to split each string into four data frame columns, separating the scores from the team names. For example, the first row would end up with "25" in the first column, "Edgemont" in the second, "52" in the third, and "Sioux County" in the fourth.

I've tried the below and various SO suggestions but can't get the desired results:

df2 <- strsplit(gsub("([0-9]*)([a-z]*)([0-9]*)([a-z]*)", "\\1 \\2 \\3 \\4", df), " ")

Solution

  • 1) dplyr/tidyr Wrap each number in semicolons with gsub, then use separate to split on the semicolons together with any surrounding whitespace. The NA in the vector of column names tells separate to drop the empty piece in front of the first semicolon.

    library(dplyr)
    library(tidyr)
    
    # input
    df <- data.frame(V1 = c("25 Edgemont 52 Sioux County", 
                            "57 Burke 88 Papillion-LaVista South"))
    
    df %>%
      mutate(V1 = gsub("(\\d+)", ";\\1;", V1)) %>%
      separate(V1, c(NA, "No1", "Let1", "No2", "Let2"), sep = " *; *")
    ##   No1       Let1 No2                     Let2
    ## 1  25  Edgemont   52             Sioux County
    ## 2  57     Burke   88  Papillion-LaVista South
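
To see why the separator in (1) works, it can help to look at the intermediate string. The following is a minimal base R sketch of the same two steps, with strsplit standing in for separate; the names x, marked and parts are only for illustration.

```r
# Same input strings as in the question
x <- c("25 Edgemont 52 Sioux County",
       "57 Burke 88 Papillion-LaVista South")

# Step 1: wrap every run of digits in semicolons
marked <- gsub("(\\d+)", ";\\1;", x)
marked[1]
## [1] ";25; Edgemont ;52; Sioux County"

# Step 2: split on each semicolon plus any spaces around it
parts <- strsplit(marked, " *; *")
parts[[1]]
## [1] ""             "25"           "Edgemont"     "52"           "Sioux County"
```

The first piece is the empty string in front of the leading semicolon, which is the piece that (1) discards by naming it NA.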
    

    1a) read.table The same gsub as in (1) can be followed by read.table to split the fields, so no packages are needed. The [-1] at the end drops the empty first column.

    read.table(text = gsub("(\\d+)", ";\\1;", df$V1), sep = ";", as.is = TRUE,
      strip.white = TRUE, col.names = c(NA, "No1", "Let1", "No2", "Let2"))[-1]
    ##   No1     Let1 No2                    Let2
    ## 1  25 Edgemont  52            Sioux County
    ## 2  57    Burke  88 Papillion-LaVista South
    

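    2) strcapture We can use strcapture, which ships with R in the utils package. The proto list supplies the name and type of each output column, one per capture group in the pattern: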

    proto <- list(No1 = integer(0), Let1 = character(0),
                  No2 = integer(0), Let2 = character(0))
    strcapture("(\\d+) (.*) (\\d+) (.*)", df$V1, proto)
    ##   No1     Let1 No2                    Let2
    ## 1  25 Edgemont  52            Sioux County
    ## 2  57    Burke  88 Papillion-LaVista South
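
Because proto declares No1 and No2 as integer, the scores come back as numbers rather than strings. Here is a short sketch of checking and using that, assuming the same input as in (1); the res name and the margin column are hypothetical additions for illustration and are not part of the original question.

```r
# Same input as in (1); stringsAsFactors = FALSE keeps V1 as character
# on older R versions as well
df <- data.frame(V1 = c("25 Edgemont 52 Sioux County",
                        "57 Burke 88 Papillion-LaVista South"),
                 stringsAsFactors = FALSE)

proto <- list(No1 = integer(0), Let1 = character(0),
              No2 = integer(0), Let2 = character(0))
res <- strcapture("(\\d+) (.*) (\\d+) (.*)", df$V1, proto)

# Column types follow the prototype
sapply(res, class)

# Integer scores can be used in arithmetic directly
res$margin <- res$No2 - res$No1
res$margin
## [1] 27 31
```

Note that the pattern relies on single spaces between the pieces and on each row having exactly two scores; rows that do not match the pattern come back as NA.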
    

    2a) read.pattern We can use read.pattern from the gsubfn package with the same pattern as in (2):

    library(gsubfn)
    
    read.pattern(text = format(df$V1), pattern = "(\\d+) (.*) (\\d+) (.*)", 
      col.names = c("No1", "Let1", "No2", "Let2"), as.is = TRUE, strip.white = TRUE)
    ##   No1     Let1 No2                    Let2
    ## 1  25 Edgemont  52            Sioux County
    ## 2  57    Burke  88 Papillion-LaVista South