Search code examples
rdplyrstringrrowwise

Separate string column by row efficiently


I'm trying to separate a string column into two pieces based on chopping up the string. It's best illustrated with example below. rowwise does work, but given the size of the data.frame, I'd like to use a more efficient method. How can I avoid using rowwise?

library(dplyr)
library(stringr)
library(tidyr)

#make data
a <- "(1, 10)"
b <- "(10, 20)"
c <- "(20, 30)"

df <- data.frame(size = c(a,b,c))

# Goal is to separate the 'size' column into 'lower' and 'upper' by
# extracting the value contained in the parens and split by a comma.
# Once the column is split into 'upper' and 'lower' I will perform 
# additional operations.

# DESIRED RESULT
  size     lower upper 
  <fct>    <chr> <chr> 
1 (1, 10)  1     10
2 (10, 20) 10    20
3 (20, 30) 20    30

# WHAT I HAVE TRIED

> #This works... but too inefficient
> df %>%
+   rowwise() %>%
+   mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+            str_split("\\(") %>% .[[1]] %>% .[2])
  size     lower
  <fct>    <chr>
1 (1, 10)  1    
2 (10, 20) 10   
3 (20, 30) 20   

> # I'm not sure why this doesn't work
> df %>%
+   mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+            str_split("\\(") %>% .[[1]] %>% .[2])
      size lower
1  (1, 10)     1
2 (10, 20)     1
3 (20, 30)     1

> #Not obivous how to use separate (tidyr)
> df %>%
+   separate(size, sep=",", c("lower", "upper"))
  lower upper
1    (1   10)
2   (10   20)
3   (20   30)

Solution

  • You don't state your goal explicitly, but it seems like you want to extract the first number from a string. This is easy with stringi::str_extract_first_regex

    library(stringi)
    stri_extract_first_regex(df$size, "[0-9]+")
    # [1] "1"  "10" "20"
    

    So in your case,

    df %>% mutate(lower = as.numeric(stri_extract_first_regex, size, "[0-9]+"))
    

    You can extract all numbers with stri_extract_all_regex.


    Based on your edits:

    df$nums = str_extract_all(df$size, "[0-9]+")
    df$lower = as.numeric(sapply(df$nums, `[[`, 1))
    df$upper = as.numeric(sapply(df$nums, `[[`, 2))
    df
    #       size   nums lower upper
    # 1  (1, 10)  1, 10     1    10
    # 2 (10, 20) 10, 20    10    20
    # 3 (20, 30) 20, 30    20    30
    

    Another way to go is to get rid of the parens and whitespace and then use separate:

    df %>%
        mutate(just_nums = str_replace_all(size, "[^0-9,]", "")) %>%
        separate(just_nums, into = c("lower", "upper"))
    #       size lower upper
    # 1  (1, 10)     1    10
    # 2 (10, 20)    10    20
    # 3 (20, 30)    20    30
    

    The regex pattern "[^0-9,]" matches everything except numbers and commas.