I'm trying to separate a string column into two pieces based on chopping up the string. It's best illustrated with example below. rowwise
does work, but given the size of the data.frame, I'd like to use a more efficient method. How can I avoid using rowwise
?
library(dplyr)
library(stringr)
library(tidyr)
#make data
a <- "(1, 10)"
b <- "(10, 20)"
c <- "(20, 30)"
df <- data.frame(size = c(a,b,c))
# Goal is to separate the 'size' column into 'lower' and 'upper' by
# extracting the value contained in the parens and split by a comma.
# Once the column is split into 'upper' and 'lower' I will perform
# additional operations.
# DESIRED RESULT
size lower upper
<fct> <chr> <chr>
1 (1, 10) 1 10
2 (10, 20) 10 20
3 (20, 30) 20 30
# WHAT I HAVE TRIED
> #This works... but too inefficient
> df %>%
+ rowwise() %>%
+ mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+ str_split("\\(") %>% .[[1]] %>% .[2])
size lower
<fct> <chr>
1 (1, 10) 1
2 (10, 20) 10
3 (20, 30) 20
> # I'm not sure why this doesn't work
> df %>%
+ mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+ str_split("\\(") %>% .[[1]] %>% .[2])
size lower
1 (1, 10) 1
2 (10, 20) 1
3 (20, 30) 1
> #Not obivous how to use separate (tidyr)
> df %>%
+ separate(size, sep=",", c("lower", "upper"))
lower upper
1 (1 10)
2 (10 20)
3 (20 30)
You don't state your goal explicitly, but it seems like you want to extract the first number from a string. This is easy with stringi::str_extract_first_regex
library(stringi)
stri_extract_first_regex(df$size, "[0-9]+")
# [1] "1" "10" "20"
So in your case,
df %>% mutate(lower = as.numeric(stri_extract_first_regex, size, "[0-9]+"))
You can extract all numbers with stri_extract_all_regex
.
Based on your edits:
df$nums = str_extract_all(df$size, "[0-9]+")
df$lower = as.numeric(sapply(df$nums, `[[`, 1))
df$upper = as.numeric(sapply(df$nums, `[[`, 2))
df
# size nums lower upper
# 1 (1, 10) 1, 10 1 10
# 2 (10, 20) 10, 20 10 20
# 3 (20, 30) 20, 30 20 30
Another way to go is to get rid of the parens and whitespace and then use separate:
df %>%
mutate(just_nums = str_replace_all(size, "[^0-9,]", "")) %>%
separate(just_nums, into = c("lower", "upper"))
# size lower upper
# 1 (1, 10) 1 10
# 2 (10, 20) 10 20
# 3 (20, 30) 20 30
The regex pattern "[^0-9,]"
matches everything except numbers and commas.