[Edit: The responses are great; thank you. Unfortunately, I oversimplified my reprex. My actual data have dashes inside the parentheses sometimes. Also, as was noted, sometimes there are more than two variants. I am editing my reprex accordingly.]
I am tidying a table of prices for products from a given manufacturer. The models come in variants, with each variant of the same model being the same price, so the original table creator listed the price only once, and distinguished the models with an internal parenthesized alternative scheme. E.g. "1000-a" and "1000-b" are variants of the same model, indicated on the price list as "1000-a(b)". I would like to list each model out on its own row.
Reprex:
prices <- data.frame(
model = c("1000-a(-b)", "2000-a(b)", "150-L(R)", "TX-a(-b)-J", "350-LX(-S)(-AB)"),
price = seq(1999,5999,1000)
)
> prices
model price
1 1000-a(-b) 1999
2 2000-a(b) 2999
3 150-L(R) 3999
4 TX-a(-b)-J 4999
5 350-LX(-S)(-AB) 5999
Sometimes the differences are single characters, sometimes multiple. Sometimes they are at the end of the model number, sometimes in the middle. I would like code which produces this:
> desired_prices
model price
1 1000-a 1999
2 1000-b 1999
3 2000-a 2999
4 2000-b 2999
5 150-L 3999
6 150-R 3999
7 TX-a-J 4999
8 TX-b-J 4999
9 350-LX 5999
10 350-S 5999
11 350-AB 5999
This seems related to the question Split comma-separated strings in a column into separate rows, but in my case the values aren't comma delimited.
How may I produce this split/expansion? A tidyverse solution is preferred but not required.
Updated to reflect OPs revised data frame. Key regex steps:
library(dplyr)
library(tidyr)
library(stringr)
prices %>%
mutate(var = str_extract_all(model, "\\([\\-A-z]{1,3}\\)"),
mod_1 = str_remove_all(model, "\\(([\\-A-z]{1,3})\\)")) %>%
unnest(var) %>%
mutate(var = str_remove_all(var, "[(\\-)]"),
mod_2 = str_replace(mod_1, "(?<=\\-)[A-z]{1,3}", var)) %>%
select(price, mod_1, mod_2) %>%
pivot_longer(-price) %>%
select(-name) %>%
distinct()
#> # A tibble: 11 x 2
#> price value
#> <dbl> <chr>
#> 1 1999 1000-a
#> 2 1999 1000-b
#> 3 2999 2000-a
#> 4 2999 2000-b
#> 5 3999 150-L
#> 6 3999 150-R
#> 7 4999 TX-a-J
#> 8 4999 TX-b-J
#> 9 5999 350-LX
#> 10 5999 350-S
#> 11 5999 350-AB
Created on 2021-08-26 by the reprex package (v2.0.0)
data
prices <- data.frame(
model = c("1000-a(-b)", "2000-a(b)", "150-L(R)", "TX-a(-b)-J", "350-LX(-S)(-AB)"),
price = seq(1999,5999,1000))
Initial dataset from OP:
library(dplyr)
library(tidyr)
library(stringr)
prices <- data.frame(
model = c("1000-a(b)", "2000-a(b)", "150-lx(sn)", "350-lx(sn)", "TXa(b)-J"),
price = seq(1999,5999,1000)
)
prices1 <-
prices %>%
mutate(model1 = str_remove(model, "\\(([a-z]{1,2})\\)"),
model2 = str_remove(model, "[a-z]{1,2}(?=\\()"),
model2 = str_replace_all(model2, "[()]", "")) %>%
select(-model)%>%
pivot_longer(-price, values_to = "model") %>%
select(-name)
prices1
#> # A tibble: 10 x 2
#> price model
#> <dbl> <chr>
#> 1 1999 1000-a
#> 2 1999 1000-b
#> 3 2999 2000-a
#> 4 2999 2000-b
#> 5 3999 150-lx
#> 6 3999 150-sn
#> 7 4999 350-lx
#> 8 4999 350-sn
#> 9 5999 TXa-J
#> 10 5999 TXb-J
Created on 2021-08-26 by the reprex package (v2.0.0)