Search code examples
rdata-munging

In R, expand a table, splitting apart rows distinguished internally


[Edit: The responses are great; thank you. Unfortunately, I oversimplified my reprex. My actual data have dashes inside the parentheses sometimes. Also, as was noted, sometimes there are more than two variants. I am editing my reprex accordingly.]

I am tidying a table of prices for products from a given manufacturer. The models come in variants, with each variant of the same model being the same price, so the original table creator listed the price only once, and distinguished the models with an internal parenthesized alternative scheme. E.g. "1000-a" and "1000-b" are variants of the same model, indicated on the price list as "1000-a(b)". I would like to list each model out on its own row.

Reprex:

prices <- data.frame(
  model = c("1000-a(-b)", "2000-a(b)", "150-L(R)", "TX-a(-b)-J", "350-LX(-S)(-AB)"),
  price = seq(1999,5999,1000)
)
> prices
            model price
1      1000-a(-b)  1999
2       2000-a(b)  2999
3        150-L(R)  3999
4      TX-a(-b)-J  4999
5 350-LX(-S)(-AB)  5999

Sometimes the differences are single characters, sometimes multiple. Sometimes they are at the end of the model number, sometimes in the middle. I would like code which produces this:

> desired_prices
    model price
1  1000-a  1999
2  1000-b  1999
3  2000-a  2999
4  2000-b  2999
5   150-L  3999
6   150-R  3999
7  TX-a-J  4999
8  TX-b-J  4999
9  350-LX  5999
10  350-S  5999
11 350-AB  5999

This seems related to the question Split comma-separated strings in a column into separate rows, but in my case the values aren't comma delimited.

How may I produce this split/expansion? A tidyverse solution is preferred but not required.


Solution

  • Updated to reflect OPs revised data frame. Key regex steps:

    1. Separate all variations i.e. parentheses and content
    2. Clean reference model
    3. Separate row for each of the additional model ids
    4. Remove parentheses and dash from additional model ids
    5. Replace reference id with additional model id
    library(dplyr)
    library(tidyr)
    library(stringr)
    
      prices %>% 
      mutate(var = str_extract_all(model, "\\([\\-A-z]{1,3}\\)"),
             mod_1 = str_remove_all(model, "\\(([\\-A-z]{1,3})\\)")) %>% 
      unnest(var) %>% 
      mutate(var = str_remove_all(var, "[(\\-)]"),
             mod_2 = str_replace(mod_1, "(?<=\\-)[A-z]{1,3}", var)) %>% 
      select(price, mod_1, mod_2) %>% 
      pivot_longer(-price) %>% 
      select(-name) %>% 
      distinct()
    
    #> # A tibble: 11 x 2
    #>    price value 
    #>    <dbl> <chr> 
    #>  1  1999 1000-a
    #>  2  1999 1000-b
    #>  3  2999 2000-a
    #>  4  2999 2000-b
    #>  5  3999 150-L 
    #>  6  3999 150-R 
    #>  7  4999 TX-a-J
    #>  8  4999 TX-b-J
    #>  9  5999 350-LX
    #> 10  5999 350-S 
    #> 11  5999 350-AB
    

    Created on 2021-08-26 by the reprex package (v2.0.0)

    data

    prices <- data.frame(
      model = c("1000-a(-b)", "2000-a(b)", "150-L(R)", "TX-a(-b)-J", "350-LX(-S)(-AB)"),
      price = seq(1999,5999,1000))
    

    Initial dataset from OP:

    library(dplyr)
    library(tidyr)
    library(stringr)
    
    
    prices <- data.frame(
      model = c("1000-a(b)", "2000-a(b)", "150-lx(sn)", "350-lx(sn)", "TXa(b)-J"),
      price = seq(1999,5999,1000)
    )
    
    
    prices1 <- 
      prices %>% 
      mutate(model1 = str_remove(model, "\\(([a-z]{1,2})\\)"),
             model2 = str_remove(model, "[a-z]{1,2}(?=\\()"),
             model2 = str_replace_all(model2, "[()]", "")) %>% 
      select(-model)%>% 
      pivot_longer(-price, values_to = "model") %>% 
      select(-name)
    
    
    prices1       
    #> # A tibble: 10 x 2
    #>    price model 
    #>    <dbl> <chr> 
    #>  1  1999 1000-a
    #>  2  1999 1000-b
    #>  3  2999 2000-a
    #>  4  2999 2000-b
    #>  5  3999 150-lx
    #>  6  3999 150-sn
    #>  7  4999 350-lx
    #>  8  4999 350-sn
    #>  9  5999 TXa-J 
    #> 10  5999 TXb-J
    

    Created on 2021-08-26 by the reprex package (v2.0.0)