Search code examples
rreplacedata-cleaningstringrmarket-basket-analysis

Stringr pattern matching: Is there a way to identify multiple product descriptions?


In my dataset, i have product descriptions that appear as:

  1. Product A, Product A, Product A

and in other rows as

  1. Product A, Product B, Product A, Product B

and in some rows, as just

  1. Product A

Initially, my dataset had strings in the format:

  1. Product A, Product B, Product A, Product B, Product A, Product B

and

  1. Product A, Product A, Product A

Since I wanted just one instance of each product, I resolved this issue by using the following code:

df$lengths <- str_length(df$items)

df$new_items <- str_sub(df$items, 1, df$lengths/3)

Is there a way to solve the above problem by modifying this code?

df <-
structure(list(Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)", 
"Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)", 
"Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", 
"samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s(3+32)", 
"Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)"
)), class = "data.frame", row.names = c(NA, -6L))

Solution

  • EDIT:

    If the comma-separated strings do not always contain identical elements, more complex solutions are in order:

    Data:

    Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)", "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)", "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", "samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32)", "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)")
    

    Solution 1: A regex solution based on negative character class, negative lookahead, and backreference -- basically, a one-liner:

    library(dplyr)
    library(stringr)
    data.frame(Product_name) %>%
      mutate(Product_name = str_extract_all(Product_name, "((?!\\s)[^,]+)(?!.*\\1)"))
                                                 Product_name
    1  Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)
    2  Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)
    3                                     Samsung A32 (6+128)
    4                                     samsung A02s (3+32)
    5                                 Xiaomi Redmi 10 (6+128)
    6                               Redmi Note 10 Pro (6+128)
    

    Solution 2: Based on tidyr functionality

    library(tidyr)
    library(dplyr)
    data.frame(Product_name) %>%
      # create identifier:
      mutate(row = row_number()) %>%
      # separate rows into individual elements:
      separate_rows(Product_name, sep = ", ") %>%
      group_by(row) %>%
      # remove duplicated elements:
      filter(!duplicated(Product_name)) %>%
      # put distinct elements back into the same row:
      summarise(Product_name = toString(Product_name))
    # A tibble: 6 x 2
        row Product_name                                          
      <int> <chr>                                                 
    1     1 Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (3+32)
    2     2 Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (4+64)
    3     3 Samsung A32 (6+128)                                   
    4     4 samsung A02s (3+32)                                   
    5     5 Xiaomi Redmi 10 (6+128)                               
    6     6 Redmi Note 10 Pro (6+128)
    

    Before edit:

    This solution is based on the assumption that the comma-separated elements in the strings are always identical:

    library(stringr)
    str_extract(Product_name, "[^,]+")
    [1] "Samsung Galaxy A03s (4+64)" "Samsung Galaxy A03s (3+32)"
    [3] "Samsung A32 (6+128)"        "samsung A02s (3+32)"       
    [5] "Xiaomi Redmi 10 (6+128)"    "Redmi Note 10 Pro (6+128)"
    

    Data:

    Product_name = c("Samsung Galaxy A03s (4+64), Samsung Galaxy A03s (4+64)", "Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32), Samsung Galaxy A03s (3+32)", "Samsung A32 (6+128), Samsung A32 (6+128), Samsung A32 (6+128)", "samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32), samsung A02s (3+32)", "Xiaomi Redmi 10 (6+128), Xiaomi Redmi 10 (6+128)", "Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128), Redmi Note 10 Pro (6+128)")