Search code examples
rdataframedata-manipulation

Using tidyverse, how can I replace NAs with character pattern, but remove entire row if no match is found?


Hello I have a sample dataset as follows.

# Load the tidyverse package
library(tidyverse)

# Create the dataset
id <- 1:6
model <- c("0RB3211", NA, "0RB4191",
           NA, "0RB4033", NA)
UPC <- c("805289119081", "DK_0RB3447CP_RBCP  50", "8053672006360",
         "Green_Classic_G-15_Polar_1.67_PREM_SV", "805289044604",
         "DK_0RB2132CP_RBCP  55")
df <- tibble(id, model, UPC)

enter image description here

For the missing values in 'model' column, if its corresponding UPC starts with DK, I need to extract 7-digit number and letter after first underscore and then put it into the column 'model'. For example, for the second row, I need to put "0RB3447" into the 'model' column, for the fourth row, I need to delete the whole row and for the last row, I need to put "0RB2132" into the 'model' column.

Here is my expected result: enter image description here

# Manipulate the dataset
df_cleaned <- df %>%
  rowwise() %>%
  mutate(model = ifelse(is.na(model) & str_detect(UPC, "^DK"),
                        str_extract(UPC, "\\d{2}RB\\d{4}"),
                        model)) %>%
  ungroup() %>%
  filter(!(is.na(model) & str_detect(UPC, "[^0-9]")))

# Display the cleaned dataset
print(df_cleaned)

However, it only returns this wrong result. enter image description here

How can I modify my previous code? Really appreciate it.


Solution

  • Instead of the ifelse, another option is coalesce with the existing model so that it only replaces the NAs in the model with the substring from the UPC extracted with str_replace. Later keep only the rows where the model starts with 0.

    library(dplyr)
    library(stringr)
     df %>% 
      mutate(model = coalesce(model,
          str_replace(UPC, ".*_(0[^_]+\\d+)[A-Z]+_.*", "\\1"))) %>% 
       filter(str_detect(model, "^0"))
    

    -output

    # A tibble: 5 × 3
         id model   UPC                  
      <int> <chr>   <chr>                
    1     1 0RB3211 805289119081         
    2     2 0RB3447 DK_0RB3447CP_RBCP  50
    3     3 0RB4191 8053672006360        
    4     5 0RB4033 805289044604         
    5     6 0RB2132 DK_0RB2132CP_RBCP  55
    

    In the OP's code, rowwise is not needed as ifelse is vectorized. Also, \\d{2} will not match some of the strings as 0RB.., shows only a single digit and not 2 before RB. Therefore, use + to denote one or more digits

    df %>%
      mutate(model = ifelse(is.na(model) & str_detect(UPC, "^DK"),
                            str_extract(UPC, "(?<=_)\\d+RB\\d{4}"),
                            model)) %>%
      filter(complete.cases(model))
    

    -output

    # A tibble: 5 × 3
         id model   UPC                  
      <int> <chr>   <chr>                
    1     1 0RB3211 805289119081         
    2     2 0RB3447 DK_0RB3447CP_RBCP  50
    3     3 0RB4191 8053672006360        
    4     5 0RB4033 805289044604         
    5     6 0RB2132 DK_0RB2132CP_RBCP  55