Hello I have a sample dataset as follows.
# Load the tidyverse package
library(tidyverse)
# Create the dataset
id <- 1:6
model <- c("0RB3211", NA, "0RB4191",
NA, "0RB4033", NA)
UPC <- c("805289119081", "DK_0RB3447CP_RBCP 50", "8053672006360",
"Green_Classic_G-15_Polar_1.67_PREM_SV", "805289044604",
"DK_0RB2132CP_RBCP 55")
df <- tibble(id, model, UPC)
For the missing values in 'model' column, if its corresponding UPC starts with DK, I need to extract 7-digit number and letter after first underscore and then put it into the column 'model'. For example, for the second row, I need to put "0RB3447" into the 'model' column, for the fourth row, I need to delete the whole row and for the last row, I need to put "0RB2132" into the 'model' column.
# Manipulate the dataset
df_cleaned <- df %>%
rowwise() %>%
mutate(model = ifelse(is.na(model) & str_detect(UPC, "^DK"),
str_extract(UPC, "\\d{2}RB\\d{4}"),
model)) %>%
ungroup() %>%
filter(!(is.na(model) & str_detect(UPC, "[^0-9]")))
# Display the cleaned dataset
print(df_cleaned)
However, it only returns this wrong result.
How can I modify my previous code? Really appreciate it.
Instead of the ifelse
, another option is coalesce
with the existing model
so that it only replaces the NAs in the model with the substring from the UPC extracted with str_replace
. Later keep only the rows where the model starts with 0.
library(dplyr)
library(stringr)
df %>%
mutate(model = coalesce(model,
str_replace(UPC, ".*_(0[^_]+\\d+)[A-Z]+_.*", "\\1"))) %>%
filter(str_detect(model, "^0"))
-output
# A tibble: 5 × 3
id model UPC
<int> <chr> <chr>
1 1 0RB3211 805289119081
2 2 0RB3447 DK_0RB3447CP_RBCP 50
3 3 0RB4191 8053672006360
4 5 0RB4033 805289044604
5 6 0RB2132 DK_0RB2132CP_RBCP 55
In the OP's code, rowwise
is not needed as ifelse
is vectorized. Also, \\d{2}
will not match some of the strings as 0RB..
, shows only a single digit and not 2 before RB. Therefore, use +
to denote one or more digits
df %>%
mutate(model = ifelse(is.na(model) & str_detect(UPC, "^DK"),
str_extract(UPC, "(?<=_)\\d+RB\\d{4}"),
model)) %>%
filter(complete.cases(model))
-output
# A tibble: 5 × 3
id model UPC
<int> <chr> <chr>
1 1 0RB3211 805289119081
2 2 0RB3447 DK_0RB3447CP_RBCP 50
3 3 0RB4191 8053672006360
4 5 0RB4033 805289044604
5 6 0RB2132 DK_0RB2132CP_RBCP 55