This is my first time posting on here, so please bear with me! In R, i am trying to identify rows that are partially duplicate, based on the following conditions:
Example df:
df = read.table(text = "
col1 col2 col3 col4 col5
a 928 0 TRUE 1
b 1028 0 FALSE 1
c 394 1 TRUE 1
NA 239 4 TRUE 0
NA 102 0 TRUE 1", h = T)
you can see that row 1 and 5 fit the conditions i am looking for: non-NA and NA in col1, col2 is not identical but all other cols are identical. So I would call these rows partial duplicates.
I want the resulting output df to have the duplicate rows (e.g. rows 1 and 5) only. desired output would look like:
df = read.table(text = "
col1 col2 col3 col4 col5
a 928 0 TRUE 1
NA 102 0 TRUE 1", h = T)
note this is just an example df and the actual df is much larger.
thanks!
Basically, what I'm doing below is checking if any col1 value is NA, and any is not NA, for each group of identical rows in the third to the last column. Hope this makes sense!
all_other_cols <- colnames(df) |> tail(-2)
dplyr::filter(df, any(is.na(col1)) && any(!is.na(col1)), .by = all_other_cols)
Output:
col1 col2 col3 col4 col5
1 a 928 0 TRUE 1
2 <NA> 102 0 TRUE 1