Search code examples
rdataframefilterdata-cleaningna

Identify partial duplicate rows where one row is NA and its duplicate is non-NA


This is my first time posting on here, so please bear with me! In R, i am trying to identify rows that are partially duplicate, based on the following conditions:

  • for one row, col1 must be non-NA. it's duplicate must be NA
  • col2 can be different, but all other cols must be the same

Example df:

df = read.table(text = "
col1    col2    col3    col4    col5
a   928     0   TRUE    1
b   1028    0   FALSE   1
c   394     1   TRUE    1
NA  239     4   TRUE    0
NA  102     0   TRUE    1", h = T)

you can see that row 1 and 5 fit the conditions i am looking for: non-NA and NA in col1, col2 is not identical but all other cols are identical. So I would call these rows partial duplicates.

I want the resulting output df to have the duplicate rows (e.g. rows 1 and 5) only. desired output would look like:

df = read.table(text = "
col1    col2    col3    col4    col5
a   928     0   TRUE    1
NA  102     0   TRUE    1", h = T)

note this is just an example df and the actual df is much larger.

thanks!


Solution

  • Basically, what I'm doing below is checking if any col1 value is NA, and any is not NA, for each group of identical rows in the third to the last column. Hope this makes sense!

    all_other_cols <- colnames(df) |> tail(-2)
    
    dplyr::filter(df, any(is.na(col1)) && any(!is.na(col1)), .by = all_other_cols)
    
    

    Output:

      col1 col2 col3 col4 col5
    1    a  928    0 TRUE    1
    2 <NA>  102    0 TRUE    1