Search code examples
rdataframeduplicatesmultiple-columns

Duplicates for multiple variables in R


i have this code in which i loaded a CSV. It's meant to delete the duplicated rows that have same values, characters, strings,etc. on a specific variable, in this case ID,DATE and Dx.

> alt
ID             RISK GENDER STATUS AGE DAY MONTH YEAR  DATE    SALA    Dx   DAY2  MONTH2 YEAR2  DATE2   Dx1     STATE1  
1  740010662    5      2      2   23  29    12  2009   40176   13   Z33.X    1      1   2010   40179  O82.9    please
2  347866388    5      2      1   23  31    12  2009   40178   13   O06.4    1      1   2010   40179  O06.4      help
3  705280124    5      2      2   33  31    12  2009   40178   13   Z33.X    1      1   2010   40179  O80.9        me
4  791125002    5      2      1   30  30    12  2009   40177   13   O33.5    1      1   2010   40179  O82.9       im 
5  469833092    5      2      1   26  31    12  2009   40178   13   Z33.X    1      1   2010   40179  O80.9    really
6  855318644    5      2      1   35  30    12  2009   40177   13   O47.9    1      1   2010   40179  O80.9   desperate
7  410886387    5      2      2   29  31    12  2009   40178   13   Z33.X    1      1   2010   40179  O80.9      been
8  486326736    5      2      1   37  30    12  2009   40177   13   O72.0    1      1   2010   40179  O82.9   banging
9  995190824    5      2      1   22   1     1  2010   40179   13   Z33.X    1      1   2010   40179  O80.9       my 
10 896565718    5      2      1   28  30    12  2009   40177   13   Z33.X    1      1   2010   40179  O82.9      head
11 347866388    5      2      1   23  31    12  2009   40178   13   O06.4    1      1   2010   40179  O06.4        to
12 855318644    5      2      1   35  30    12  2009   40177   13   O47.9    1      1   2010   40179  O80.9      the 
13 995190824    5      2      1   22   1     1  2010   40179   13   Z33.X    1      1   2010   40179  O80.9      wall

First i loaded the csv

#1.Load database
BU<-read.csv(choose.files(),header = T,)
alt<- BU

Followed by this code

# 2.eliminates duplicates according to multiple variables

# df1<-data frame with unique observations
# df2<-dataframe with duplicated observations

df1 <- alt[ !( duplicated(alt[,1]) & duplicated(alt[,9]) &  duplicated(alt[,11]) ), ]

df2 <- alt[ !(!( duplicated(alt[,1]) & duplicated(alt[,9]) & duplicated(alt[,11]) )), ]

and voila, it works, these being the results.

> df1
      ID       RISK GENDER STATUS AGE DAY MONTH YEAR  DATE   SALA    Dx     DAY2  MONTH2 YEAR2 DATE2   Dx1     STATE
1  740010662    5      2      2    23  29    12 2009  40176   13    Z33.X    1      1     2010 40179   O82.9    please
2  347866388    5      2      1    23  31    12 2009  40178   13    O06.4    1      1     2010 40179   O06.4      help
3  705280124    5      2      2    33  31    12 2009  40178   13    Z33.X    1      1     2010 40179   O80.9        me
4  791125002    5      2      1    30  30    12 2009  40177   13    O33.5    1      1     2010 40179   O82.9       im 
5  469833092    5      2      1    26  31    12 2009  40178   13    Z33.X    1      1     2010 40179   O80.9    really
6  855318644    5      2      1    35  30    12 2009  40177   13    O47.9    1      1     2010 40179   O80.9 desperate
7  410886387    5      2      2    29  31    12 2009  40178   13    Z33.X    1      1     2010 40179   O80.9      been
8  486326736    5      2      1    37  30    12 2009  40177   13    O72.0    1      1     2010 40179   O82.9   banging
9  995190824    5      2      1    22   1     1 2010  40179   13    Z33.X    1      1     2010 40179   O80.9       my 
10 896565718    5      2      1    28  30    12 2009  40177   13    Z33.X    1      1     2010 40179   O82.9      head
> df2
      ID       RISK GENDER STATUS AGE DAY MONTH YEAR  DATE   SALA    Dx    DAY2 MONTH2   YEAR2  DATE2    Dx1     STATE
11 347866388    5      2      1    23  31    12 2009 40178   13     O06.4    1      1      2010   40179    O06.4    to
12 855318644    5      2      1    35  30    12 2009 40177   13     O47.9    1      1      2010   40179    O80.9  the 
13 995190824    5      2      1    22   1     1 2010 40179   13     Z33.X    1      1      2010   40179    O80.9  wall

which works. Since it deletes rows 11,12,and 13 which are duplicates of rows 2,6 and 9.

HOWEVER the original database is about 18,000 observations and 27 variables and there is where the code fails. It seems it considers duplicates based on only one or two variables not the three variables, ID,DATE and Dx.


Solution

  • I think you're looking for

    alt[!duplicated(alt[c('ID','DATE','Dx')]),];
    

    When given a data.frame, the duplicated() function takes into account all columns in the data.frame when deciding which rows are duplicates.

    But beware the caveat:

    The data frame method works by pasting together a character representation of the rows separated by \r, so may be imperfect if the data frame has characters with embedded carriage returns or columns which do not reliably map to characters.


    Response to your comment: The two are not equivalent. My code selects unique rows, while your code selects rows that have at least one column whose cell value does not duplicate a prior value in the column. Your code will fail to select rows whose every cell value is individually a duplicate of a higher-up value in the column, but whose row taken in its entirety is a unique row against all prior rows.

    Here's a simple demonstration:

    g <- expand.grid(x=1:3,y=4:6)[rep(1:9,2L),];
    cbind(g,dennis=!duplicated(g$x) | !duplicated(g$y),bgoldst=!duplicated(g));
    ##     x y dennis bgoldst
    ## 1   1 4   TRUE    TRUE
    ## 2   2 4   TRUE    TRUE
    ## 3   3 4   TRUE    TRUE
    ## 4   1 5   TRUE    TRUE
    ## 5   2 5  FALSE    TRUE
    ## 6   3 5  FALSE    TRUE
    ## 7   1 6   TRUE    TRUE
    ## 8   2 6  FALSE    TRUE
    ## 9   3 6  FALSE    TRUE
    ## 1.1 1 4  FALSE   FALSE
    ## 2.1 2 4  FALSE   FALSE
    ## 3.1 3 4  FALSE   FALSE
    ## 4.1 1 5  FALSE   FALSE
    ## 5.1 2 5  FALSE   FALSE
    ## 6.1 3 5  FALSE   FALSE
    ## 7.1 1 6  FALSE   FALSE
    ## 8.1 2 6  FALSE   FALSE
    ## 9.1 3 6  FALSE   FALSE
    

    For example, note how the first differing row, x=2 y=5, is dropped by your code, because x=2 is a duplicate and y=5 is a duplicate, but it is included by my code, because the row x=2 y=5 taken in its entirety is unique against all prior rows.