i have this code in which i loaded a CSV. It's meant to delete the duplicated rows that have same values, characters, strings,etc. on a specific variable, in this case ID,DATE and Dx.
> alt
ID RISK GENDER STATUS AGE DAY MONTH YEAR DATE SALA Dx DAY2 MONTH2 YEAR2 DATE2 Dx1 STATE1
1 740010662 5 2 2 23 29 12 2009 40176 13 Z33.X 1 1 2010 40179 O82.9 please
2 347866388 5 2 1 23 31 12 2009 40178 13 O06.4 1 1 2010 40179 O06.4 help
3 705280124 5 2 2 33 31 12 2009 40178 13 Z33.X 1 1 2010 40179 O80.9 me
4 791125002 5 2 1 30 30 12 2009 40177 13 O33.5 1 1 2010 40179 O82.9 im
5 469833092 5 2 1 26 31 12 2009 40178 13 Z33.X 1 1 2010 40179 O80.9 really
6 855318644 5 2 1 35 30 12 2009 40177 13 O47.9 1 1 2010 40179 O80.9 desperate
7 410886387 5 2 2 29 31 12 2009 40178 13 Z33.X 1 1 2010 40179 O80.9 been
8 486326736 5 2 1 37 30 12 2009 40177 13 O72.0 1 1 2010 40179 O82.9 banging
9 995190824 5 2 1 22 1 1 2010 40179 13 Z33.X 1 1 2010 40179 O80.9 my
10 896565718 5 2 1 28 30 12 2009 40177 13 Z33.X 1 1 2010 40179 O82.9 head
11 347866388 5 2 1 23 31 12 2009 40178 13 O06.4 1 1 2010 40179 O06.4 to
12 855318644 5 2 1 35 30 12 2009 40177 13 O47.9 1 1 2010 40179 O80.9 the
13 995190824 5 2 1 22 1 1 2010 40179 13 Z33.X 1 1 2010 40179 O80.9 wall
First i loaded the csv
#1.Load database
BU<-read.csv(choose.files(),header = T,)
alt<- BU
Followed by this code
# 2.eliminates duplicates according to multiple variables
# df1<-data frame with unique observations
# df2<-dataframe with duplicated observations
df1 <- alt[ !( duplicated(alt[,1]) & duplicated(alt[,9]) & duplicated(alt[,11]) ), ]
df2 <- alt[ !(!( duplicated(alt[,1]) & duplicated(alt[,9]) & duplicated(alt[,11]) )), ]
and voila, it works, these being the results.
> df1
ID RISK GENDER STATUS AGE DAY MONTH YEAR DATE SALA Dx DAY2 MONTH2 YEAR2 DATE2 Dx1 STATE
1 740010662 5 2 2 23 29 12 2009 40176 13 Z33.X 1 1 2010 40179 O82.9 please
2 347866388 5 2 1 23 31 12 2009 40178 13 O06.4 1 1 2010 40179 O06.4 help
3 705280124 5 2 2 33 31 12 2009 40178 13 Z33.X 1 1 2010 40179 O80.9 me
4 791125002 5 2 1 30 30 12 2009 40177 13 O33.5 1 1 2010 40179 O82.9 im
5 469833092 5 2 1 26 31 12 2009 40178 13 Z33.X 1 1 2010 40179 O80.9 really
6 855318644 5 2 1 35 30 12 2009 40177 13 O47.9 1 1 2010 40179 O80.9 desperate
7 410886387 5 2 2 29 31 12 2009 40178 13 Z33.X 1 1 2010 40179 O80.9 been
8 486326736 5 2 1 37 30 12 2009 40177 13 O72.0 1 1 2010 40179 O82.9 banging
9 995190824 5 2 1 22 1 1 2010 40179 13 Z33.X 1 1 2010 40179 O80.9 my
10 896565718 5 2 1 28 30 12 2009 40177 13 Z33.X 1 1 2010 40179 O82.9 head
> df2
ID RISK GENDER STATUS AGE DAY MONTH YEAR DATE SALA Dx DAY2 MONTH2 YEAR2 DATE2 Dx1 STATE
11 347866388 5 2 1 23 31 12 2009 40178 13 O06.4 1 1 2010 40179 O06.4 to
12 855318644 5 2 1 35 30 12 2009 40177 13 O47.9 1 1 2010 40179 O80.9 the
13 995190824 5 2 1 22 1 1 2010 40179 13 Z33.X 1 1 2010 40179 O80.9 wall
which works. Since it deletes rows 11,12,and 13 which are duplicates of rows 2,6 and 9.
HOWEVER the original database is about 18,000 observations and 27 variables and there is where the code fails. It seems it considers duplicates based on only one or two variables not the three variables, ID,DATE and Dx.
I think you're looking for
alt[!duplicated(alt[c('ID','DATE','Dx')]),];
When given a data.frame, the duplicated()
function takes into account all columns in the data.frame when deciding which rows are duplicates.
But beware the caveat:
The data frame method works by pasting together a character representation of the rows separated by \r, so may be imperfect if the data frame has characters with embedded carriage returns or columns which do not reliably map to characters.
Response to your comment: The two are not equivalent. My code selects unique rows, while your code selects rows that have at least one column whose cell value does not duplicate a prior value in the column. Your code will fail to select rows whose every cell value is individually a duplicate of a higher-up value in the column, but whose row taken in its entirety is a unique row against all prior rows.
Here's a simple demonstration:
g <- expand.grid(x=1:3,y=4:6)[rep(1:9,2L),];
cbind(g,dennis=!duplicated(g$x) | !duplicated(g$y),bgoldst=!duplicated(g));
## x y dennis bgoldst
## 1 1 4 TRUE TRUE
## 2 2 4 TRUE TRUE
## 3 3 4 TRUE TRUE
## 4 1 5 TRUE TRUE
## 5 2 5 FALSE TRUE
## 6 3 5 FALSE TRUE
## 7 1 6 TRUE TRUE
## 8 2 6 FALSE TRUE
## 9 3 6 FALSE TRUE
## 1.1 1 4 FALSE FALSE
## 2.1 2 4 FALSE FALSE
## 3.1 3 4 FALSE FALSE
## 4.1 1 5 FALSE FALSE
## 5.1 2 5 FALSE FALSE
## 6.1 3 5 FALSE FALSE
## 7.1 1 6 FALSE FALSE
## 8.1 2 6 FALSE FALSE
## 9.1 3 6 FALSE FALSE
For example, note how the first differing row, x=2 y=5
, is dropped by your code, because x=2
is a duplicate and y=5
is a duplicate, but it is included by my code, because the row x=2 y=5
taken in its entirety is unique against all prior rows.