I have the example data frame
test.df<-data.frame(id=c("A","A","A","B","B","B"), time=c(1:3,1:3), x1=c(1,1,1,2,2,2), x2=c("A","A","A","B","B","B"))
x1
and x2
variables are the same within each id
I would like to aggregate the above data frame to obtain the following
target.df<-data.frame(id=c("A","B"), x1=c(1,2), x2=c("A","B"))
In a sense I wish to aggregate
without any FUN
. I tried with FUN=unique
but does not seem to work. My original dataframe has 1 million rows and thousands x1,x2....
variables of different type (character, dates etc) but are the same within each ID. This is the same as a pivot table in excel
Many thanks
The problem as you state seems to be removing duplicate rows from a data.frame
and this does not require any aggregation. Based on your example this is what you're after:
unique(test.df[c(1,3,4)])
# id x1 x2
#1 A 1 A
#4 B 2 B
I don't quite get as to what do you mean by:
"I tried with
FUN=unique
but does not seem to work."
Just for the sake of explaining as to what you might have gotten with aggregate
wrong, here, I show how one could get the same with aggregate
:
test.df$x2 <- as.character(test.df$x2)
aggregate(. ~ id, FUN=unique , data = test.df[c(1,3,4)] )
# id x1 x2
#1 A 1 A
#2 B 2 B
However, there is no need to use aggregate()
here. It's terribly inefficient for this problem. You can check it out with system.time(.)
which already gives a difference even on this data:
system.time(unique(test.df[c(1,3,4)]))
# user system elapsed
# 0.001 0.000 0.001
system.time(aggregate(. ~ id, FUN=unique , data = test.df[c(1,3,4)] ))
# user system elapsed
# 0.004 0.000 0.004
Go ahead and run this on your million rows and check your results with identical
and have a look at the run time.
From your comments I think you're confused with the behaviour of unique
. As @mnel explains, it (unique.data.frame)
removes all duplicate rows alone from the given data.frame
. It works for your case because you say that x1
and x2
will have the same values for each ID
. So, you dont have to know where in the data.frame
ID
is. You just have to pick 1 row for each ID.