I created a dataframe that looks like this:
# Dataframe
GeneID TrID PSI Length Ranking
ENSMUSG00000089809 ENSMUST00000146396 0.20 431801 3
ENSMUSG00000089809 ENSMUST00000161516 0.23 354036 2
ENSMUSG00000089809 ENSMUST00000161148 0.57 5601 1
ENSMUSG00000044681 ENSMUST00000117098 0.05 4400 2
ENSMUSG00000044681 ENSMUST00000141196 0.10 1118 1
ENSMUSG00000044681 ENSMUST00000141601 0.75 44973 5
Now I would like to select, for each GeneID, the TrID that has the highest PSI value, together with its Ranking. In the end, the output should look like this:
# Desired Output Dataframe
GeneID TrID PSI Length Ranking
ENSMUSG00000089809 ENSMUST00000161148 0.57 5601 1
ENSMUSG00000044681 ENSMUST00000141601 0.75 44973 5
After that, I will build a distribution of the Ranking values and check which PSI value each rank corresponds to. As a control for that distribution, I will permute the Length values and the TrID values.
You can use base R and do:
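# Row indices grouped by gene (the column is named GeneID, not GeneId)
byGeneId = split(seq_len(nrow(Dataframe)), Dataframe$GeneID)
# For each gene, keep the row index with the largest PSI
whichTopPsi = sapply(byGeneId, function(i) i[which.max(Dataframe[i, 'PSI'])])
Dataframe[whichTopPsi, ]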
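To check this against the sample data in the question, here is a minimal self-contained sketch. The data frame is rebuilt by hand from the rows shown above, and the names byGene, topRows and result are only illustrative. Note that split() returns the groups sorted by GeneID, so the rows may come back in a different order than in the desired output, and which.max() keeps only the first row if two transcripts tie on PSI.

```r
# Sample data copied from the question
Dataframe <- data.frame(
  GeneID  = rep(c("ENSMUSG00000089809", "ENSMUSG00000044681"), each = 3),
  TrID    = c("ENSMUST00000146396", "ENSMUST00000161516", "ENSMUST00000161148",
              "ENSMUST00000117098", "ENSMUST00000141196", "ENSMUST00000141601"),
  PSI     = c(0.20, 0.23, 0.57, 0.05, 0.10, 0.75),
  Length  = c(431801, 354036, 5601, 4400, 1118, 44973),
  Ranking = c(3, 2, 1, 2, 1, 5),
  stringsAsFactors = FALSE
)

# Row indices grouped by gene
byGene <- split(seq_len(nrow(Dataframe)), Dataframe$GeneID)

# Within each group, pick the row index with the largest PSI
topRows <- sapply(byGene, function(i) i[which.max(Dataframe$PSI[i])])

# One row per gene: the transcript with the highest PSI
result <- Dataframe[topRows, ]
print(result)
```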
You could also use ddply, which is more general:
library(plyr)
# Split by GeneID, keep the row with the largest PSI in each piece, and recombine
ddply(Dataframe, "GeneID", function(d) d[which.max(d[, 'PSI']), , drop = FALSE])
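If plyr is not available, the same split-apply-combine pattern can be sketched in base R. This is an alternative, not what ddply does internally; the helper name topByPsi is hypothetical and the sample data is rebuilt from the question. Because the per-group function receives a whole data frame, it can be swapped for any other per-gene summary.

```r
# Sample data copied from the question
Dataframe <- data.frame(
  GeneID  = rep(c("ENSMUSG00000089809", "ENSMUSG00000044681"), each = 3),
  TrID    = c("ENSMUST00000146396", "ENSMUST00000161516", "ENSMUST00000161148",
              "ENSMUST00000117098", "ENSMUST00000141196", "ENSMUST00000141601"),
  PSI     = c(0.20, 0.23, 0.57, 0.05, 0.10, 0.75),
  Length  = c(431801, 354036, 5601, 4400, 1118, 44973),
  Ranking = c(3, 2, 1, 2, 1, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the row of a per-gene data frame with the largest PSI
topByPsi <- function(d) d[which.max(d$PSI), , drop = FALSE]

# Split into one data frame per gene, apply the helper, and bind the rows back
pieces <- lapply(split(Dataframe, Dataframe$GeneID), topByPsi)
top <- do.call(rbind, pieces)
print(top)
```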