Search code examples
rdatasetuniquepriority-queue

In R create dataset of unique values and prioritize elements of duplicates


I have a dataframe in R that contains several duplicate organizations. I would like to create a new dataframe where each organization is only listed once.

OrgName<-c("Org1","Org2","Org3","Org2","Org2","Org3")
ProgramName<-c("P1","P2","P2","P1","P4","P3")
OrgType<-c("Consulting","Education","Government","Government","Consulting","Education")

df<-data.frame(OrgName,ProgramName,OrgType)

df2<-data.frame(matrix(NA,ncol=ncol(df),nrow=length(unique(df$OrgName))))

  OrgName ProgramName    OrgType
1    Org1          P1 Consulting
2    Org2          P2  Education
3    Org3          P2 Government
4    Org2          P1 Government
5    Org2          P4 Consulting
6    Org3          P3  Education

I was able to this above for the first column. But in order to populate the subsequent columns I need to indicate which Program Name and Org Type to display since there is more than one value.

I would like to say that in cases where an organization is part of P1, insert use the name of program 1 in the ProgramName Field and the associated Org Typ in the OrgType field. If an organization is not part of P1, then insert use the name of P2 in the ProgramName Field and the associated Org Typ in the OrgType field. The seuqence would then prioritize P3 and then P4.


Solution

  • Using data.table, convert to data.table with setDT, grouped by 'OrgName', order the 'OrgName', 'ProgramName' and get the first row of the Subset of Data.table (.SD)

    library(data.table)
    setDT(df)[order(OrgName, ProgramName), head(.SD, 1), .(OrgName)]
    #   OrgName ProgramName    OrgType
    #1:    Org1          P1 Consulting
    #2:    Org2          P1 Government
    #3:    Org3          P2 Government