I have a dataframe in R that contains several duplicate organizations. I would like to create a new dataframe where each organization is only listed once.
OrgName<-c("Org1","Org2","Org3","Org2","Org2","Org3")
ProgramName<-c("P1","P2","P2","P1","P4","P3")
OrgType<-c("Consulting","Education","Government","Government","Consulting","Education")
df<-data.frame(OrgName,ProgramName,OrgType)
df2<-data.frame(matrix(NA,ncol=ncol(df),nrow=length(unique(df$OrgName))))
OrgName ProgramName OrgType
1 Org1 P1 Consulting
2 Org2 P2 Education
3 Org3 P2 Government
4 Org2 P1 Government
5 Org2 P4 Consulting
6 Org3 P3 Education
I was able to this above for the first column. But in order to populate the subsequent columns I need to indicate which Program Name and Org Type to display since there is more than one value.
I would like to say that in cases where an organization is part of P1, insert use the name of program 1 in the ProgramName
Field and the associated Org Typ in the OrgType
field. If an organization is not part of P1, then insert use the name of P2 in the ProgramName
Field and the associated Org Typ in the OrgType
field. The seuqence would then prioritize P3 and then P4.
Using data.table
, convert to data.table
with setDT
, grouped by 'OrgName', order
the 'OrgName', 'ProgramName' and get the first row of the Subset of Data.table (.SD
)
library(data.table)
setDT(df)[order(OrgName, ProgramName), head(.SD, 1), .(OrgName)]
# OrgName ProgramName OrgType
#1: Org1 P1 Consulting
#2: Org2 P1 Government
#3: Org3 P2 Government