I have a column comprised of rows all of which contain many different elements separated by a ",". This multi-valued attribute column (CANDIDATE_SECTORS) contains many elements in each cell (AE, AE, PPP, LSE, D, G, D, AE) separated by a comma. as shown here:
PROJEX_ID <- c("1","2","3","4")
NAME <- c("Hi", "go", "the", "dart")
CANDIDATE_SECTORS <- c("AE, PPP, AE, D", "PPP, D, G, D", "LSE, PPP, PPP", "G, D, H, D, G, G")
PROJ_DATA <- data.frame(PROJEX_ID, NAME, CANDIDATE_SECTORS)
PROJ_DATA$SECTOR_CAPABLE <- ""
I need to count the number of each element inside each individual cell and then store the value occurring most frequently on a row by row basis. (e.g. AE for row 1, D for row 2) etc. Any help would be greatly appreciated.
An example of a desired result, is shown in the SECTOR_CAPABLE column, where "AE" is stored in the 1st row of the SECTOR_CAPABLE column because it has appeared twice in that rows' CANDIDATE_SECTORS cell.
I am relatively new to R, and the code below is what i am trying; it produces a uniform answer of "AE", as far as I understand it, that is is because the for loop is computing the number of times an element is present throughout all the cells in the entire column,.
for(i in 1:nrow(PROJ_DATA)){
CANDIDATE_PROJS_UNLISTED <- setNames(strsplit(as.character(PROJ_DATA$CANDIDATE_SECTORS), ","), PROJ_DATA$PROJEX_ID)
CANDIDATE_PROJS_UNLISTED_TMP <- stack(CANDIDATE_PROJS_UNLISTED)
CANDIDATE_PROJS_UNLISTED_TMP$ORDER <- ave(CANDIDATE_PROJS_UNLISTED_TMP$ind, CANDIDATE_PROJS_UNLISTED_TMP$ind, FUN = seq_along)
candidate_projs_unlisted_tmp <- as.data.frame(table(CANDIDATE_PROJS_UNLISTED_TMP),decreasing=T)
PROJ_DATA$SECTOR_CAPABLE <- head(CANDIDATE_PROJS_UNLISTED$values,1)
}
Split on comma, get counts, get the most frequent:
# example data
df1 <- data.frame(
x = c("c,a,b,b",
"aa,bb,b,c,c,a",
"d,d,b,b"), stringsAsFactors = FALSE)
# get most frequent as new column
df1$res <- sapply(df1$x, function(i) {
z <- unlist(strsplit(i, "," ))
names(sort(table(z), decreasing = TRUE)[1])
})
# result
df1
# x res
# 1 c,a,b,b b
# 2 aa,bb,b,c,c,a c
# 3 d,d,b,b b
Note: you need to decide what to do when there is a tie. At the moment result is sorted alphabetically so on 3rd row we are getting b
instead of d
.