Search code examples
rsplitcountdelimiterelement

Retrieve most common element in a multivalued attribute split by delimiter


I have a column comprised of rows all of which contain many different elements separated by a ",". This multi-valued attribute column (CANDIDATE_SECTORS) contains many elements in each cell (AE, AE, PPP, LSE, D, G, D, AE) separated by a comma. as shown here:

 PROJEX_ID <- c("1","2","3","4")
 NAME <- c("Hi", "go", "the", "dart")
 CANDIDATE_SECTORS <- c("AE, PPP, AE, D", "PPP, D, G, D", "LSE, PPP, PPP", "G, D, H, D, G, G")
 PROJ_DATA <- data.frame(PROJEX_ID, NAME, CANDIDATE_SECTORS)
 PROJ_DATA$SECTOR_CAPABLE <- ""

enter image description here

I need to count the number of each element inside each individual cell and then store the value occurring most frequently on a row by row basis. (e.g. AE for row 1, D for row 2) etc. Any help would be greatly appreciated.

An example of a desired result, is shown in the SECTOR_CAPABLE column, where "AE" is stored in the 1st row of the SECTOR_CAPABLE column because it has appeared twice in that rows' CANDIDATE_SECTORS cell.

I am relatively new to R, and the code below is what i am trying; it produces a uniform answer of "AE", as far as I understand it, that is is because the for loop is computing the number of times an element is present throughout all the cells in the entire column,.

for(i in 1:nrow(PROJ_DATA)){ 
  CANDIDATE_PROJS_UNLISTED <- setNames(strsplit(as.character(PROJ_DATA$CANDIDATE_SECTORS), ","), PROJ_DATA$PROJEX_ID) 
  CANDIDATE_PROJS_UNLISTED_TMP <- stack(CANDIDATE_PROJS_UNLISTED) 
  CANDIDATE_PROJS_UNLISTED_TMP$ORDER <- ave(CANDIDATE_PROJS_UNLISTED_TMP$ind, CANDIDATE_PROJS_UNLISTED_TMP$ind, FUN = seq_along) 
  candidate_projs_unlisted_tmp <- as.data.frame(table(CANDIDATE_PROJS_UNLISTED_TMP),decreasing=T) 
  PROJ_DATA$SECTOR_CAPABLE <- head(CANDIDATE_PROJS_UNLISTED$values,1) 
}

Solution

  • Split on comma, get counts, get the most frequent:

    # example data
    df1 <- data.frame(
      x = c("c,a,b,b",
            "aa,bb,b,c,c,a",
            "d,d,b,b"), stringsAsFactors = FALSE)
    
    # get most frequent as new column
    df1$res <- sapply(df1$x, function(i) {
      z <- unlist(strsplit(i, "," ))
      names(sort(table(z), decreasing = TRUE)[1])
    })
    
    # result
    df1
    #               x res
    # 1       c,a,b,b   b
    # 2 aa,bb,b,c,c,a   c
    # 3       d,d,b,b   b
    

    Note: you need to decide what to do when there is a tie. At the moment result is sorted alphabetically so on 3rd row we are getting b instead of d.