
Delimiting a column based on the number of delimiters within that column


I have a vector list.exp2 where each entry is one or more strings separated by commas. I would like to split each entry and keep the first n strings, where n depends on the number of delimiters present in that entry.

I've tried the code below, but it has not worked so far:

refined.final.list <- as.vector(sapply(list.exp2, function(n)
         ifelse(count.fields(textConnection(list.exp2[n]), sep = ",") < 3,
                unlist(strsplit(list.exp2[n], ","))[1],
                count.fields(textConnection(list.exp2[n]), sep = ",") < 5, 
                unlist(strsplit(list.exp2[n], ","))[1:2],
                unlist(strsplit(list.exp2[n], ","))[1:4])))

Basically, I used ifelse together with count.fields, which gives me the number of comma-separated fields in each entry, and unlist(strsplit(...)) is supposed to give me the corresponding split elements.

The list.exp2 vector (named lis.exp2 in the snippet below) looks like this:

lis.exp2 <- c("ISTITUTO PER LA SINTESI ORGANICA E LA FOTOREATTIVITÀ (ISOF-CNR), SEZIONE DI FERRARA, VIA L. BORSARI 46, 44100 FERRARA, ITALY",
              "FLUXOME SCIENCES A/S, SØLTOFTS PLADS, BUILDING 223, DK-2800 KGS. LYNGBY, DENMARK",
              "FERDINAND-BRAUN-INSTITUT FÜR HÖCHSTFREQUENZTECHNIK, GUSTAV-KIRCHHOFF-STR. 4, 12489 BERLIN, GERMANY")

Any insights into how to correct this code will be greatly appreciated.


Solution

  • One option is to call strsplit directly on your vector lis.exp2. This returns a list with one element for each entry of the vector. Then use lapply to keep the desired number of elements from each entry.

    For example, to return the first 3 items:

    n <- 3
    lapply(strsplit(lis.exp2, split=","), function(x)x[1:n])
    
    # Or, based on @thelatemail's suggestion:
    
    lapply(strsplit(lis.exp2, split=","), head, n)
    
    #Result
    # [[1]]
    # [1] "ISTITUTO PER LA SINTESI ORGANICA E LA FOTOREATTIVITÀ (ISOF-CNR)"
    # [2] " SEZIONE DI FERRARA"                                            
    # [3] " VIA L. BORSARI 46"                                             
    # 
    # [[2]]
    # [1] "FLUXOME SCIENCES A/S" " SØLTOFTS PLADS"      " BUILDING 223"       
    # 
    # [[3]]
    # [1] "FERDINAND-BRAUN-INSTITUT FÜR HÖCHSTFREQUENZTECHNIK"
    # [2] " GUSTAV-KIRCHHOFF-STR. 4"                          
    # [3] " 12489 BERLIN"    
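
    The results above keep the space that follows each comma. If that is unwanted, one possible variation (a sketch, not part of the original answer) is to split on a comma plus any whitespace after it, or to split on the comma and then call trimws on the pieces. The vector addr below is a hypothetical, shortened stand-in for lis.exp2.

    ```r
    # Hypothetical sample entry, used only for illustration
    addr <- c("ACME LABS, MAIN STREET 1, BUILDING 2, 12345 TOWN, COUNTRY")
    n <- 3

    # Option 1: treat the comma and any whitespace after it as the delimiter
    parts_regex <- lapply(strsplit(addr, split = ",\\s*"), head, n)

    # Option 2: split on the comma only, then trim surrounding whitespace
    parts_trim <- lapply(strsplit(addr, split = ","),
                         function(x) trimws(head(x, n)))
    ```

    Both options should give the same pieces here; the regular-expression version avoids a second pass over the data.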
    

    **UPDATED:** Based on feedback from the OP, a function can be written that checks the number of items in each entry: if there are 4 or fewer, it returns only the first item; otherwise it returns the first 3 items.

    # Return the first item or the first 3 items, depending on the number of items
    getNItems <- function(x){
      if(length(x) <= 4){
        #only 1st
        x[1]
      }else{
        #first 3
        x[1:3]
      }
    }                                 
    lapply(strsplit(lis.exp2, split=","), getNItems)
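
    The question itself asks for three tiers rather than two. Below is a minimal sketch of that rule, assuming the thresholds in the question's code are the intended ones: fewer than 3 fields keeps the first, fewer than 5 keeps the first two, and anything longer keeps the first four. The names pick_fields and sample_entries are hypothetical and used only for illustration. A plain if/else is used instead of ifelse because ifelse takes exactly three arguments (test, yes, no) and returns a result shaped like test, so a length-one test yields a single element.

    ```r
    # Hypothetical helper: keep a number of leading fields that depends on
    # how many fields the entry has (< 3 -> 1, < 5 -> 2, otherwise 4)
    pick_fields <- function(x) {
      k <- length(x)
      n_keep <- if (k < 3) 1 else if (k < 5) 2 else 4
      head(x, n_keep)
    }

    # Hypothetical sample entries with 2, 4 and 5 comma-separated fields
    sample_entries <- c("A, B", "A, B, C, D", "A, B, C, D, E")

    # Split on a comma plus optional whitespace, then apply the rule per entry
    result <- lapply(strsplit(sample_entries, split = ",\\s*"), pick_fields)
    ```

    Using head(x, n_keep) rather than x[1:n_keep] means an entry with fewer fields than requested is returned as-is instead of being padded with NA.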