
Frequency of strings and their IDs in a dataframe using R


The goal is to count how often each string occurs in a text variable and to list the IDs of the rows in which it occurs.

Suppose Sample is a dataframe as shown below:

Sample <- data.frame(ID = c('1', '2', '3', '4', '5', '6'), 
                        Var = c('How are you', 
                                 'Do not go', 
                                 'How are you', 
                                 'Please go',  
                                 'How are you',
                                 'Do not go'))

The following command generates the frequency of the strings in the column Var as follows:

as.data.frame(table(unlist(strsplit(tolower(Sample$Var), ', '))))

[Image: frequency table of the strings in Var]

Is there a way to also include the associated IDs in the table, for example like this?

[Image: desired output table with the associated IDs]


Solution

  • Base R solution:

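    # Split the rows by Var, summarise each group, then stack the results.
    data.frame(
      do.call(rbind, lapply(split(Sample, Sample$Var), function(x) {
        # One row per string: its count and a comma-separated list of its IDs
        data.frame(Var = unique(x$Var), Freq = nrow(x), ID = toString(x$ID))
      })),
      row.names = NULL, stringsAsFactors = FALSE
    )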
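
A possible alternative, also in base R, is to let aggregate() do the grouping. This is only a sketch and not part of the original answer; the intermediate names freq, ids and res are made up for illustration. It counts the rows per string, collapses the IDs per string, and joins the two results on Var:

```r
# Sample data from the question (stringsAsFactors set explicitly for older R versions)
Sample <- data.frame(ID = c("1", "2", "3", "4", "5", "6"),
                     Var = c("How are you", "Do not go", "How are you",
                             "Please go", "How are you", "Do not go"),
                     stringsAsFactors = FALSE)

# Number of rows per distinct string
freq <- aggregate(ID ~ Var, data = Sample, FUN = length)
names(freq)[names(freq) == "ID"] <- "Freq"

# Comma-separated IDs per distinct string
ids <- aggregate(ID ~ Var, data = Sample, FUN = toString)

# Join counts and IDs on the shared Var column
res <- merge(freq, ids, by = "Var")
print(res)
```

To count case-insensitively, as the tolower() call in the question suggests, one could first overwrite Var with tolower(Sample$Var) and then run the same steps.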