r function dataframe subset cluster-analysis

Running a clustering function on subsets of dataframe and writing the results for each subset in the dataframe

The data

I have a dataframe in R with the following sort of structure:

ID   group           text
100    1    An apple is a sweet, edible fruit produced by an apple tree.
103    1    An apple is a sweet, edible fruit produced by an apple tree.
105    1    Some dog breeds show more variation in size than other dog breeds.
106    1    An apple is a sweet, edible fruit produced by an apple tree.
107    1    An apple is a sweet, edible fruit produced by an apple tree.
209    1    Some dog breeds show more variation in size than other dog breeds.
300    1    Some dog breeds show more variation in size than other dog breeds.
501    1    An apple is a sweet, edible fruit produced by an apple tree.
503    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
711    2    Pizza is a savory dish of Italian origin.
799    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
811    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.

Which can be reproduced with this code:

test_df <- data.frame(
  "ID" = c(100, 103, 105, 106, 107, 209, 300, 501, 503, 711, 799, 811,),
  "group" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  "text" = c('An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Some dog breeds show more variation in size than other dog breeds.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Pizza is a savory dish of Italian origin.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.')
)

In reality, the texts dealing with each topic are slightly different from one another, and there are several hundred thousand of them, spread across dozens of groups.

What I'm trying to do

I'm trying to write a function that does the following:

For each group in the dataframe, compare all the texts within that group, and identify the main lexical topics.
Then, input the relevant topic for each text as a new column in the dataframe.

Here is an example of what two rows in the dataframe might look like after the analysis:

ID   group    topic           text
100    1      apple    An apple is a sweet, edible fruit produced by an apple tree.
105    1       dog     Some dog breeds show more variation in size than other dog breeds.

What I have so far

I can use the following code to run this sort of function in general, on a full dataframe (without subsetting by group):

# Preparing the texts

library(tm)
corpus <- Corpus(VectorSource(test_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Identifying topics

library(topicmodels)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')

# Inputting the topic classification into the dataframe

test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])

I've tried to turn this into a function, and then run the function on the dataframe by subset, using the following code:

library(tm)
library(topicmodels)

topic_identifier <- function(text) {
  corpus <- Corpus(VectorSource(text))
    corpus <- tm_map(corpus, removeWords, stopwords('english'))
    corpus <- tm_map(corpus, stemDocument, language = 'english')
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
  TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
      lda.output <- LDA(TF, k=2, method = 'Gibbs')
  test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
    }

by(test_df$text, test_df$group, topic_identifier)

But this doesn't allow me to save the relevant output for each subset in the original df.

Solution

The by function works best when it receives a data frame as input not the column vector, text. Then, you can manipulate this data frame for return not original test_df. Essentially, same exact process on whole data frame is retained to be run on subset data frames.

Also, you need to assign the result with <- to build a list of objects which can even be binded together at the end with do.call + rbind (assuming each data frame maintains same number and names of columns):

topic_identifier <- function(sub_df) { 
   corpus <- Corpus(VectorSource(sub_df$text)) 
   corpus <- tm_map(corpus, removeWords, stopwords('english')) 
   corpus <- tm_map(corpus, stemDocument, language = 'english') 
   corpus <- tm_map(corpus, removePunctuation) 
   corpus <- tm_map(corpus, stripWhitespace) 

   TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
   lda.output <- LDA(TF, k=2, method = 'Gibbs') 

   sub_df <- cbind(sub_df, terms(lda.output)[topics(lda.output)]) 
   return(sub_df)
} 

# BUILD LIST OF DFs, PASSING IN AND RETURNING A DATA FRAME
df_list <- by(test_df, test_df$group, topic_identifier)

# CONCATENATE ALL DFs INTO SINGLE DF
final_df <- do.call(rbind, unname(df_list))