The data
I have a dataframe in R with the following sort of structure:
ID group text
100 1 An apple is a sweet, edible fruit produced by an apple tree.
103 1 An apple is a sweet, edible fruit produced by an apple tree.
105 1 Some dog breeds show more variation in size than other dog breeds.
106 1 An apple is a sweet, edible fruit produced by an apple tree.
107 1 An apple is a sweet, edible fruit produced by an apple tree.
209 1 Some dog breeds show more variation in size than other dog breeds.
300 1 Some dog breeds show more variation in size than other dog breeds.
501 1 An apple is a sweet, edible fruit produced by an apple tree.
503 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
711 2 Pizza is a savory dish of Italian origin.
799 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
811 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
Which can be reproduced with this code:
test_df <- data.frame(
"ID" = c(100, 103, 105, 106, 107, 209, 300, 501, 503, 711, 799, 811,),
"group" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
"text" = c('An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Some dog breeds show more variation in size than other dog breeds.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Pizza is a savory dish of Italian origin.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.')
)
In reality, the texts dealing with each topic are slightly different from one another, and there are several hundred thousand of them, spread across dozens of groups.
What I'm trying to do
I'm trying to write a function that does the following:
Here is an example of what two rows in the dataframe might look like after the analysis:
ID group topic text
100 1 apple An apple is a sweet, edible fruit produced by an apple tree.
105 1 dog Some dog breeds show more variation in size than other dog breeds.
What I have so far
I can use the following code to run this sort of function in general, on a full dataframe (without subsetting by group):
# Preparing the texts
library(tm)
corpus <- Corpus(VectorSource(test_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# Identifying topics
library(topicmodels)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')
# Inputting the topic classification into the dataframe
test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
I've tried to turn this into a function, and then run the function on the dataframe by subset, using the following code:
library(tm)
library(topicmodels)
topic_identifier <- function(text) {
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')
test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
}
by(test_df$text, test_df$group, topic_identifier)
But this doesn't allow me to save the relevant output for each subset in the original df.
The by
function works best when it receives a data frame as input not the column vector, text. Then, you can manipulate this data frame for return not original test_df. Essentially, same exact process on whole data frame is retained to be run on subset data frames.
Also, you need to assign the result with <-
to build a list of objects which can even be binded together at the end with do.call
+ rbind
(assuming each data frame maintains same number and names of columns):
topic_identifier <- function(sub_df) {
corpus <- Corpus(VectorSource(sub_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')
sub_df <- cbind(sub_df, terms(lda.output)[topics(lda.output)])
return(sub_df)
}
# BUILD LIST OF DFs, PASSING IN AND RETURNING A DATA FRAME
df_list <- by(test_df, test_df$group, topic_identifier)
# CONCATENATE ALL DFs INTO SINGLE DF
final_df <- do.call(rbind, unname(df_list))