r, list, dataframe, tm, corpus

Converting multiple dataframes within a list to their own unique corpus objects


I've split a large dataframe by the levels of a particular column into a list of dataframes using split(), and am now trying to turn each dataframe into its own corpus object using the Corpus() function from the tm package, but I can't get the desired result.

I've tried creating a list of random normal values of the same length as my list of dataframes, renaming each element of that list, converting each dataframe to a corpus object, and assigning each corpus to the corresponding renamed element.

library(tm) # provides Corpus() and VectorSource()

df <- data.frame("A" = 10:12, "B" = c(1, 1, 2)) # create example df

split_df <- split(df, f = df$B, drop = TRUE) # split df by B col

names(split_df) <- c("df1", "df2") # rename dfs

split_df 

> split_df
$df1
   A B
1 10 1
2 11 1

$df2
   A B
3 12 2

y <- as.list(rnorm(length(split_df))) # create list of norms length of df list

names(y) <- paste("corpus", 1:length(y), sep="_") # rename elements of list

# iterate over list and assign same column of each df to individual corpus
for(i in 1:length(y)){
        y[i] <- Corpus(VectorSource(split_df[[i]]$A))
}

list2env(y, envir = .GlobalEnv)

Basically, I want to create multiple corpus objects (one per dataframe in the list), each with its own unique name, without having to type out the variable name and the Corpus() call by hand for each of the 104 dataframes in my list.

# actual result:

y[1]

> y[1]
$corpus_1
[1] "10" "11"

# expected result:

works_1 <- Corpus(VectorSource(split_df[[1]]$A))
works_1

> works_1
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 2

How can I reproduce the expected result above for 104 separate dataframes in a list, each with its own name (i.e. corpus_1, corpus_2, ..., corpus_104)?

Many thanks.


Solution

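  • lapply is the way to go. Pass the column holding the text (A in the example) to VectorSource(); passing the whole data frame would treat each column, rather than each row, as a document.

    library(tm)
    
    # create list of corpora, one per data frame, built from column A
    all_corps <- lapply(split_df, function(x) Corpus(VectorSource(x$A)))
    
    summary(all_corps)
        Length Class        Mode
    df1 2      SimpleCorpus list
    df2 1      SimpleCorpus list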
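
To get the corpus_1, corpus_2, ... names asked for in the question, the list returned by lapply can be renamed and, if separate variables are really needed, copied into the global environment with list2env(). The following is a minimal sketch using the example data from the question; it assumes column A holds the text. It also shows the original for loop with `[[` in place of `[`. The single-bracket assignment most likely stored only the first component of each corpus (its content), which would explain the "10" "11" output.

```r
library(tm)

# Example data from the question
df <- data.frame(A = 10:12, B = c(1, 1, 2))
split_df <- split(df, f = df$B, drop = TRUE)

# One corpus per data frame, built from column A only
all_corps <- lapply(split_df, function(x) Corpus(VectorSource(x$A)))

# Rename the list elements to corpus_1, corpus_2, ...
names(all_corps) <- paste("corpus", seq_along(all_corps), sep = "_")

# Optional: copy each element into the global environment under its name
invisible(list2env(all_corps, envir = .GlobalEnv))

# The loop from the question, with `[[` so the whole corpus is stored
y <- vector("list", length(split_df))
names(y) <- names(all_corps)
for (i in seq_along(split_df)) {
  y[[i]] <- Corpus(VectorSource(split_df[[i]]$A))
}
```

Keeping the corpora in the named list (all_corps$corpus_1, all_corps[["corpus_2"]], and so on) is usually easier to work with than 104 separate variables, since later steps can again be applied to all of them with lapply.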