r, data-cleaning

Aggregating bag of words vectors by word for many files


I currently have a list of named vectors, one per file. Each element of a vector is the count of a different word in that file, and the element's name is the word.

I would like to turn this list into a data frame where the row names are the file names, the columns are the words (sorted alphabetically, with exactly one column per word), and each cell is the count of that word in that file. Every word used in any file should be included: if file a contains a word that file b does not, the count of that word for file b should be 0.

So essentially the current code looks like:


file1 <- c(1,5,7,2)
names(file1) <- c("a", "by", "her", "the")

file2 <- c(10,5,2)
names(file2) <- c("a", "and", "to")

list(file1, file2)

What I would like is:


df <- data.frame(matrix(nrow=2, ncol=6, byrow=TRUE, data=c(1, 0, 5, 7, 2, 0,
                                                           10, 5, 0, 0, 0, 2)))
colnames(df) <- c("a", "and", "by", "her", "the", "to")
rownames(df) <- c("file1", "file2")
df


Thanks.


Solution

  • The fill argument of the rbindlist function from the data.table package comes in handy here: with fill = TRUE, columns missing from one input are filled with NA.

    library(data.table)
    
    nm = c("file1", "file2")
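    # check.names = FALSE keeps words that are not valid R names (e.g. "don't") unchanged
    d = rbindlist(lapply(mget(nm), function(x) data.frame(t(x), check.names = FALSE)), fill = TRUE)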
    d = as.data.frame(d)
    row.names(d) = nm
    d
    #       a by her the and to
    #file1  1  5   7   2  NA NA
    #file2 10 NA  NA  NA   5  2
    

    To sort the columns of d alphabetically and replace NA with 0, two further steps are needed:

    d = d[,order(colnames(d))]
    d = replace(d, is.na(d), 0)
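
If you would rather avoid the extra package, the same result can probably be reached in base R by collecting the vocabulary first and then filling a zero matrix row by row. This is only a sketch under a few assumptions: the list is named by file (the `counts` name and the `vocab`, `m` and `df` variables are made up for illustration), and each word appears at most once per vector.

```r
# Example input from the question, stored as a named list so file names are kept
file1 <- c(1, 5, 7, 2)
names(file1) <- c("a", "by", "her", "the")
file2 <- c(10, 5, 2)
names(file2) <- c("a", "and", "to")
counts <- list(file1 = file1, file2 = file2)

# Every word seen in any file, sorted alphabetically
vocab <- sort(unique(unlist(lapply(counts, names))))

# Start from all zeros: one row per file, one column per word
m <- matrix(0, nrow = length(counts), ncol = length(vocab),
            dimnames = list(names(counts), vocab))

# Copy each file's counts into its row, matching columns by word
for (f in names(counts)) {
  m[f, names(counts[[f]])] <- counts[[f]]
}

df <- as.data.frame(m)
df
#       a and by her the to
# file1  1   0  5   7   2  0
# file2 10   5  0   0   0  2
```

Because the matrix starts out as zeros, there is no NA replacement step, and the columns are already in sorted order.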