Search code examples
rtext-miningtmcorpus

How can I create a term matrix that sums numeric values associated to each document?


I'm a bit new to R and tm so struggling with this exercise!

I have one description column with messy unstructured data containing words about the name, city and country of a customer. And another column with the amount of sold items.

**Description   Sold Items**
Mrs White London UK 10
Mr Wolf London UK   20
Tania Maier Berlin Germany  10
Thomas Germany  30
Nick Forest Leeds UK    20
Silvio Verdi Italy Torino   10
Tom Cardiff UK  10
Mary House London   5

Using the tm package and documenttermmatrix, I'm able to break down each row into terms and get the frequency of each word (i.e. the number of customers with that word).

         UK London  Germany …   Mary
Frequency   4   3   2   …   1

However, I would also like to sum the total amount of sold items.

The desired output should be:

         UK London  Germany …   Mary
Frequency   4   3   2   …   1
Sum of Sold Items   60  35  40  …   5

How can I get to this result?


Solution

  • Assuming you can get to the stage where you have the Frequency table:

               UK London  Germany …   Mary
    Frequency   4   3   2   …   1
    

    and you can extract the words you can use an apply function with a grep. Here I will create a vector which represents your dictionary you extract from your frequency table:

    S_data<-read.csv("data.csv",stringsAsFactors = F)
    
    Words<-c("UK","London","Germany","Mary")
    

    Then use this in an apply as follows. This could be more efficiently done. But you will get the idea:

    string_rows<-sapply(Words, function(x) grep(x,S_data$Description))
    
    string_sum<-unlist(lapply(string_rows, function(x) sum(S_data$Items[x])))
    > string_sum
         UK  London Germany    Mary 
         60      35      40       5 
    

    Just bind this onto your frequency table