I'm a bit new to R and tm so struggling with this exercise!
I have one description column with messy unstructured data containing words about the name, city and country of a customer. And another column with the amount of sold items.
**Description Sold Items**
Mrs White London UK 10
Mr Wolf London UK 20
Tania Maier Berlin Germany 10
Thomas Germany 30
Nick Forest Leeds UK 20
Silvio Verdi Italy Torino 10
Tom Cardiff UK 10
Mary House London 5
Using the tm package and documenttermmatrix, I'm able to break down each row into terms and get the frequency of each word (i.e. the number of customers with that word).
UK London Germany … Mary
Frequency 4 3 2 … 1
However, I would also like to sum the total amount of sold items.
The desired output should be:
UK London Germany … Mary
Frequency 4 3 2 … 1
Sum of Sold Items 60 35 40 … 5
How can I get to this result?
Assuming you can get to the stage where you have the Frequency
table:
UK London Germany … Mary
Frequency 4 3 2 … 1
and you can extract the words you can use an apply function with a grep
. Here I will create a vector which represents your dictionary you extract from your frequency
table:
S_data<-read.csv("data.csv",stringsAsFactors = F)
Words<-c("UK","London","Germany","Mary")
Then use this in an apply as follows. This could be more efficiently done. But you will get the idea:
string_rows<-sapply(Words, function(x) grep(x,S_data$Description))
string_sum<-unlist(lapply(string_rows, function(x) sum(S_data$Items[x])))
> string_sum
UK London Germany Mary
60 35 40 5
Just bind this onto your frequency
table