Consider this example:
tibble(text = c('a grande latte with soy milk',
'black coffee no room'),
repetition = c(100, 2))
# A tibble: 2 x 2
text repetition
<chr> <dbl>
1 a grande latte with soy milk 100
2 black coffee no room 2
The data means that the sentence a grande latte with soy milk appears 100 times in my dataset. Of course, it is a waste of memory to store that redundancy, which is why I have the repetition variable.
Still, I would like the dfm from quanteda to reflect that, because the sparseness of the dfm gives me some room to keep that information. That is, how can I still have 100 rows for the first text in the dfm? Just using the following code does not take repetition into account:
tibble(text = c('a grande latte with soy milk',
'black coffee no room'),
repetition = c(100, 2)) %>%
corpus() %>%
tokens() %>%
dfm()
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
features
docs a grande latte with soy milk black coffee no room
text1 1 1 1 1 1 1 0 0 0 0
text2 0 0 0 0 0 0 1 1 1 1
Supposing your data.frame is called df1, you can use cbind to add a column to the dfm. But that might not give you the required result, because the counts are then stored as one more feature column rather than as document weights. The other two options below are probably better.
cbind
df1 <- tibble(text = c('a grande latte with soy milk',
'black coffee no room'),
repetition = c(100, 2))
my_dfm <- df1 %>%
corpus() %>%
tokens() %>%
dfm() %>%
cbind(repetition = df1$repetition) # add column to dfm with name repetition
Document-feature matrix of: 2 documents, 11 features (45.5% sparse).
2 x 11 sparse Matrix of class "dfm"
features
docs a grande latte with soy milk black coffee no room repetition
text1 1 1 1 1 1 1 0 0 0 0 100
text2 0 0 0 0 0 0 1 1 1 1 2
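One way to see why the cbind route may not be what you want: the new column is treated like any other feature, so it leaks into per-document totals. A minimal base-R sketch, with an ordinary matrix standing in for the dfm (`m` and `with_rep` are illustrative names, not part of the code above):

```r
# Toy count matrix standing in for the dfm: 2 documents x 3 features
m <- matrix(c(1, 0, 1, 0, 0, 1), nrow = 2,
            dimnames = list(c("text1", "text2"), c("latte", "soy", "coffee")))

# Bind the repetition counts on as an extra column, as cbind does above
with_rep <- cbind(m, repetition = c(100, 2))

rowSums(m)         # token counts per document
rowSums(with_rep)  # now inflated by the repetition column
```

Anything that sums or weights over all features would need to exclude that column explicitly.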
docvars
You can also add the data via the docvars function. The data is then attached to the dfm, but a bit more hidden, in the dfm-class slots (reachable with @).
docvars(my_dfm, "repetition") <- df1$repetition
docvars(my_dfm)
repetition
text1 100
text2 2
multiplication
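Using multiplication (my_dfm here is the plain dfm, without the repetition column added by cbind above; each row is scaled by its repetition count):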
my_dfm * df1$repetition
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
features
docs a grande latte with soy milk black coffee no room
text1 100 100 100 100 100 100 0 0 0 0
text2 0 0 0 0 0 0 2 2 2 2
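The multiplication works row by row because R recycles the shorter vector down the columns of a matrix, so element i of the vector scales row i. A minimal base-R sketch, with an ordinary matrix standing in for the dfm (the names `m`, `reps`, `weighted` and `expanded` are made up for illustration), which also checks that the weighted feature totals match what you would get by physically repeating each row:

```r
# Toy count matrix standing in for the dfm: 2 documents x 3 features
m <- matrix(c(1, 0, 1, 0, 0, 1), nrow = 2,
            dimnames = list(c("text1", "text2"), c("latte", "soy", "coffee")))
reps <- c(100, 2)

# reps is recycled down each column, so row i is multiplied by reps[i]
weighted <- m * reps

# Physically repeating each row reps[i] times (102 rows in total)
expanded <- m[rep(seq_len(nrow(m)), times = reps), , drop = FALSE]

# Both versions give the same feature totals
colSums(weighted)
colSums(expanded)
```

This relies on the vector having exactly one entry per document, in document order. With a different length, R would recycle it in a way that no longer lines up with the rows.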