I am trying to create a dfm of letters from strings. The problem is that the dfm does not pick up, or create features for, punctuation characters such as "/", "-", "." or "'".
require(quanteda)
dict = c('a','b','c','d','e','f','/',".",'-',"'")
dict <- quanteda::dictionary(sapply(dict, list))
x<-c("cab","baa", "a/de-d/f","ad")
x<-sapply(x, function(x) strsplit(x,"")[[1]])
x<-sapply(x, function(x) paste(x, collapse = " "))
mat <- dfm(x, dictionary = dict, valuetype = "regex")
mat <- as.matrix(mat)
mat
The problem (as @lukeA points out in a comment) is that your valuetype is using the wrong pattern match. You are using a regular expression, where the . stands for any character, so it matches every token and gives you a total (what you call a rowsum).
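The difference between the two match types can be seen in base R, independently of quanteda. This is a minimal sketch: the token vector `toks` is hypothetical, and `grepl()` only stands in for the dictionary matching; it is not claimed to be what dfm() calls internally.

```r
# Hypothetical tokens for illustration: three letters plus one literal period
toks <- c("c", "a", "b", ".")

# As a regular expression, "." matches any single character,
# so every token counts as a hit
regex_hits <- grepl(".", toks)

# As a fixed string, "." matches only a literal period
fixed_hits <- grepl(".", toks, fixed = TRUE)

sum(regex_hits)  # 4
sum(fixed_hits)  # 1
```

The regex count equals the number of tokens, which is why the . feature ends up looking like a row total.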
Let's first look at x, which will be tokenised on the whitespace by dfm(), so that each character becomes a token.
x
#               cab               baa          a/de-d/f                ad
#           "c a b"           "b a a" "a / d e - d / f"             "a d"
To answer (2) first, you are getting the following with a "regex" match:
dfm(x, dictionary = dict, valuetype = "regex", verbose = FALSE)
## Document-feature matrix of: 4 documents, 10 features.
## 4 x 10 sparse Matrix of class "dfmSparse"
##           features
## docs       a b c d e f / . - '
##   cab      1 1 1 0 0 0 0 3 0 0
##   baa      2 1 0 0 0 0 0 3 0 0
##   a/de-d/f 1 0 0 2 1 1 0 5 0 0
##   ad       1 0 0 1 0 0 0 2 0 0
That accounts for (2), but does not answer (1). To solve that, switch valuetype to "fixed" so that the . matches only a literal period, and alter the default tokenisation behaviour of dfm() so that it does not remove punctuation.
dfm(x, dictionary = dict, valuetype = "fixed", removePunct = FALSE, verbose = FALSE)
## Document-feature matrix of: 4 documents, 10 features.
## 4 x 10 sparse Matrix of class "dfmSparse"
##           features
## docs       a b c d e f / . - '
##   cab      1 1 1 0 0 0 0 0 0 0
##   baa      2 1 0 0 0 0 0 0 0 0
##   a/de-d/f 1 0 0 2 1 1 2 0 1 0
##   ad       1 0 0 1 0 0 0 0 0 0
Now the / and - are being counted. The . and ' remain present as features because they were dictionary keys, but have a zero count for every document.
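As a cross-check, the same counts can be reproduced in base R without quanteda. This is only a sketch of the idea (split on whitespace, keep punctuation, match keys exactly), not a replacement for dfm(); the names `keys` and `counts` are introduced here for illustration.

```r
# Same input as in the question, already split into space-separated characters
x <- c(cab = "c a b", baa = "b a a",
       "a/de-d/f" = "a / d e - d / f", ad = "a d")
keys <- c("a", "b", "c", "d", "e", "f", "/", ".", "-", "'")

# Split each document on whitespace, then count exact matches per key.
# factor() with explicit levels keeps keys that never occur as zero counts.
counts <- t(vapply(
  strsplit(x, " ", fixed = TRUE),
  function(tok) as.integer(table(factor(tok, levels = keys))),
  integer(length(keys))
))
colnames(counts) <- keys
counts
```

Because every key is an explicit factor level, the . and ' columns are kept with zeros, mirroring how the dictionary keys stay in the dfm as features.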