Does anybody know if it is possible to add custom tokens after texts have been tokenized in Quanteda?
I am trying to do some analysis of Chinese-language texts, but the tokenizer doesn't recognise the abbreviation for ASEAN, "东盟", as a single word (see the example below).
Alternatively, are there any other tokenizers for Chinese-language texts that work with Quanteda? I had been using the Spacyr package, but cannot get it working at the moment.
I have written some functions that use the feature co-occurrence matrix to count the number of times other words appear within a 5-word window of a particular term, and then produce a table of these results (see below). However, this doesn't seem to work for the term "东盟".
library(quanteda)

## Function 1: build a feature co-occurrence matrix from raw texts
get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")   # Chinese stopword list
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>%
    tokens_remove(ch_stop)
  fcm <- fcm(toks, context = "window")          # co-occurrences within the default 5-token window
  return(fcm)
}
## Function 2: extract the co-occurrence counts for one term and sort them
convert2df <- function(matrix, term) {
  mat_term <- matrix[term, ]                    # row of the fcm for the chosen term
  df <- convert(t(mat_term), to = "data.frame")
  colnames(df)[1] <- "CoTerm"
  colnames(df)[2] <- "Freq"
  x <- df[order(-df$Freq), ]                    # sort by descending co-occurrence count
  return(x)
}
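For reference, this is roughly how I call the two functions; mydata here is just a placeholder for my actual character vector of texts.

# mydata stands in for a character vector of Chinese-language documents
mydata <- c("中国与东盟加强合作", "东盟峰会在雅加达举行")

myfcm <- get_fcm(mydata)
convert2df(myfcm, "合作")   # works for terms the tokenizer keeps whole
convert2df(myfcm, "东盟")   # fails, because "东盟" is split into "东" and "盟"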
Would adding %>% tokens_compound(phrase("东 盟"), concatenator = "") to the toks <- line of Function 1 resolve this?
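In other words, something like this (untested; just to show where the extra step would go):

get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>%
    tokens_remove(ch_stop) %>%
    tokens_compound(phrase("东 盟"), concatenator = "")   # rejoin the split abbreviation
  fcm <- fcm(toks, context = "window")
  return(fcm)
}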
You can post-process split phrases such as "东盟" and rejoin them after tokenising, if you have a specific list of them.
> tokens("东盟") %>%
+ tokens_compound(phrase("东 盟"), concatenator = "")
Tokens consisting of 1 document.
text1 :
[1] "东盟"