Does anybody know if it is possible to add custom tokens after texts have been tokenized in Quanteda?
I am trying to do some analysis of Chinese-language texts, but the tokenizer doesn't recognise the abbreviation for ASEAN, "东盟", as a single word (see the example below).
Alternatively, are there any other tokenizers for Chinese-language texts that work with Quanteda? I had been using the Spacyr package, but cannot get it working at the moment.
I have written some functions that use the feature co-occurrence matrix to count the number of times other words appear within a 5-word window of a particular term, and then produce a table of these results (see below). However, this doesn't seem to work for the term "东盟".
library(quanteda)

## Function 1: build a feature co-occurrence matrix from raw texts
get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")   # Chinese stopword list
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>%
    tokens_remove(ch_stop)
  fcm <- fcm(toks, context = "window")          # co-occurrences within the default 5-token window
  return(fcm)
}
## Function 2: extract the co-occurrence counts for one term and sort them
convert2df <- function(matrix, term) {
  mat_term <- matrix[term, ]                    # row of the fcm for the chosen term
  df <- convert(t(mat_term), to = "data.frame")
  colnames(df)[1] <- "CoTerm"
  colnames(df)[2] <- "Freq"
  x <- df[order(-df$Freq), ]                    # sort by descending co-occurrence count
  return(x)
}
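For reference, this is roughly how I call the two functions; mydata here is just a placeholder for my actual character vector of texts.

# mydata stands in for a character vector of Chinese-language documents
mydata <- c("中国与东盟加强合作", "东盟峰会在雅加达举行")

myfcm <- get_fcm(mydata)
convert2df(myfcm, "合作")   # works for terms the tokenizer keeps whole
convert2df(myfcm, "东盟")   # fails, because "东盟" is split into "东" and "盟"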
Would adding %>% tokens_compound(phrase("东 盟"), concatenator = "") to the toks <- line of Function 1 resolve this?
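In other words, something like this (untested; just to show where the extra step would go):

get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>%
    tokens_remove(ch_stop) %>%
    tokens_compound(phrase("东 盟"), concatenator = "")   # rejoin the split abbreviation
  fcm <- fcm(toks, context = "window")
  return(fcm)
}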
You can post-process split phrases such as "东盟" and rejoin them after tokenising, if you have a specific list of them.
> tokens("东盟") %>%
+ tokens_compound(phrase("东 盟"), concatenator = "")
Tokens consisting of 1 document.
text1 :
[1] "东盟"