
How to add known words tokenizer keras python?


I want to convert text to sequences using Keras with Indonesian-language text, but the Keras tokenizer only detects words it already knows.

How do I add known words in Keras? Or is there another way to convert my text to sequences?

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=n_most_common_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(concated['TITLE'].values)
txt = ["bisnis di indonesia sangat maju"]
seq = list(tokenizer.texts_to_sequences_generator(txt))

The "seq" variable ends up as an empty array if I use Indonesian text, but it works perfectly with English words. How do I use Keras with other languages? Or is there a way to add known words to the Keras tokenizer?

Thanks


Solution

  • Keras doesn't know any language or words in advance. You build the vocabulary yourself with the fit_on_texts or fit_on_sequences method.

    I guess you fitted the tokenizer on English text (i.e., concated['TITLE'].values). As a result, the internal vocabulary contains only English words (and no Indonesian words). This explains why seq is empty when txt contains only non-English words.

    Also, you can take a look at the source code of the Tokenizer class.
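    To make the point concrete, here is a minimal sketch (the two Indonesian sample sentences are made up for illustration): if you fit the tokenizer on Indonesian text, the vocabulary contains Indonesian words and texts_to_sequences no longer returns empty lists.

    ```python
    from keras.preprocessing.text import Tokenizer

    # Fit on an Indonesian corpus instead of English titles.
    corpus = [
        "bisnis di indonesia sangat maju",
        "ekonomi indonesia terus tumbuh",
    ]

    tokenizer = Tokenizer(lower=True)
    tokenizer.fit_on_texts(corpus)  # builds the word -> index vocabulary

    # Every word of this sentence is now in the vocabulary,
    # so the resulting sequence is non-empty.
    seq = tokenizer.texts_to_sequences(["bisnis di indonesia sangat maju"])
    print(seq)           # one integer index per known word
    print(tokenizer.word_index)
    ```

    Any word that was not seen during fit_on_texts is silently dropped (unless you pass an oov_token to the Tokenizer constructor), which is exactly why fitting on English text yields empty sequences for Indonesian input.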