Tags: python, keras, nlp, tokenize, cjk

Tokenizing Chinese text with keras.preprocessing.text.Tokenizer


keras.preprocessing.text.Tokenizer doesn't work correctly with Chinese text. How can I modify it to work on Chinese text?

from keras.preprocessing.text import Tokenizer
def fit_get_tokenizer(data, max_words):
    tokenizer = Tokenizer(num_words=max_words, filters='!"#%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
    tokenizer.fit_on_texts(data)
    return tokenizer
tokenizer = fit_get_tokenizer(df.sentence,max_words=150000)
print('Total number of words: ', len(tokenizer.word_index))
vocabulary_inv = {index: word for word, index in tokenizer.word_index.items()}
print(vocabulary_inv)
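
For context: the Keras `Tokenizer` only splits text on whitespace (plus the characters in `filters`), so unsegmented Chinese text ends up as one token per sentence instead of one token per word. A minimal sketch with two made-up sentences illustrates the problem:

from keras.preprocessing.text import Tokenizer

# Two short, unsegmented Chinese sentences (hypothetical examples).
samples = ["我喜欢自然语言处理", "今天天气很好"]

t = Tokenizer()
t.fit_on_texts(samples)

# With no whitespace between words, each whole sentence becomes a single
# "word", so the vocabulary has only two entries.
print(t.word_index)  # e.g. {'我喜欢自然语言处理': 1, '今天天气很好': 2}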

Solution

import re
import jieba
from keras.preprocessing.text import Tokenizer

def fit_get_tokenizer(data, max_words):
    segmented_sentences = []
    for paragraph in data:
        # Split each paragraph into sentences on Western/Chinese end-of-sentence punctuation.
        sentences = re.findall(r'(.*?[。.!?!?])\s?', paragraph)
        for sentence in sentences:
            # Segment the sentence into words with jieba, then re-join with spaces
            # so the Keras Tokenizer can split on whitespace as usual.
            seg_list = jieba.lcut(sentence, cut_all=False)
            segmented_sentences.append(" ".join(seg_list))
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(segmented_sentences)
    return tokenizer
    
For anyone working through Chinese text segmentation: I used a regular expression to extract sentences from each Chinese paragraph, then used jieba (instead of NLTK) to get proper word tokens and make the text ready for the Keras Tokenizer.
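
To illustrate, here is a minimal usage sketch; the data frame contents and the `max_words` value are made up for the example:

import pandas as pd

# Hypothetical data frame with one Chinese paragraph per row.
df = pd.DataFrame({"sentence": ["我喜欢自然语言处理。今天天气很好!", "机器学习很有趣。"]})

tokenizer = fit_get_tokenizer(df.sentence, max_words=150000)
print('Total number of words:', len(tokenizer.word_index))

# The vocabulary now contains jieba word tokens instead of whole sentences.
vocabulary_inv = {index: word for word, index in tokenizer.word_index.items()}
print(vocabulary_inv)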