python text keras nltk tokenize

How to one-hot encode a word list into an int8 matrix in Keras using the Tokenizer class (to prevent MemoryError)


float64, the default data type of the tokenized matrix, takes more memory than I need. I want the matrix to be int8 instead, to save space.
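To put a rough number on the saving, here is a minimal sketch that compares the two dtypes for a zero matrix. The shape is a made-up example (1000 texts, 1000-word vocabulary), not a figure from my data.

```python
import numpy as np

# Hypothetical shape: 1000 texts, 1000-word vocabulary.
shape = (1000, 1000)

as_float64 = np.zeros(shape)               # NumPy's default dtype is float64
as_int8 = np.zeros(shape, dtype=np.int8)   # one byte per element

print(as_float64.dtype, as_float64.nbytes)  # 8 bytes per element
print(as_int8.dtype, as_int8.nbytes)        # 1 byte per element
```

For the same shape, the int8 matrix needs one eighth of the memory.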

link to documentation

This is the method I'm talking about:

texts_to_matrix(texts, mode="binary"):
Return: numpy array of shape (len(texts), num_words).
Arguments:
    texts: list of texts to vectorize.
    mode: one of "binary", "count", "tfidf", "freq" (default: "binary").

Solution

  • Looking at the source code, the result matrix is created with np.zeros() and no dtype keyword argument, so dtype falls back to the default in the function definition, which is float (float64). I think this data type was chosen to support every transformation mode, such as tfidf, which produces non-integer output. So I think you have two options:

    1. Change the source code: You can add a keyword argument such as dtype to the definition of texts_to_matrix, and change the line where the matrix is created to

    x = np.zeros((len(sequences), num_words), dtype=dtype)
    

    2. Use another tool for preprocessing: You can preprocess your text with another tool and then feed the result to the Keras network. For example, you can use scikit-learn's CountVectorizer like this:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # Any other CountVectorizer options can be passed alongside dtype.
    cv = CountVectorizer(dtype=np.int8)
    matrix = cv.fit_transform(texts).toarray()
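If you would rather not patch the installed Keras source, the idea behind option 1 can also be sketched outside the library. The helper below is hypothetical (it is not part of Keras) and covers only the "binary" mode. It assumes you already have integer sequences, for example from the tokenizer's texts_to_sequences; the sample sequences are invented for illustration.

```python
import numpy as np

def sequences_to_binary_matrix(sequences, num_words, dtype=np.int8):
    """Hypothetical helper: one row per sequence, 1 where a word index occurs."""
    x = np.zeros((len(sequences), num_words), dtype=dtype)
    for i, seq in enumerate(sequences):
        for j in seq:
            if 0 <= j < num_words:  # ignore indices outside the vocabulary
                x[i, j] = 1
    return x

# Invented sequences of word indices, standing in for tokenizer output.
sequences = [[1, 2, 3], [2, 2, 4]]
matrix = sequences_to_binary_matrix(sequences, num_words=5)
print(matrix.dtype, matrix.shape)
```

Because the matrix is allocated as int8 from the start, no float64 copy is ever created. The other modes ("count", "tfidf", "freq") are not handled here, and "tfidf" and "freq" would need a float type anyway.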