The tokenized matrix defaults to float64, which takes more memory than necessary; I want it in int8 to save space.
This is the method I'm talking about:
texts_to_matrix(texts, mode='binary'):
Return: numpy array of shape (len(texts), num_words).
Arguments:
texts: list of texts to vectorize.
mode: one of "binary", "count", "tfidf", "freq" (default: "binary").
Taking a look at the source code, the result matrix is created using np.zeros() with no dtype keyword argument, so dtype falls back to the default set in the function definition, which is float. I think this data type was chosen to support all transformation modes, such as tfidf, which produces non-integer output.
So I think you have two options:
1. Change the source code: You can add a keyword argument such as dtype to the definition of texts_to_matrix and change the line where the matrix is created to
x = np.zeros((len(sequences), num_words), dtype=dtype)
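As a sketch of what that patch amounts to, here is a hypothetical, simplified stand-in for the patched function (binary mode only, operating on already-tokenized index sequences, not the actual Keras source):

```python
import numpy as np

def texts_to_matrix_patched(sequences, num_words, dtype=np.int8):
    """Hypothetical simplified version: 'binary' mode only."""
    # The key change: pass dtype through to np.zeros()
    x = np.zeros((len(sequences), num_words), dtype=dtype)
    for i, seq in enumerate(sequences):
        for idx in seq:
            x[i, idx] = 1  # mark word presence
    return x

m = texts_to_matrix_patched([[0, 2], [1]], num_words=3)
print(m.dtype)    # int8
print(m.tolist()) # [[1, 0, 1], [0, 1, 0]]
```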
2. Use another tool for preprocessing: You can preprocess your text with another tool and then feed the result to the Keras network. For example, you can use scikit-learn's CountVectorizer:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(dtype=np.int8)  # plus any other vectorizer options you need
matrix = cv.fit_transform(texts).toarray()
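To confirm the savings end to end, here is a small self-contained check (assuming scikit-learn is installed; the sample texts are made up) comparing the int8 matrix against an explicit float64 one:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the cat sat on the mat", "the dog sat"]

m8 = CountVectorizer(dtype=np.int8).fit_transform(texts).toarray()
m64 = CountVectorizer(dtype=np.float64).fit_transform(texts).toarray()

print(m8.dtype)              # int8
print(m64.nbytes, m8.nbytes) # float64 takes 8x the memory
```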