I'm getting a numpy.core._exceptions._ArrayMemoryError when I try to use scikit-learn's DictVectorizer on my data. I'm using Python 3.9 in PyCharm on Windows 10, and my system has 64 GB of RAM.
I'm pre-processing text data for training a Keras POS-tagger. The data starts in this format, with lists of tokens for each sentence:
sentences = [['Eorum', 'fines', 'Nervii', 'attingebant'], ['ait', 'enim'], ['scriptum', 'est', 'enim', 'in', 'lege', 'Mosi'], ...]
I then use the following function to extract useful features from the dataset:
def get_word_features(words, word_index):
    """Return a dictionary of important word features for an individual word in the context of its sentence"""
    word = words[word_index]
    return {
        'word': word,
        'sent_len': len(words),
        'word_len': len(word),
        'first_word': word_index == 0,
        'last_word': word_index == len(words) - 1,
        'start_letter': word[0],
        'start_letters-2': word[:2],
        'start_letters-3': word[:3],
        'end_letter': word[-1],
        'end_letters-2': word[-2:],
        'end_letters-3': word[-3:],
        'previous_word': '' if word_index == 0 else words[word_index - 1],
        'following_word': '' if word_index == len(words) - 1 else words[word_index + 1]
    }

word_dicts = list()
for sentence in sentences:
    for index, token in enumerate(sentence):
        word_dicts.append(get_word_features(sentence, index))
In this format the data isn't very large. It seems to only be about 3.3 MB.
Next I set up DictVectorizer, fit it to the data, and try to transform the data with it:
from sklearn.feature_extraction import DictVectorizer
dict_vectoriser = DictVectorizer(sparse=False)
dict_vectoriser.fit(word_dicts)
X_train = dict_vectoriser.transform(word_dicts)
At this point I'm getting this error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 499. GiB for an array with shape (334043, 200643) and data type float64
This seems to suggest that DictVectorizer has massively increased the size of the data, to nearly 500 GB. Is this normal? Should the output really take up this much memory, or am I doing something wrong?
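For what it's worth, the figure in the error message is consistent with a plain dense array of that shape. A quick back-of-the-envelope check, using only the shape and dtype quoted in the error (nothing else is assumed):

```python
# Shape and dtype taken from the error message above
n_rows, n_cols = 334043, 200643
bytes_per_value = 8  # float64 uses 8 bytes per element

total_bytes = n_rows * n_cols * bytes_per_value
total_gib = total_bytes / 2**30  # bytes -> GiB

print(f"{total_gib:.1f} GiB")  # roughly 499 GiB, matching the error
```

So the size comes from storing every cell of a 334043 x 200643 matrix, including all the zeros, and not from anything unusual in the input data.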
I looked for solutions and in this thread someone suggested allocating more virtual memory by going into Windows settings and SystemPropertiesAdvanced, unchecking Automatically manage paging file size for all drives, then manually setting the paging file size to a sufficiently large amount. This would be fine if the task required about 100 GB, but I don't have enough storage to allocate 500 GB to the task.
Is there any solution for this? Or do I just need to go and buy myself a larger drive just to have a big enough pagefile? This seems impractical, especially when the initial dataset wasn't even particularly large.
I worked out a solution. In case it's useful to anybody, here it is. I had been using a data generator later in my workflow, just to feed data to the GPU for processing in batches.
import numpy as np
from tensorflow.keras.utils import Sequence  # import path assumes TensorFlow's bundled Keras

class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch, rounding up for a final partial batch
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y
Based on the comments I got here, I originally tried changing the generator to return batch_x.todense() and changing my code above so that dict_vectoriser = DictVectorizer(sparse=True). As I mentioned in the comments, though, this didn't seem to work.
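For reference, this is roughly what the sparse route looks like in isolation. It's a minimal sketch on made-up toy data (toy_dicts and the batch size of 2 are hypothetical, standing in for word_dicts and the real batch size), not a diagnosis of why it failed for me. One difference worth knowing: .toarray() returns a plain NumPy ndarray, whereas .todense() returns a numpy.matrix. Whether that was the cause of my problem, I can't say.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Toy feature dictionaries (hypothetical, standing in for word_dicts)
toy_dicts = [
    {'word': 'ait', 'word_len': 3, 'first_word': True},
    {'word': 'enim', 'word_len': 4, 'first_word': False},
    {'word': 'est', 'word_len': 3, 'first_word': False},
]

vec = DictVectorizer(sparse=True)
X_sparse = vec.fit_transform(toy_dicts)  # SciPy sparse matrix, not a dense array

# Densify only one batch of rows at a time
batch_size = 2
first_batch = X_sparse[0:batch_size].toarray()

print(vec.feature_names_)  # one column per string value, one per numeric feature
print(first_batch.shape)
```

String-valued features get one column per distinct value, while numeric and boolean features get a single column each, which is where the very wide matrix comes from.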
I've now changed the generator so that, once the dict_vectoriser is created and fitted to the data, it's passed as an argument to the data generator, and it's not called to transform the data until the generator is being used.
class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size, x_vec):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.x_vec = x_vec

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x_vec.transform(self.x[idx * self.batch_size:(idx + 1) * self.batch_size])
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y
To call it you need to set the batch_size and provide labels, so below y_train is some encoded list of labels corresponding to the training data (word_dicts here).
dict_vectoriser = DictVectorizer(sparse=False)
dict_vectoriser.fit(word_dicts)
train_generator = DataGenerator(word_dicts, y_train, 200, dict_vectoriser)
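To see why this fits in memory, here's a rough estimate of what a single dense batch costs. It assumes the fitted vectoriser produces the same 200643 columns reported in the error message and the default float64 output; the batch size of 200 is the one used above.

```python
import math

# Row and column counts taken from the original error message
n_rows, n_cols = 334043, 200643
batch_size = 200
bytes_per_value = 8  # float64

# Memory for one dense batch, in MiB
batch_mib = batch_size * n_cols * bytes_per_value / 2**20

# Number of batches per epoch, rounding up for the last partial batch
n_batches = math.ceil(n_rows / batch_size)

print(f"{batch_mib:.0f} MiB per batch, {n_batches} batches per epoch")
```

Each batch is only a few hundred MiB, so a smaller batch_size trades memory for more batches per epoch.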