
How can I avoid this Numpy ArrayMemoryError when using scikit-learn's DictVectorizer on my data?


I'm getting a numpy.core._exceptions._ArrayMemoryError when I try to use scikit-learn's DictVectorizer on my data.

I'm using Python 3.9 in PyCharm on Windows 10, and my system has 64 GB of RAM.

I'm pre-processing text data for training a Keras POS-tagger. The data starts in this format, with lists of tokens for each sentence:

sentences = [['Eorum', 'fines', 'Nervii', 'attingebant'], ['ait', 'enim'], ['scriptum', 'est', 'enim', 'in', 'lege', 'Mosi'], ...]

I then use the following function to extract useful features from the dataset:

def get_word_features(words, word_index):
    """Return a dictionary of important word features for an individual word in the context of its sentence"""
    word = words[word_index]
    return {
        'word': word,
        'sent_len': len(words),
        'word_len': len(word),
        'first_word': word_index == 0,
        'last_word': word_index == len(words) - 1,
        'start_letter': word[0],
        'start_letters-2': word[:2],
        'start_letters-3': word[:3],
        'end_letter': word[-1],
        'end_letters-2': word[-2:],
        'end_letters-3': word[-3:],
        'previous_word': '' if word_index == 0 else words[word_index - 1],
        'following_word': '' if word_index == len(words) - 1 else words[word_index + 1]
    }

word_dicts = list()
for sentence in sentences:
    for index, token in enumerate(sentence):
        word_dicts.append(get_word_features(sentence, index))
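
For illustration, here is what this produces for the second token of the first example sentence:

get_word_features(['Eorum', 'fines', 'Nervii', 'attingebant'], 1)
# {'word': 'fines', 'sent_len': 4, 'word_len': 5,
#  'first_word': False, 'last_word': False,
#  'start_letter': 'f', 'start_letters-2': 'fi', 'start_letters-3': 'fin',
#  'end_letter': 's', 'end_letters-2': 'es', 'end_letters-3': 'nes',
#  'previous_word': 'Eorum', 'following_word': 'Nervii'}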

In this format the data isn't very large. It seems to only be about 3.3 MB.

Next I set up DictVectorizer, fit it to the data, and try to transform the data with it:

from sklearn.feature_extraction import DictVectorizer

dict_vectoriser = DictVectorizer(sparse=False)
dict_vectoriser.fit(word_dicts)
X_train = dict_vectoriser.transform(word_dicts)

At this point I'm getting this error:

numpy.core._exceptions._ArrayMemoryError: Unable to allocate 499. GiB for an array with shape (334043, 200643) and data type float64

This seems to suggest that DictVectorizer has massively increased the size of the data, to nearly 500 GB. Is this normal? Should the output really take up this much memory, or am I doing something wrong?
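
For what it's worth, the number does seem consistent with the reported shape: a dense float64 array of 334,043 rows by 200,643 one-hot feature columns works out to roughly 499 GiB.

# Sanity check on the allocation reported in the error message
rows, cols = 334043, 200643
print(rows * cols * 8 / 2**30)  # 8 bytes per float64 -> ~499 GiB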

I looked for solutions, and in this thread someone suggested allocating more virtual memory: open SystemPropertiesAdvanced in Windows, uncheck Automatically manage paging file size for all drives, and manually set the paging file size to a sufficiently large amount. That would be fine if the task needed around 100 GB, but I don't have enough storage to allocate 500 GB to a pagefile.

Is there any solution for this? Or do I just need to go and buy myself a larger drive just to have a big enough pagefile? This seems impractical, especially when the initial dataset wasn't even particularly large.


Solution

  • I worked out a solution. In case it's useful to anybody, here it is. I had been using a data generator later in my workflow, just to feed data to the GPU for processing in batches.

    import numpy as np
    from tensorflow.keras.utils import Sequence  # or keras.utils.Sequence, depending on your setup

    class DataGenerator(Sequence):
        def __init__(self, x_set, y_set, batch_size):
            self.x, self.y = x_set, y_set
            self.batch_size = batch_size

        def __len__(self):
            # number of batches per epoch
            return int(np.ceil(len(self.x) / float(self.batch_size)))

        def __getitem__(self, idx):
            # slice out one batch of inputs and labels
            batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
            batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
            return batch_x, batch_y
    

    Based on the comments I got here, I originally tried updating the generator above to return batch_x.todense() and changing my earlier code so that dict_vectoriser = DictVectorizer(sparse=True). As I mentioned in the comments, though, this didn't seem to work.
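
    For reference, the sketch below is roughly what that attempt looked like (reconstructed from the description above rather than the exact code I ran, with a hypothetical class name): the full sparse matrix is passed in as x_set and each batch is densified on the way out.

    # Rough reconstruction of the earlier, unsuccessful attempt
    dict_vectoriser = DictVectorizer(sparse=True)
    X_train = dict_vectoriser.fit_transform(word_dicts)  # scipy sparse matrix

    class SparseBatchGenerator(Sequence):  # hypothetical name, used only for this sketch
        def __init__(self, x_set, y_set, batch_size):
            self.x, self.y = x_set, y_set
            self.batch_size = batch_size

        def __len__(self):
            return int(np.ceil(self.x.shape[0] / float(self.batch_size)))

        def __getitem__(self, idx):
            batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
            batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
            # densify only the current batch before handing it to the model
            return batch_x.todense(), batch_y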

    I've now changed the generator so that the dict_vectoriser, once created and fitted to the data, is passed in as an argument, and it isn't called to transform the data until the generator is actually used.

    class DataGenerator(Sequence):
        def __init__(self, x_set, y_set, batch_size, x_vec):
            self.x, self.y = x_set, y_set
            self.batch_size = batch_size
            self.x_vec = x_vec
    
        def __len__(self):
            return int(np.ceil(len(self.x) / float(self.batch_size)))
    
        def __getitem__(self, idx):
            # vectorise only the current batch, so just batch_size rows are
            # held in memory as a dense array at any one time
            batch_x = self.x_vec.transform(self.x[idx * self.batch_size:(idx + 1) * self.batch_size])
            batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
            return batch_x, batch_y
    

    To call it you need to set the batch_size and provide labels, so below y_train is some encoded list of labels corresponding to the word_dicts training data.

    dict_vectoriser = DictVectorizer(sparse=False)
    dict_vectoriser.fit(word_dicts)
    train_generator = DataGenerator(word_dicts, y_train, 200, dict_vectoriser)
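
    With this setup only one batch is ever materialised as a dense array: at a batch size of 200 that's roughly 200 × 200,643 float64 values, or about 0.3 GiB per batch, rather than the full ~500 GiB matrix. The generator can then be passed straight to Keras for training, for example (assuming a compiled model named model; the epoch count is just illustrative):

    model.fit(train_generator, epochs=10)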