nlp, pytorch, bert-language-model, allennlp

Can't do lazy loading with allennlp


Currently I'm trying to implement lazy loading with allennlp, but I can't get it to work. My code is as follows.

def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # Loading Datasets
    train, dev, test = reader.read('train'), reader.read('dev'), reader.read('test')
    vocab = build_vocab(train)
    vocab.extend_from_instances(dev)

    # TODO: avoid memory consumption by loading lazily
    train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test'))

    train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)

    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()

    return model

When I comment out train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test')), the iterator doesn't work and training is conducted over 0 samples:

Building the vocabulary
100it [00:00, 442.15it/s]
building vocab: 100it [00:01, 95.84it/s]
100it [00:00, 413.40it/s]
100it [00:00, 138.38it/s]
You provided a validation dataset but patience was set to None, meaning that early stopping is disabled
0it [00:00, ?it/s]
0it [00:00, ?it/s]

I'd like to know if there is any way to avoid this. Thanks.


Supplement, added on May 5th.

Currently I am trying to avoid loading all of the sample data into memory before training the model.

So I have implemented the _read method as a generator. My understanding is that by calling this method and wrapping the result in a SimpleDataLoader, I can pass the data to the model.

In the DatasetReader, the code for the _read method looks like this. My understanding is that, as a generator, it should avoid holding the whole dataset in memory.

    @overrides
    def _read(self, train_dev_test_flag: str) -> Iterator[Instance]:
        '''
        :param train_dev_test_flag: 'train', 'dev', 'test'
        :return: iterator over instances
        '''
        if train_dev_test_flag == 'train':
            dataset = self._train_loader()
            random.shuffle(dataset)
        elif train_dev_test_flag == 'dev':
            dataset = self._dev_loader()
        elif train_dev_test_flag == 'test':
            dataset = self._test_loader()
        else:
            raise NotImplementedError(
                "{} is not a valid flag. Choose from train, dev and test".format(train_dev_test_flag))

        if self.config.debug:
            dataset = dataset[:self.config.debug_data_num]

        for data in tqdm(dataset):
            data = self._one_line_parser(data=data, train_dev_test_flag=train_dev_test_flag)
            yield self.text_to_instance(data)

Also, build_data_loaders looks like this.

def build_data_loaders(config,
                       train_data: List[Instance],
                       dev_data: List[Instance],
                       test_data: List[Instance]) -> Tuple[DataLoader, DataLoader, DataLoader]:

    train_loader = SimpleDataLoader(train_data, config.batch_size_for_train, shuffle=False)
    dev_loader = SimpleDataLoader(dev_data, config.batch_size_for_eval, shuffle=False)
    test_loader = SimpleDataLoader(test_data, config.batch_size_for_eval, shuffle=False)

    return train_loader, dev_loader, test_loader

But, for some reason I don't understand, this code doesn't work.

def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # Loading Datasets
    train, dev, test = reader.read('train'), reader.read('dev'), reader.read('test')
    vocab = build_vocab(train)
    vocab.extend_from_instances(dev)

    train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)

    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()

    return model

In this code, the SimpleDataLoader wraps the generator type as it is. I would like to do the kind of lazy loading that allennlp supported in version 0.9 (see the sketch below for what I mean).
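
This is roughly what I have in mind, as I remember the 0.9 API (a sketch from memory; lazy=True and BasicIterator no longer exist in current allennlp, so this does not run on 2.x):

from allennlp.data import Vocabulary
from allennlp.data.iterators import BasicIterator

# allennlp 0.9: lazy=True made read() return a lazy iterable that
# re-ran the _read generator on every pass instead of caching instances
reader = SmallJaWikiReader(config=config, lazy=True)
train = reader.read('train')

vocab = Vocabulary.from_instances(train)  # one streaming pass over the data

# the iterator batched instances on the fly; the 0.9 Trainer took the
# iterator plus the lazy dataset and never held everything in memory
iterator = BasicIterator(batch_size=config.batch_size_for_train)
iterator.index_with(vocab)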

But the 2.x code above iterates training over 0 instances, so currently I have added

train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test'))

before

train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)

And it works. But this means that I can't train or evaluate the model until all the instances are in memory. Instead, I want each batch to be loaded into memory only when it is needed for training.


Solution

  • The SimpleDataLoader is not capable of lazy loading. You should use the MultiProcessDataLoader instead. Setting max_instances_in_memory to a non-zero integer (usually some multiple of your batch size) will trigger lazy loading.
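  • This also explains the 0-instance training runs: reader.read('train') returns a one-shot generator, and build_vocab / extend_from_instances already consume it. By the time the SimpleDataLoader iterates the same object, it is exhausted, which is why materializing the instances with list() appeared to fix the problem.

As a minimal sketch, biencoder_training could be adapted like this (the config field names are taken from the question; the * 8 buffer size is an arbitrary choice, and any multiple of the batch size will do):

from allennlp.data import Vocabulary
from allennlp.data.data_loaders import MultiProcessDataLoader

def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # The loader owns the reader plus the "data path" ('train'/'dev'/'test'
    # here), so instances are read on demand rather than up front. With
    # max_instances_in_memory set, only that many instances are held
    # (and shuffled) at any one time.
    train_loader = MultiProcessDataLoader(
        reader, 'train',
        batch_size=config.batch_size_for_train,
        shuffle=True,
        max_instances_in_memory=config.batch_size_for_train * 8)
    dev_loader = MultiProcessDataLoader(
        reader, 'dev',
        batch_size=config.batch_size_for_eval,
        shuffle=False,
        max_instances_in_memory=config.batch_size_for_eval * 8)

    # Building the vocabulary still requires one pass over the data, but
    # iter_instances() streams instances instead of materializing a list.
    vocab = Vocabulary.from_instances(train_loader.iter_instances())
    vocab.extend_from_instances(dev_loader.iter_instances())
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)
    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()

    return model

With num_workers left at its default of 0, instances are read lazily in the main process; setting num_workers to 1 or more moves the reading into worker processes.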