nlp, pytorch, bert-language-model, allennlp

Can't do lazy loading with allennlp


Currently I'm trying to implement lazy loading with allennlp, but I can't get it to work. My code is as follows.

def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # Loading Datasets
    train, dev, test = reader.read('train'), reader.read('dev'), reader.read('test')
    vocab = build_vocab(train)
    vocab.extend_from_instances(dev)

    # TODO: avoid memory consumption by loading lazily
    train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test'))

    train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)

    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()

    return model

When I comment out train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test')), the iterator doesn't work and training is conducted over 0 samples:

Building the vocabulary
100it [00:00, 442.15it/s]
building vocab: 100it [00:01, 95.84it/s]
100it [00:00, 413.40it/s]
100it [00:00, 138.38it/s]
You provided a validation dataset but patience was set to None, meaning that early stopping is disabled
0it [00:00, ?it/s]
0it [00:00, ?it/s]

I'd like to know if there is any way to avoid this. Thanks.


Supplement, added on May 5th.

Currently I am trying to avoid loading all of the sample data into memory before training the model.

So I have implemented the _read method as a generator. My understanding is that by calling this method and wrapping the result in a SimpleDataLoader, I can pass the data to the model.

In the DatasetReader, the code for the _read method looks like this. My understanding is that, as a generator, it should avoid holding the whole dataset in memory.

    @overrides
    def _read(self, train_dev_test_flag: str) -> Iterator[Instance]:
        '''
        :param train_dev_test_flag: 'train', 'dev', 'test'
        :return: iterator over instances
        '''
        if train_dev_test_flag == 'train':
            dataset = self._train_loader()
            random.shuffle(dataset)
        elif train_dev_test_flag == 'dev':
            dataset = self._dev_loader()
        elif train_dev_test_flag == 'test':
            dataset = self._test_loader()
        else:
            raise NotImplementedError(
                "{} is not a valid flag. Choose from train, dev and test".format(train_dev_test_flag))

        if self.config.debug:
            dataset = dataset[:self.config.debug_data_num]

        for data in tqdm(dataset):
            data = self._one_line_parser(data=data, train_dev_test_flag=train_dev_test_flag)
            yield self.text_to_instance(data)

Also, build_data_loaders looks like this.

def build_data_loaders(config,
                       train_data: List[Instance],
                       dev_data: List[Instance],
                       test_data: List[Instance]) -> Tuple[DataLoader, DataLoader, DataLoader]:

    train_loader = SimpleDataLoader(train_data, config.batch_size_for_train, shuffle=False)
    dev_loader = SimpleDataLoader(dev_data, config.batch_size_for_eval, shuffle=False)
    test_loader = SimpleDataLoader(test_data, config.batch_size_for_eval, shuffle=False)

    return train_loader, dev_loader, test_loader

But, for some reason I don't understand, this code doesn't work.

def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # Loading Datasets
    train, dev, test = reader.read('train'), reader.read('dev'), reader.read('test')
    vocab = build_vocab(train)
    vocab.extend_from_instances(dev)

    train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)

    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()

    return model

In this code, the SimpleDataLoader wraps the generator type as it is. I would like to do the kind of lazy loading that allennlp supported in version 0.9 (see the sketch below for what I mean).
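
This is roughly what I have in mind, as I remember the 0.9 API (a sketch from memory; lazy=True and BasicIterator no longer exist in current allennlp, so this does not run on 2.x):

from allennlp.data import Vocabulary
from allennlp.data.iterators import BasicIterator

# allennlp 0.9: lazy=True made read() return a lazy iterable that
# re-ran the _read generator on every pass instead of caching instances
reader = SmallJaWikiReader(config=config, lazy=True)
train = reader.read('train')

vocab = Vocabulary.from_instances(train)  # one streaming pass over the data

# the iterator batched instances on the fly; the 0.9 Trainer took the
# iterator plus the lazy dataset and never held everything in memory
iterator = BasicIterator(batch_size=config.batch_size_for_train)
iterator.index_with(vocab)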

But the 2.x code above iterates training over 0 instances, so currently I have added

train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test'))

before

train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)

And it works. But this means that I can't train or evaluate the model until all the instances are in memory. Instead, I want each batch to be loaded into memory only when it is needed for training.


Solution

  • The SimpleDataLoader is not capable of lazy loading. You should use the MultiProcessDataLoader instead. Setting max_instances_in_memory to a non-zero integer (usually some multiple of your batch size) will trigger lazy loading.
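  • This also explains the 0-instance training runs: reader.read('train') returns a one-shot generator, and build_vocab / extend_from_instances already consume it. By the time the SimpleDataLoader iterates the same object, it is exhausted, which is why materializing the instances with list() appeared to fix the problem.

As a minimal sketch, biencoder_training could be adapted like this (the config field names are taken from the question; the * 8 buffer size is an arbitrary choice, and any multiple of the batch size will do):

from allennlp.data import Vocabulary
from allennlp.data.data_loaders import MultiProcessDataLoader

def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # The loader owns the reader plus the "data path" ('train'/'dev'/'test'
    # here), so instances are read on demand rather than up front. With
    # max_instances_in_memory set, only that many instances are held
    # (and shuffled) at any one time.
    train_loader = MultiProcessDataLoader(
        reader, 'train',
        batch_size=config.batch_size_for_train,
        shuffle=True,
        max_instances_in_memory=config.batch_size_for_train * 8)
    dev_loader = MultiProcessDataLoader(
        reader, 'dev',
        batch_size=config.batch_size_for_eval,
        shuffle=False,
        max_instances_in_memory=config.batch_size_for_eval * 8)

    # Building the vocabulary still requires one pass over the data, but
    # iter_instances() streams instances instead of materializing a list.
    vocab = Vocabulary.from_instances(train_loader.iter_instances())
    vocab.extend_from_instances(dev_loader.iter_instances())
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)
    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()

    return model

With num_workers left at its default of 0, instances are read lazily in the main process; setting num_workers to 1 or more moves the reading into worker processes.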