The official AllenNLP documentation suggests specifying "validation_data_path" in the configuration file, but what if one wants to construct a dataset from a single source and then randomly split it into train and validation datasets with a given ratio?
Does AllenNLP support this? I would greatly appreciate your comments.
AllenNLP does not have this functionality yet, but we are working on some stuff to get there.
In the meantime, here is how I did it for the VQAv2 reader: https://github.com/allenai/allennlp-models/blob/main/allennlp_models/vision/dataset_readers/vqav2.py#L354
This reader supports Python slicing syntax: for example, you can specify a data_path of "my_source_file[:1000]" to take the first 1000 instances from my_source_file. You can also supply multiple paths by setting data_path: ["file1", "file2[:1000]", "file3[1000:]"].
You can probably steal the top two blocks in that file (lines 354 to 369) and put them into your own dataset reader to achieve the same result.
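To make the idea concrete, here is a minimal, hypothetical sketch of how a dataset reader might parse that slicing syntax out of a data_path string. This is not the actual code from the VQAv2 reader (see the linked lines for that); the function name and regex are illustrative assumptions.

```python
import re


def parse_sliced_path(data_path: str):
    """Split a path like "my_source_file[:1000]" into (filename, slice).

    Hypothetical sketch of the slicing idea, not the VQAv2 implementation.
    Only the "start:stop" colon form is handled here.
    """
    match = re.match(r"^(.*?)\[([^\]]*)\]$", data_path)
    if match is None:
        # No slice suffix: take all instances from the file.
        return data_path, slice(None)
    filename, slice_expr = match.groups()
    parts = slice_expr.split(":")
    start = int(parts[0]) if parts[0] else None
    stop = int(parts[1]) if len(parts) > 1 and parts[1] else None
    return filename, slice(start, stop)


# Stand-in for instances read from a file; a real reader would yield
# Instance objects here.
instances = list(range(5000))

val_file, val_slice = parse_sliced_path("my_source_file[:1000]")
train_file, train_slice = parse_sliced_path("my_source_file[1000:]")

validation = instances[val_slice]   # first 1000 instances
train = instances[train_slice]      # the remaining instances
```

With this scheme, a fixed train/validation split from a single source file falls out of just choosing complementary slices; a truly random split would still need shuffling on your side (e.g. with a fixed seed) before slicing.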