I'm working on the dailydialog dataset, which I've converted into a JSON file which looks something like this:
[{"response": "You know that is tempting but is really not good for our fitness.", "message": "Say, Jim, how about going for a few beers after dinner?"}, {"response": "Do you really think so? I don't. It will just make us fat and act silly. Remember last time?", "message": "What do you mean? It will help us to relax."}, {"response": "I suggest a walk over to the gym where we can play singsong and meet some of our friends.", "message": "I guess you are right. But what shall we do? I don't feel like sitting at home."}, {"response": "Sounds great to me! If they are willing, we could ask them to go dancing with us.That is excellent exercise and fun, too.", "message": "That's a good idea. I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them."}, {"response": "All right.", "message": "Please lie down over there."}]
So, each item has two keys - response and message.
This is my first time using PyTorch, so I was following a few online available resources. These are the relevant snippets of my code:
def tokenize_en(text):
return [tok.text for tok in spacy_en.tokenizer(text)]
src = Field(tokenize = tokenize_en,
init_token = '<sos>',
eos_token = '<eos>',
lower = True)
fields = {'response': ('r', src)}
train_data, test_data, validation_data = TabularDataset.splits(
path = 'FilePath',
train = 'trainset.json',
test = 'testset.json',
validation = 'validationset.json',
format = 'json',
fields = fields
)
Although no errors are raised, despite having many items in my JSON file, the train, test and validation datasets strangely have only 1 example each, as seen in this image: Image Showing the length of train_data, test_data and validation_data
I'd be really grateful if someone could point out the error to me.
Edit: I found out that the whole file is being treated as a single text string due to lack of indents in the file. But if I indent the JSON file, the TabularDataset function throws a JSONDecodeError to me, suggesting it can no more decode the file. How can I get rid of this problem?
I think the code is alright, but the issue is with your JSON file. Can you try removing the square brackets("[]") at the beginning and the end of the file? Probably that is the reason that Your Python file is reading it as one single object.