Tags: json, nlp, pytorch, dataloader, torchtext

Loading a JSON file using torchtext


I'm working on the DailyDialog dataset, which I've converted into a JSON file that looks something like this:

[{"response": "You know that is tempting but is really not good for our fitness.", "message": "Say, Jim, how about going for a few beers after dinner?"}, {"response": "Do you really think so? I don't. It will just make us fat and act silly. Remember last time?", "message": "What do you mean? It will help us to relax."}, {"response": "I suggest a walk over to the gym where we can play singsong and meet some of our friends.", "message": "I guess you are right. But what shall we do? I don't feel like sitting at home."}, {"response": "Sounds great to me! If they are willing, we could ask them to go dancing with us.That is excellent exercise and fun, too.", "message": "That's a good idea. I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them."}, {"response": "All right.", "message": "Please lie down over there."}]

So, each item has two keys - response and message.

This is my first time using PyTorch, so I was following a few resources available online. These are the relevant snippets of my code:

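import spacy
# Field and TabularDataset moved to torchtext.legacy.data in torchtext 0.9
from torchtext.data import Field, TabularDataset

# Assumption: the original snippet does not show how spacy_en is created;
# 'en_core_web_sm' is used here only as a common English model name.
spacy_en = spacy.load('en_core_web_sm')

def tokenize_en(text):
    # Split a string into a list of token strings with spaCy's tokenizer
    return [tok.text for tok in spacy_en.tokenizer(text)]

src = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)

# Map the JSON key 'response' to an attribute named 'r' processed by src
fields = {'response': ('r', src)}

train_data, test_data, validation_data = TabularDataset.splits(
    path='FilePath',
    train='trainset.json',
    test='testset.json',
    validation='validationset.json',
    format='json',
    fields=fields
)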

No errors are raised, but even though my JSON files contain many items, the train, test and validation datasets each contain only 1 example, as the lengths of train_data, test_data and validation_data show.

I'd be really grateful if someone could point out the error to me.

Edit: I found out that the whole file is being treated as a single text string, apparently because the file has no indentation. But if I indent the JSON file, TabularDataset raises a JSONDecodeError, suggesting it can no longer decode the file. How can I fix this?


Solution

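  • I think the code is fine; the issue is with your JSON file. Can you try removing the square brackets ("[]") at the beginning and the end of the file? They are probably the reason the file is being read as one single object. With format = 'json', TabularDataset parses each line of the file as a separate JSON object, so every item should sit on its own line, with no commas between items. That would also explain the JSONDecodeError after indenting: each object then spans several lines, and no single line is valid JSON.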
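
If editing the file by hand is impractical, the conversion can be scripted. The following is a minimal sketch, assuming each input file holds a single JSON array of objects like the sample in the question. The function name json_array_to_jsonl and the output file name used below are hypothetical, chosen only for illustration; the code uses only the standard json module.

```python
import json

def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a file holding one JSON array as one JSON object per line.

    Hypothetical helper for illustration; returns the number of items written.
    """
    with open(src_path, encoding="utf-8") as f:
        items = json.load(f)  # the whole array, loaded as a list of dicts

    with open(dst_path, "w", encoding="utf-8") as f:
        for item in items:
            # json.dumps without indent keeps each object on a single line;
            # newlines inside strings are escaped as \n
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

    return len(items)
```

For example, calling json_array_to_jsonl('trainset.json', 'trainset_lines.json') and then passing train = 'trainset_lines.json' to TabularDataset.splits should give one example per item, provided the other two files are converted the same way.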