python machine-learning dataset pytorch dataloader

Python Dataset Class + PyTorch Dataloader: Stuck at __getitem__, how to get Index, Label and so on during Testing?

I have a problem that is probably small, but I have been stuck on it for quite a while now. I hope someone can help me with it. I am currently working with the KDD Cup 99 dataset, which I would like to train on with deep learning (a CNN).

I have a "Dataset" class that holds the pandas DataFrame, which I split into a normal and a validation dataset. So far, no problem. I load the data into a NumPy array, convert it to a tensor and then pass it to the DataLoader.

The Dataset class has these two important methods for iterating through it:

def __len__(self):
    return len(self.val_df)

def __getitem__(self, index):
    img, target = self.val_df[index][:-1], self.val_df[index][-1]
    return img, target, index

The DataLoader is created outside the class:

test_dataloader = DataLoader(datat.val_df, batch_size=10, shuffle=True)

In my Trainer class I have a for loop that should iterate through the DataLoader:

with torch.no_grad():
    for data in dataloader:
        inputs, labels, idx = data
        inputs = inputs.to(self.device)

But it doesn't work. I can't access the labels, the index and so on.

My question is: why? How can I access the labels and the index from the given Dataset via the DataLoader?

Thank you all for your help! Much appreciated.

Solution

The first argument to DataLoader is the dataset from which you want to load the data. That is usually a Dataset, but it does not have to be an instance of Dataset. As long as the object defines its length (__len__) and can be indexed (__getitem__ allows that), it is acceptable.
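To illustrate the point above, here is a minimal sketch of an object that does not subclass Dataset but still works with DataLoader. The class name Squares and its contents are made up for this example and are not from the question.

```python
from torch.utils.data import DataLoader


class Squares:
    """Hypothetical container: defines __len__ and __getitem__, no Dataset base class."""

    def __len__(self):
        return 6

    def __getitem__(self, index):
        # Each item is a tuple, so the default collate function
        # returns one batched tensor per tuple position.
        return index, index ** 2


loader = DataLoader(Squares(), batch_size=3, shuffle=False)

# Collect every batch as plain Python lists for inspection.
batches = [(i.tolist(), sq.tolist()) for i, sq in loader]
print(batches)
```

Because __getitem__ returns a tuple, each batch unpacks into two tensors, one per tuple position.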

You are passing datat.val_df to the DataLoader, which is presumably a NumPy array. A NumPy array has a length and can be indexed, so it can be used in the DataLoader. Since you pass that array directly, your dataset's __getitem__ is never called. Instead, the array itself is indexed, so every item is just datat.val_df[index], and each batch is a single tensor of stacked rows rather than an (inputs, labels, idx) tuple.
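A small sketch of what that looks like in practice, assuming val_df is a 2-D float array whose last column is the label. The array here is toy data, not the KDD data from the question.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Toy stand-in for datat.val_df: 4 rows, 4 feature columns + 1 label column.
val = np.arange(20, dtype=np.float32).reshape(4, 5)

# Passing the raw array: the DataLoader indexes the array itself.
loader = DataLoader(val, batch_size=2, shuffle=False)
batch = next(iter(loader))
print(type(batch), batch.shape)  # one tensor of stacked rows, no labels or indices

# Unpacking the batch into three names iterates over its rows instead,
# which fails unless the batch happens to contain exactly three rows.
unpack_error = None
try:
    inputs, labels, idx = batch
except ValueError as err:
    unpack_error = err
print(unpack_error)
```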

Instead of using the underlying data for the DataLoader, you have to use the dataset itself (datat):

test_dataloader = DataLoader(datat, batch_size=10, shuffle=True)
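For completeness, here is a self-contained sketch of the corrected setup. The class ToyDataset and the toy array are assumptions made for illustration; only the __len__ and __getitem__ logic and the DataLoader call mirror the question.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the question's Dataset class."""

    def __init__(self, array):
        # Last column is treated as the label, as in the question.
        self.val_df = torch.from_numpy(array)

    def __len__(self):
        return len(self.val_df)

    def __getitem__(self, index):
        row = self.val_df[index]
        return row[:-1], row[-1], index


# Toy data: 25 rows, 4 feature columns + 1 label column.
datat = ToyDataset(np.arange(125, dtype=np.float32).reshape(25, 5))
test_dataloader = DataLoader(datat, batch_size=10, shuffle=True)

seen = []
with torch.no_grad():
    for inputs, labels, idx in test_dataloader:
        # inputs: (batch, 4), labels: (batch,), idx: (batch,) original row indices
        seen.extend(idx.tolist())
```

Since idx holds the original row positions, it can be used to map predictions back to rows of the underlying data even when shuffle=True.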
