
PyTorch: How to prepare a 1D dataset from a pandas DataFrame?


I am trying to build a 1D Dataset from a pandas DataFrame, but the shape of the items it returns is not what I expected.

I wrote the following code to build the dataset from a pandas DataFrame of size 8000x512:

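import pandas as pd
import torch
from torch.utils.data import Dataset

# create dataset
class carte_dataset(Dataset):
    def __init__(self, root):
        self.root = root  # path to the CSV file
        self.df = pd.read_csv(root, index_col=0)
        # every column after the first holds the features
        self.X = torch.tensor(self.df.iloc[:, 1:].values)
        # the first column holds regi_no
        self.regi_no = self.df.iloc[:, 0].values

    def __len__(self):
        return len(self.regi_no)

    def __getitem__(self, idx):
        return self.X[idx], self.regi_no[idx]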

Then I checked the tensor size:

dataset = carte_dataset(root)
data, _ = dataset[0]
data.size()

I expected the size to be torch.Size([1,512]), but it was torch.Size([512]).

Is this an appropriate way to build a 1D dataset from a pandas DataFrame? If not, how should I revise the code?


Solution

  • What you need to do is wrap the dataset in a dataloader, which will:

    1. retrieve the individual element pairs from the underlying dataset: self.X[idx], self.regi_no[idx], i.e. a tensor shaped (512,) and a scalar.

    2. collate them into two batches of inputs/labels shaped (bs, 512) and (bs,), where bs is the batch size. This assumes regi_no is numeric; with string values, the default collation returns a list of length bs instead.

    The standard dataloader utility in PyTorch is torch.utils.data.DataLoader:

    >>> from torch.utils.data import DataLoader
    >>> dataloader = DataLoader(dataset, batch_size=1, shuffle=False)

    Then you can iterate through the dataset via the dataloader:

    >>> for x, y in dataloader:
    ...     # x shaped (1, 512), corresponds to [X[0]]
    ...     # y shaped (1,), corresponds to [regi_no[0]] (assuming numeric regi_no)
    ...     break  # stop after inspecting the first batch
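
To check these shapes without the original CSV, here is a minimal, self-contained sketch. The class name FrameDataset, the synthetic 8-row DataFrame, and the integer regi_no column are assumptions made for illustration, not part of the question's data. The last line also shows that a single sample can be given a leading dimension with unsqueeze(0), should a (1, 512) tensor be needed outside a dataloader.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class FrameDataset(Dataset):
    """Hypothetical stand-in for carte_dataset that takes a DataFrame directly."""

    def __init__(self, df):
        # First column: identifier; remaining columns: features.
        self.ids = df.iloc[:, 0].values
        self.X = torch.tensor(df.iloc[:, 1:].values, dtype=torch.float32)

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        return self.X[idx], self.ids[idx]


# Synthetic data (assumption): 8 rows, one integer id column + 512 features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(8, 512)))
df.insert(0, "regi_no", np.arange(8))

dataset = FrameDataset(df)

# A single item has no batch dimension.
sample, sample_id = dataset[0]
sample_shape = tuple(sample.shape)

# The DataLoader stacks items and adds the batch dimension.
loader = DataLoader(dataset, batch_size=4, shuffle=False)
x, y = next(iter(loader))
batch_x_shape = tuple(x.shape)
batch_y_shape = tuple(y.shape)

# Adding a leading dimension to one sample by hand.
unsqueezed_shape = tuple(sample.unsqueeze(0).shape)

print(sample_shape, batch_x_shape, batch_y_shape, unsqueezed_shape)
```

With batch_size=1, as in the answer above, the same loader would yield x shaped (1, 512). Leaving __getitem__ returning a (512,) tensor is the usual design, since the dataloader is what adds the batch dimension.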