Tags: deep-learning, pytorch, conv-neural-network, artificial-intelligence, mnist

PyTorch crashes when training: probable image decoding error, tensor value issue, or corrupt image (RuntimeError)


Premise

I am fairly new to using PyTorch, and more often than not I get a segfault when training my neural network with a small custom dataset (10 images spread across 90 classes).

The output below is from these print statements, run twice: once with the MNIST dataset at idx 0 and once with my custom dataset at idx 0. Both datasets were compiled using CSV files formatted exactly the same way (img_name, class) together with an image directory; the MNIST subset contains 30 images, and my custom dataset contains 10:

example, label = dataset[0]
print(dataset[0])
print(example.shape)
print(label)

The first tensor is a 28×28 MNIST PNG converted to a tensor using:

image = torchvision.io.read_image(img_path).type(torch.FloatTensor)

This was so I had a known-good dataset to compare against. It uses the same custom dataset class as my own data.
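Spelled out, the read path is as follows (a minimal sketch; the file name is a placeholder):

import torch
import torchvision

# read_image() returns a uint8 tensor of shape [C, H, W] with values 0-255;
# .type() only changes the dtype, so the float tensor still holds 0-255 values
# when the transforms below run.
image = torchvision.io.read_image('mnist_subset/img_0.png').type(torch.FloatTensor)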

The neural net class is exactly the same as the one for my custom data, except it has 10 outputs as opposed to the 90 for my custom data.

The custom images come in varied sizes and have all been resized to 28×28 using the transforms.Compose() listed below. In this 10-image subset of the data, there are images with dimensions 800×170, 96×66, 64×34, 208×66, etc.

The second tensor output is from a PNG that was originally 800×170.

The transforms performed on both datasets are exactly the same:

tf = transforms.Compose([
    transforms.Resize(size=(28, 28)),
    transforms.Normalize(mean=[-0.5/0.5], std=[1/0.5]),
])

No target transform is performed.
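For reference, Normalize(mean=[-0.5/0.5], std=[1/0.5]) computes (x + 1) / 2, so with 0-255 inputs the background (0) maps to 0.5 and full white (255) maps to 128, which matches the tensors printed below. A sketch of a pipeline that scales to [0, 1] before normalizing (assuming the 0-255 float input from read_image) would be:

import torch
from torchvision import transforms

# Alternative pipeline sketch: scale 0-255 floats to [0, 1], then normalize
# to roughly [-1, 1] with a conventional mean/std of 0.5.
tf = transforms.Compose([
    transforms.Resize(size=(28, 28)),
    transforms.Lambda(lambda x: x / 255.0),       # 0-255 -> [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),  # [0, 1] -> [-1, 1]
])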

Output of the tensor, tensor size, and class, with the train/test results at the end

(tensor([[[  0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,
            0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,
            0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,
            0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000],
         ...,
         [  0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,
            0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,  32.5000,
          127.0000, 106.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,
            0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000],
         ...,
         [  0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,
            0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,
            0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,
            0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000,   0.5000]]]), 1)
(tensor output abridged: background pixels are all 0.5000; digit-stroke pixels range from about 8 to 128)
torch.Size([1, 28, 28])
1

Train Epoch: 1 [0/25 (0%)]  Loss: -1.234500

Test set: Average loss: -1.6776, Accuracy: 1/5 (20%)

(tensor([[[68.1301, 67.3571, 68.4286, 67.9375, 69.5536, 69.2143, 69.0026,
           69.2283, 70.4464, 70.2857, 68.8839, 68.6071, 71.3214, 70.5102,
           71.0753, 71.9107, 71.5179, 71.5625, 73.6071, 71.9464, 73.2513,
           72.5804, 73.5000, 74.1429, 72.7768, 72.9107, 73.1786, 74.9069],
         ...,
         [67.5868, 68.5179, 68.1786, 66.9018, 67.3215, 67.9822, 67.2628,
          65.4694, 49.2318, 43.7318, 39.5888, 47.7318, 29.2499, 28.3277,
          15.6326, 30.8215, 34.2502, 64.6428, 63.3572, 63.0001, 50.1688,
          51.6037, 77.5000, 75.8215, 73.7501, 74.9286, 74.3572, 74.6097]]]), 20)
(tensor output abridged: pixel values range from roughly 15 to 83, with no fixed background value)
torch.Size([1, 28, 28])
20
Train Epoch: 1 [0/8 (0%)]   Loss: -1.982941

Test set: Average loss: 0.0000, Accuracy: 0/2 (0%)

Error information

This output is from a run that completed with no segfault; the segfault occurs about 4 times out of 5. When a segfault does occur, it never happens while processing the MNIST subset; it only happens while attempting to access the custom dataset, whether at dataset[0] or literally any other index. However, if I run the simple print statements enough times on any of the indices, I can get them to output at least once without crashing. Here is an occasion when it crashed more gracefully (it printed the tensor info and size/class, but crashed upon training):

torch.Size([1, 28, 28])
65
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
    989         try:
--> 990             data = self._data_queue.get(timeout=timeout)
    991             return (True, data)

9 frames
/usr/lib/python3.7/queue.py in get(self, block, timeout)
    178                         raise Empty
--> 179                     self.not_empty.wait(remaining)
    180             item = self._get()

/usr/lib/python3.7/threading.py in wait(self, timeout)
    299                 if timeout > 0:
--> 300                     gotit = waiter.acquire(True, timeout)
    301                 else:

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
     65         # Python can still get and update the process status successfully.
---> 66         _error_if_any_worker_fails()
     67         if previous_handler is not None:

RuntimeError: DataLoader worker (pid 1132) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-9-02c9a53ca811> in <module>()
     68 
     69 if __name__ == '__main__':
---> 70     main()

<ipython-input-9-02c9a53ca811> in main()
     60 
     61     for epoch in range(1, args.epochs + 1):
---> 62         train(args, model, device, train_loader, optimizerAdadelta, epoch)
     63         test(model, device, test_loader)
     64         scheduler.step()

<ipython-input-6-93be0b7e297c> in train(args, model, device, train_loader, optimizer, epoch)
      2 def train(args, model, device, train_loader, optimizer, epoch):
      3     model.train()
----> 4     for batch_idx, (data, target) in enumerate(train_loader):
      5         data, target = data.to(device), target.to(device)
      6         optimizer.zero_grad()

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
   1184 
   1185             assert not self._shutdown and self._tasks_outstanding > 0
-> 1186             idx, data = self._get_data()
   1187             self._tasks_outstanding -= 1
   1188             if self._dataset_kind == _DatasetKind.Iterable:

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _get_data(self)
   1140         elif self._pin_memory:
   1141             while self._pin_memory_thread.is_alive():
-> 1142                 success, data = self._try_get_data()
   1143                 if success:
   1144                     return data

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
   1001             if len(failed_workers) > 0:
   1002                 pids_str = ', '.join(str(w.pid) for w in failed_workers)
-> 1003                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
   1004             if isinstance(e, queue.Empty):
   1005                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 1132) exited unexpectedly

Generally speaking, however, the issue appears to 'crash for an unknown reason', and here is what my logs look like when that occurs:

(screenshot of Colab crash logs)

What I think is going on/what I have tried

I think something is wrong with the tensor values and how the image is being read. I am only working with at most 40 images at a time, so there is no reason the disk or RAM resources on Google Colab should be failing. I might be normalizing the data improperly; I have tried different values, but nothing has fixed it yet. Perhaps the images are corrupt?

I don't really have a solid grasp of what could be going on; otherwise, I would have already solved it. I think I have provided ample material for this to be a glaring issue to someone with expertise in the area. I put a lot of time into this post, and I hope someone is able to help me get to the bottom of the problem.

If there are any other obvious issues with my code, my use of the network, or the custom dataset, please let me know, as this is my first time working with PyTorch.

Thank you!

Additional information that I am not sure is relevant:

Custom dataset class:

# ------------ Custom Dataset Class ------------
import os

import pandas as pd
import torch
import torchvision
from torch.utils.data import Dataset

class PhytoplanktonImageDataset(Dataset):
  def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
    self.img_labels = pd.read_csv(annotations_file) # image names and labels loaded into img_labels
    self.img_dir = img_dir                          # directory containing all the images
    self.transform = transform                      # transforms to apply to images
    self.target_transform = target_transform        # transforms to apply to labels

  def __len__(self):
    return len(self.img_labels) # number of rows in the csv file

  def __getitem__(self, idx):
    img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
    image = torchvision.io.read_image(path=img_path) # uint8 tensor of shape [C, H, W], values 0-255
    image = image.type(torch.FloatTensor)            # now a FloatTensor (not a ByteTensor)
    label = self.img_labels.iloc[idx, 1]
    if self.transform:
      image = self.transform(image)
    if self.target_transform:
      label = self.target_transform(label)
    return image, label
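For completeness, the dataset is constructed roughly like this (the file and directory names here are placeholders, not my actual paths):

# Hypothetical instantiation, using the transforms defined earlier.
dataset = PhytoplanktonImageDataset(
    annotations_file='annotations.csv',  # CSV of (img_name, class) rows
    img_dir='all_images/sample_10',      # directory containing the images
    transform=tf,
    target_transform=None,
)
example, label = dataset[0]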

NN class (the only thing changed for MNIST is that the final nn.Linear() has 10 outputs):

import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()  # [N, 1, 28, 28] -> [N, 784]
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 90),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
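A quick shape sanity check (a sketch, with a random batch standing in for real data):

# Four random 1x28x28 "images"; flatten turns each into a 784-vector,
# and the stack maps it to 90 class scores.
model = NeuralNetwork()
x = torch.rand(4, 1, 28, 28)
print(model(x).shape)  # torch.Size([4, 90])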

Args used:

args = parser.parse_args(['--batch-size', '64', '--test-batch-size', '64',
                          '--epochs', '1', '--lr', '0.01', '--gamma', '0.7', '--seed', '4',
                          '--log-interval', '10'])

Edit: I was able to get the following graceful exit on one of the runs (this traceback was a ways into the __getitem__ call):

<ipython-input-3-ae5ff8635158> in __getitem__(self, idx)
     13     img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0]) # image path
     14     print(img_path)
---> 15     image = torchvision.io.read_image(path=img_path) # Reading image to 1 dimensional GRAY Tensor uint between 0-255
     16     image = image.type(torch.FloatTensor) # Now a FloatTensor (not a ByteTensor)
     17     label = self.img_labels.iloc[idx,1] # getting label from csv

/usr/local/lib/python3.7/dist-packages/torchvision/io/image.py in read_image(path, mode)
    258     """
    259     data = read_file(path)
--> 260     return decode_image(data, mode)

/usr/local/lib/python3.7/dist-packages/torchvision/io/image.py in decode_image(input, mode)
    237         output (Tensor[image_channels, image_height, image_width])
    238     """
--> 239     output = torch.ops.image.decode_image(input, mode.value)
    240     return output
    241 

RuntimeError: Internal error.

Here is the image path being printed just before the decoding fails: /content/gdrive/My Drive/Colab Notebooks/all_images/sample_10/D20190926T145532_IFCB122_00013.png (the image itself is attached).

Information about this image:

  • Color Model: Gray
  • Depth: 16
  • Pixel Height: 50
  • Pixel Width: 80
  • Image DPI: 72 pixels per inch
  • File size: 3,557 bytes
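One way I can check whether specific files fail to decode is to loop over them directly, outside of any DataLoader (a diagnostic sketch; the CSV name is a placeholder):

import os
import pandas as pd
import torchvision

# Try to decode every image up front, so a bad file raises here rather
# than inside a DataLoader worker process.
labels = pd.read_csv('annotations.csv')
img_dir = '/content/gdrive/My Drive/Colab Notebooks/all_images/sample_10'
for name in labels.iloc[:, 0]:
    path = os.path.join(img_dir, name)
    try:
        torchvision.io.read_image(path)
    except RuntimeError as e:
        print(f'failed to decode {path}: {e}')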


Solution

  • I suggest taking a look at the num_workers param of your DataLoader. If your num_workers param is too high, it may be causing this error, so I suggest lowering it to zero, or reducing it until you no longer get this error.
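
    In code, that suggestion looks roughly like this (a sketch; the dataset names and batch size are assumptions, not taken from the question):

    from torch.utils.data import DataLoader

    # num_workers=0 loads data in the main process, so a crash while decoding
    # an image surfaces as a normal Python traceback instead of a worker
    # being killed by a segmentation fault.
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=0)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=0)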

    Sarthak