python-imaging-library transform torchvision

Different results with torchvision transforms

Correct me if I am wrong. The 'classic' way to pass images through torchvision transforms is to use Compose, as in its doc page. This, however, requires passing a PIL Image as input. An alternative is to use ConvertImageDtype with torch.nn.Sequential. This bypasses the need for a PIL Image, and in my case it is much faster because I work with numpy arrays.

My problem is that the results are not identical. Below is an example with a custom Normalize. I would like to use torch.nn.Sequential (tr) because it is faster for my needs, but the summed squared error relative to Compose (tr2) is very large (~810).

from PIL import Image
import torchvision.transforms as T
import numpy as np
import torch

# Random 64x64 RGB image as a uint8 array, plus the same data as a PIL Image
o = np.random.rand(64, 64, 3) * 255
o = np.array(o, dtype=np.uint8)
i = Image.fromarray(o)

tr = torch.nn.Sequential(
    T.Resize(224, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ConvertImageDtype(torch.float),
    T.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711]),
)

tr2 = T.Compose([
    T.Resize(224, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
])

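# Tensor path: HWC uint8 array -> CHW uint8 tensor
out = tr(torch.from_numpy(o).permute(2,0,1).contiguous())

# PIL path
out2 = tr2(i)

# Summed squared error between the two pipelines
print(((out - out2) ** 2).sum())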

The interpolation method seems to matter A LOT, and if I use the default BILINEAR the error is ~7, but I need to use BICUBIC.

The problem seems to lie in ConvertImageDtype vs ToTensor, because if I replace ToTensor with ConvertImageDtype the results are identical (I cannot do it the other way around, because ToTensor is not a subclass of Module and cannot be used with nn.Sequential).

However, the following gives identical results:

tr = torch.nn.Sequential(
    T.ConvertImageDtype(torch.float),
)

tr2 = T.Compose([
    T.ToTensor(),
])

out = tr(torch.from_numpy(o).permute(2,0,1).contiguous())

out2 = tr2(i)

print(((out - out2) ** 2).sum())

This means that the interpolation changes something in the results, which matters only when I use ToTensor vs ConvertImageDtype.

Any input is appreciated.

Solution

This is documented in the torchvision docs for Resize:

The output image might be different depending on its type: when downsampling, the interpolation of PIL images and tensors is slightly different, because PIL applies antialiasing. This may lead to significant differences in the performance of a network. Therefore, it is preferable to train and serve a model with the same input types. See also below the antialias parameter, which can help making the output of PIL images and tensors closer.

Passing antialias=True produces almost identical results. This is interesting because the doc says that

it can be set to True for InterpolationMode.BILINEAR only mode.

Yet, I am using BICUBIC and it still works.