github, paper

In a recent paper, Learning to Resize Images for Computer Vision Tasks by Google AI, the authors proposed a new way to resize images before feeding the image to an image classifier. Normally, we would resize the images to 224x224 spatial resolution using some interpolation method (like bilinear, bicubic) but what if we replaced the fixed resizer with a learnable resizer? This is the question that the paper aims to answer.

In summary, we want to learn a model that will learn to resize images (e.g. input image size is 448x448 and the output is of size 224x224 which will be passed to an image classifier, like ResNet). The resized images would not have the same visual quality as the original image but instead it will improve task performance. An example of the images outputted by this model is shown in Figure 1 and as we can see these clearly do not resemble real life images.

Why we need to resize images?

The first step in almost every image processing pipeline is the down-scaling of images to a fixed resolution. This can be done using biliearn interpolation, nearest neighbor sampling (in case of image segmentation), cropping the image to desired resolution.

The reasons for doing this include the memory limitations on the GPU. At high resolution, we have to reduce the batch size and at the same time training and inference are slower. Also, mini-batch gradient descent requires inputs to be of the same size.

Resizer model

The proposed Resizer model is used to down-scale the images. Now due to the inherent limitations discussed above, we have to pass a fixed size images to the Resizer model also, which may seem counterintuitve in the start. But here is the pipiline with and without the resizer model.

Without resizer model

Original Image -> Downscale to 224x224 -> ResNet

With resizer model

Original Image -> Downscale to 448x448 -> Resizer to downscale to 224x224 -> ResNet

As we can see from above, we are downscaling the image to 448x448, which has 4x times more pixels that 224x224. This means the model has 4x times more information for the downstream task.

Note: The value 448x448 is completely arbitrary and you can downscale to whatever resolution you want.

The proposed resizer model is shown in Figure 2.

The architecture of the model is pretty simple and the PyTorch code for the network is shown below

class ResBlock(nn.Module):
    def __init__(self, channel_size: int, negative_slope: float = 0.2):
        self.block = nn.Sequential(
            nn.Conv2d(channel_size, channel_size, kernel_size=3, padding=1,
            nn.LeakyReLU(negative_slope, inplace=True),
            nn.Conv2d(channel_size, channel_size, kernel_size=3, padding=1,

    def forward(self, x):
        return x + self.block(x)

class Resizer(nn.Module):
    def __init__(self,
                 interpolate_mode: str = "bilinear",
                 input_image_size: int = 448,
                 output_image_size: int = 224,
                 num_kernels: int = 16,
                 num_resblocks: int = 2,
                 slope: float = 0.2,
                 in_channels: int = 3,
                 out_channels: int = 3):
        scale_factor = input_image_size / output_image_size

        n = num_kernels
        r = num_resblocks

        self.module1 = nn.Sequential(
            nn.Conv2d(3, n, kernel_size=7, padding=3),
            nn.LeakyReLU(slope, inplace=True),
            nn.Conv2d(n, n, kernel_size=1),
            nn.LeakyReLU(slope, inplace=True),

        resblocks = []
        for i in range(r):
            resblocks.append(ResBlock(n, slope))
        self.resblocks = nn.Sequential(*resblocks)

        self.module3 = nn.Sequential(
            nn.Conv2d(n, n, kernel_size=3, padding=1, bias=False),

        self.module4 = nn.Conv2d(n, out_channels, kernel_size=7,

        self.interpolate = partial(F.interpolate,

    def forward(self, x):
        residual = self.interpolate(x)

        out = self.module1(x)
        out_residual = self.interpolate(out)

        out = self.resblocks(out_residual)
        out = self.module3(out)
        out = out + out_residual

        out = self.module4(out)

        out = out + residual

        return out

Now we can directly pass the output of this model as input to the baseline model (ResNet-50) and do image classification. Both the models would be learned jointly. The code to do this is shown below

def forward(self, x):
    if self.resizer_model is not None:
        x = self.resizer_model(x)
    x = self.base_model(x)
    return x


Imagenette and Imagewoof datasets are used to test the performance of the proposed model. These datasets contains 10 classes each with Imagenette being the easier dataset and Imagewoof containing the 10 hardest classes. The instructions on how to download the datasets are provided in the README.

The results of the model with ResNet-50 as the baseline model are shown below

Dataset Model Acc
Imagenette ResNet-50 81.07
Resizer + ResNet-50 82.16
Imagewoof ResNet-50 58.13
Resizer + ResNet-50 65.20

As we can see from the above table, using the proposed Resizer model does give performance improvement. But it also makes the task of model interpretation even harder. Now instead of interpreting the results from a single model, we have to consider two models which may deploying this model even harder. Also, the trained baseline models cannot be directly used for transfer learning as the distribution of the images produced by the resizer model is very different from the original images, which means we have to fine-tune the resizer model during transfer learning.

The instructions on reporducing the experiments are provided in the README.

twitter, linkedin, github