A brief introduction to spatial transformer networks

The first model I got to implement as part of my Google Summer of Code project was a Spatial Transformer Network. A Spatial Transformer Network (STN) is a learnable module that can be placed inside a Convolutional Neural Network (CNN) to add spatial invariance: the ability of the model to recognize and identify features even when the input has been spatially transformed, for example by rotation, translation, or scaling. Spatial Transformers can be used inside a CNN for a variety of tasks; image classification is one example. Say the task is to classify handwritten digits, where the position, size, and orientation of the digit vary significantly from sample to sample. A Spatial Transformer extracts, transforms, and scales the region of interest in each sample, after which the CNN can accomplish the classification task.

The Spatial Transformer Network consists of 3 main components:

(i) Localization Network: This network takes as input a 4D tensor representing a batch of images (width × height × channels × batch size). It is a simple neural network with a few convolutional layers and a few dense layers, and it predicts the transformation parameters as output. These parameters determine the angle by which the input must be rotated, the amount of translation, and the scale factor required to focus on the region of interest in the input feature map.
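For example, a minimal localization network in Flux might look something like the sketch below; the layer sizes are illustrative assumptions, not the exact ones from my implementation.

```julia
using Flux

# Sketch of a localization network for 28x28 grayscale inputs in
# Flux's WHCN (width, height, channels, batch) layout. It outputs
# 6 affine parameters (θ11, θ12, θ13, θ21, θ22, θ23) per image.
localization = Chain(
    Conv((5, 5), 1 => 8, relu),   # 28x28x1 -> 24x24x8
    MaxPool((2, 2)),              # -> 12x12x8
    Conv((5, 5), 8 => 10, relu),  # -> 8x8x10
    MaxPool((2, 2)),              # -> 4x4x10
    Flux.flatten,                 # -> 160 x batch_size
    Dense(160, 32, relu),
    Dense(32, 6),                 # the 6 affine parameters
)
```

One detail worth noting from the STN paper: the final dense layer is usually initialized so that the network starts out predicting the identity transform.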

(ii) Sampling Grid Generator: For each image in the batch, the transformation parameters predicted by the localization network are applied in the form of an affine transformation matrix of size 2×3. An affine transformation is a transformation that preserves points, lines, and planes. After an affine transformation, parallel lines remain parallel. Rotation, scaling, and translation are all affine transformations.
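Written out, the transform T applies the 2×3 matrix A (built from the six predicted parameters) to each target grid point (x_t, y_t) in homogeneous coordinates:

$$
T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
$$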

Here, T is the affine transformation and A is the matrix representing it. θ11, θ12, θ21, θ22 encode the rotation (and scaling) of the image, while θ13 and θ23 determine the translation along the width and height respectively. Applying this transform to every point of the output grid gives a sampling grid of source indices.
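A minimal sketch of this step in plain Julia, assuming normalized coordinates in [-1, 1] as in the STN paper (the function name and details here are illustrative, not taken from my implementation):

```julia
# Sketch of an affine grid generator. θ is the 2x3 affine matrix
# predicted by the localization network; H and W are the output
# height and width. Returns a 2 x (H*W) matrix holding the source
# coordinates to sample for each target grid point.
function affine_grid(θ::AbstractMatrix, H::Integer, W::Integer)
    xs = repeat(collect(range(-1, 1, length=W)), inner=H)
    ys = repeat(collect(range(-1, 1, length=H)), outer=W)
    G = [xs'; ys'; ones(1, H * W)]  # homogeneous target coordinates
    return θ * G                    # source coordinates, 2 x (H*W)
end

θ = [cosd(45) -sind(45) 0;  # 45-degree rotation,
     sind(45)  cosd(45) 0]  # no translation
grid = affine_grid(θ, 28, 28)
```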

(iii) Bilinear interpolation on transformed indices: The sampling indices have now been affine transformed, so they generally no longer fall on integer pixel locations. For example, the point (1,1) maps to (√2, 0) after a 45-degree counterclockwise rotation of the axes. To find the pixel value at a transformed point, we therefore perform bilinear interpolation using the four closest pixel values.

To find the pixel value at a point (x, y), we take the four nearest grid points (floor(x), floor(y)), (ceil(x), floor(y)), (floor(x), ceil(y)), and (ceil(x), ceil(y)), where floor is the greatest-integer function and ceil is the ceiling function. Linear interpolation is done first along the x direction and then along the y direction. Doing this for every point returns the fully transformed image, with the appropriate pixel value at each transformed index.
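A minimal sketch of bilinear sampling at a single point (an illustration rather than the exact code from the repository linked below; border clamping is just one possible boundary policy):

```julia
# Sketch of bilinear interpolation at a (possibly fractional)
# point (x, y) of a 2D image, indexed as img[row, column].
function bilinear(img::AbstractMatrix, x::Real, y::Real)
    H, W = size(img)
    x = clamp(x, 1, W)
    y = clamp(y, 1, H)
    x0, x1 = floor(Int, x), ceil(Int, x)
    y0, y1 = floor(Int, y), ceil(Int, y)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two bracketing rows, then along y.
    top    = (1 - dx) * img[y0, x0] + dx * img[y0, x1]
    bottom = (1 - dx) * img[y1, x0] + dx * img[y1, x1]
    return (1 - dy) * top + dy * bottom
end
```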

The code for a pure Julia implementation of the Spatial Transformer Network can be found here: https://github.com/thebhatman/Spatial-Transformer-Network/blob/master/src/stn.jl. I tested the functionality of my spatial transformer module on some images. Below are some example images of the output of the transformation function. The image on the left is the input to the transformer module and the image on the right is the output.

  1. Zoom in on a region of interest.

  2. Enlarge the face and rotate it by 45 degrees.

  3. Translate the image across its width, toward the center.

It is clear from the examples above that the spatial transformer module is capable of performing any affine transformation. During the implementation, I spent a lot of time understanding how array reshape, permutedims, and concatenation work, because it was difficult to debug how pixels and indices moved around when I used these functions. Debugging the interpolation and image indexing was the most time-consuming and frustrating part of the STN implementation.

Next, I plan to train this spatial transformer module together with a CNN for handwritten digit classification on the cluttered, distorted MNIST dataset. The spatial transformer should increase the spatial invariance of the CNN and is therefore expected to give good classification results even when the digits are translated, rotated, or scaled.
