The first model I implemented as part of my Google Summer of Code project was a Spatial Transformer Network. A Spatial Transformer Network (STN) is a learnable module that can be placed inside a Convolutional Neural Network (CNN) to add spatial invariance. Invariance is the ability of a model to recognize and identify features even when the input is transformed or slightly modified; spatial invariance refers specifically to invariance under spatial transformations of an image such as rotation, translation, and scaling.

Spatial Transformers can be placed inside a CNN to accomplish a variety of tasks, image classification being one example. Suppose the task is to classify handwritten digits, where the position, size, and orientation of the digits vary significantly from sample to sample. A Spatial Transformer extracts, transforms, and scales the region of interest in each sample, after which the CNN can carry out the classification.

The Spatial Transformer Network consists of 3 main components:

(i) Localization Network: This network takes as input a 4D tensor representing a batch of images (width × height × channels × batch size). It is a simple neural network with a few convolutional layers and a few dense layers, and it predicts the transformation parameters as output. These parameters determine the angle by which the input must be rotated, the amount of translation, and the scale factor required to focus on the region of interest in the input feature map.

(ii) Sampling Grid Generator: For each image in the batch, the transformation parameters predicted by the localization network are applied in the form of a 2×3 affine transformation matrix. An affine transformation is a transformation that preserves points, lines, and planes; in particular, parallel lines remain parallel after an affine transformation. Rotation, scaling, and translation are all affine transformations. Writing $T_\theta$ for the affine transformation and $A_\theta$ for the matrix representing it, each target grid coordinate $(x^t, y^t)$ is mapped to a source coordinate $(x^s, y^s)$:

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = T_\theta\begin{pmatrix} x^t \\ y^t \end{pmatrix} = A_\theta \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix}$$

Here $\theta_{11}, \theta_{12}, \theta_{21}, \theta_{22}$ determine the angle of image rotation, while $\theta_{13}$ and $\theta_{23}$ determine the translation of the image along the width and height respectively. Applying this mapping to every coordinate gives a sampling grid of transformed indices.

(iii) Bilinear interpolation on the transformed indices: The image's indices and axes have now been affine transformed, so its pixels have shifted, and the transformed coordinates are in general not integers. For example, the point (1, 1) becomes (√2, 0) after a 45 degree counterclockwise rotation of the axes. To find the pixel value at a transformed point $(x, y)$, we therefore perform bilinear interpolation using the four closest pixels, $(\lfloor x \rfloor, \lfloor y \rfloor)$, $(\lfloor x \rfloor, \lceil y \rceil)$, $(\lceil x \rceil, \lfloor y \rfloor)$, and $(\lceil x \rceil, \lceil y \rceil)$, where $\lfloor \cdot \rfloor$ is the floor (greatest integer) function and $\lceil \cdot \rceil$ is the ceiling function. Linear interpolation is done in both the x and y directions, and this step returns the fully transformed image with the appropriate pixel value at each transformed index.

Minimal Julia sketches of each of these three components follow below. The code for a pure Julia implementation of the Spatial Transformer Network can be found here: https://github.com/thebhatman/Spatial-Transformer-Network/blob/master/src/stn.jl.
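To make (i) concrete, here is a minimal localization network in Flux. The layer sizes are assumptions chosen for 28×28 grayscale inputs, not the exact architecture in the linked repository; the essential point is only that the final dense layer regresses the six affine parameters.

```julia
using Flux

# Minimal localization network for 28×28×1×N input batches (WHCN layout).
# Sizes are illustrative; the final Dense layer outputs the six affine
# parameters θ11, θ12, θ13, θ21, θ22, θ23 for each image.
loc_net = Chain(
    Conv((5, 5), 1 => 8, relu),   # 28×28 -> 24×24, 8 channels
    MaxPool((2, 2)),              # -> 12×12
    Conv((5, 5), 8 => 10, relu),  # -> 8×8, 10 channels
    MaxPool((2, 2)),              # -> 4×4
    Flux.flatten,                 # -> 160-vector per image
    Dense(160, 32, relu),
    Dense(32, 6),                 # six transformation parameters per image
)

θ = loc_net(rand(Float32, 28, 28, 1, 16))   # 6×16: parameters for a batch of 16
```

In practice the last layer is commonly initialized so that the predicted parameters start at the identity transform (1, 0, 0, 0, 1, 0), letting the module begin as a no-op and learn transformations gradually.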
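The grid generator of (ii) can be written in plain Julia. The sketch below assumes normalized target coordinates in [-1, 1], as in the STN paper; the function name and conventions are mine, not necessarily those of the repository.

```julia
# For every target pixel (xt, yt), compute the source location
# A_θ * [xt, yt, 1] to sample from. Coordinates are normalized to [-1, 1].
function sampling_grid(θ::AbstractVector, H::Int, W::Int)
    # θ = [θ11, θ12, θ13, θ21, θ22, θ23]
    A = [θ[1] θ[2] θ[3];
         θ[4] θ[5] θ[6]]
    grid = Array{Float64}(undef, 2, H, W)
    for (j, xt) in enumerate(range(-1, 1; length = W)),
        (i, yt) in enumerate(range(-1, 1; length = H))
        grid[:, i, j] = A * [xt, yt, 1.0]   # source (xˢ, yˢ) for target (xt, yt)
    end
    return grid
end

# A pure rotation by φ = 45° applied to the sampling grid:
φ = π / 4
grid = sampling_grid([cos(φ), -sin(φ), 0, sin(φ), cos(φ), 0], 28, 28)
```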
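Finally, a sketch of the bilinear sampler of (iii). It works on a single channel in pixel coordinates and clamps at the image border; again, names and details are illustrative rather than the repository's exact code.

```julia
# Read an image at a fractional location (x, y) by blending the four
# surrounding pixels, weighting each by its proximity to (x, y).
function bilinear(img::AbstractMatrix, x::Real, y::Real)
    H, W = size(img)
    x1 = clamp(floor(Int, x), 1, W);  x2 = clamp(x1 + 1, 1, W)
    y1 = clamp(floor(Int, y), 1, H);  y2 = clamp(y1 + 1, 1, H)
    dx, dy = x - floor(x), y - floor(y)          # fractional offsets in [0, 1)
    return img[y1, x1] * (1 - dx) * (1 - dy) +   # top-left neighbour
           img[y1, x2] * dx * (1 - dy) +         # top-right neighbour
           img[y2, x1] * (1 - dx) * dy +         # bottom-left neighbour
           img[y2, x2] * dx * dy                 # bottom-right neighbour
end
```

In a full STN the normalized grid coordinates from the previous step are first rescaled to pixel indices before being passed to a sampler like this one.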
I tested the functionality of my spatial transformer module on some images. Below are some example outputs of the transformation function; in each pair, the image on the left is the input to the transformer module and the image on the right is the output.

It is clear from these examples that the spatial transformer module is capable of performing any type of affine transformation. During the implementation, I spent a lot of time understanding how array reshape, permutedims, and concatenation work, because it was difficult to debug how pixels and indices moved around when I used these functions; debugging the interpolation and image indexing was the most time-consuming and frustrating part of the STN implementation. Next, I plan to train this spatial transformer module together with a CNN for handwritten digit classification on a cluttered and distorted MNIST dataset. The spatial transformer should increase the spatial invariance of the CNN and is therefore expected to give good classification results even when the digits are translated, rotated, or scaled.
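As a closing aside on the reshape versus permutedims point above, a toy example (not from the repository) shows why the two are easy to confuse: reshape reinterprets the same column-major memory under a new shape, while permutedims actually moves elements.

```julia
x = reshape(1:6, 2, 3)     # 2×3, filled column-major: [1 3 5; 2 4 6]

reshape(x, 3, 2)           # same memory, new shape:   [1 4; 2 5; 3 6]
permutedims(x, (2, 1))     # a true transpose:         [1 2; 3 4; 5 6]
```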