Vision Transformer
Inductive bias
Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution. Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
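When fine-tuning at a higher resolution, the pre-trained position embeddings can be adjusted by treating them as a 2D grid and interpolating to the new grid size. A minimal sketch, assuming the [CLS] embedding is stored first and bicubic interpolation is used (the function name and tensor layout are illustrative, not taken from the reference code):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate learned position embeddings to a new patch-grid size.

    pos_embed: (1, 1 + old_h*old_w, D), with the [CLS] embedding at index 0.
    old_grid, new_grid: (height, width) of the patch grid before/after resizing.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    (old_h, old_w), (new_h, new_w) = old_grid, new_grid
    d = patch_pe.shape[-1]
    # (1, N, D) -> (1, D, old_h, old_w) so 2D interpolation can be applied
    patch_pe = patch_pe.reshape(1, old_h, old_w, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # back to (1, new_h*new_w, D) and re-attach the [CLS] embedding
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_h * new_w, d)
    return torch.cat([cls_pe, patch_pe], dim=1)
```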
Embedding
Patch Embedding
The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, an image $x\in \mathbb{R}^{H\times W\times C}$ is reshaped into a sequence of flattened 2D patches $x_p\in \mathbb{R}^{N\times (P^2\cdot C)}$, where $(P, P)$ is the resolution of each patch and $N=HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. A trainable linear projection maps each flattened patch to $D$ dimensions, the constant latent size used throughout the Transformer.
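For example, a $224\times 224\times 3$ image with $P=16$ gives $N=224\cdot 224/16^2=196$ patches, each flattened to $16\cdot 16\cdot 3=768$ values. A minimal sketch of the patchification and projection, assuming PyTorch's NCHW image layout (tensor names are illustrative):

```python
import torch
import torch.nn as nn

B, H, W, C, P, D = 2, 224, 224, 3, 16, 768
x = torch.randn(B, C, H, W)                  # a batch of images in NCHW layout

# cut into non-overlapping P x P patches and flatten each one
patches = x.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, P^2*C)

proj = nn.Linear(C * P * P, D)               # trainable linear projection to D dimensions
tokens = proj(patches)
print(tokens.shape)                          # torch.Size([2, 196, 768])
```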
Position Embedding
Standard learnable 1D position embeddings are added to the patch embeddings to retain positional information. Dropout is applied after every dense layer except for the qkv-projections, and directly after adding the position embeddings to the patch embeddings.
[CLS] Token
Similar to BERT's [CLS] token, a learnable embedding is prepended to the sequence of embedded patches ($z_0^0=x_{cls}$), whose state at the output of the Transformer encoder ($z_L^0$) serves as the image representation $y$. Both during pre-training and fine-tuning, a classification head is attached to $z_L^0$. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
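A minimal sketch of the two head variants, assuming a tanh hidden activation as in common implementations (function and argument names are illustrative):

```python
import torch.nn as nn

def make_head(d_model, num_classes, pretraining, hidden_dim=3072):
    """Pre-training: MLP with one hidden layer; fine-tuning: a single linear layer."""
    if pretraining:
        return nn.Sequential(
            nn.Linear(d_model, hidden_dim),
            nn.Tanh(),                        # hidden activation is an assumption here
            nn.Linear(hidden_dim, num_classes),
        )
    return nn.Linear(d_model, num_classes)
```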
A PyTorch sketch of the whole embedding layer is given below; the Conv2d patch projection and the default hyperparameters are implementation choices rather than details fixed by the paper.
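```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Patch projection + [CLS] token + learnable 1D position embeddings + dropout."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768, dropout=0.1):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a Conv2d with kernel = stride = patch size is equivalent to splitting the
        # image into patches, flattening them, and applying a linear projection
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (B, C, H, W)
        B = x.shape[0]
        x = self.proj(x)                         # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)   # (B, 1, D)
        x = torch.cat([cls, x], dim=1)           # prepend [CLS]
        x = self.dropout(x + self.pos_embed)     # dropout after adding position embeddings
        return x
```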
Encoder
The Transformer encoder in ViT differs slightly from the original Transformer: LayerNorm is performed before each sub-layer (pre-norm) instead of after, and the non-linearity in the MLP is GELU.
ViT PyTorch Implementation
Code for the other modules can be adapted from this blog.
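A minimal sketch of a single pre-norm encoder block with a GELU MLP, assuming ViT-Base hyperparameters as defaults (module names and defaults are illustrative, not the reference implementation):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: pre-LayerNorm, multi-head self-attention, GELU MLP."""

    def __init__(self, d_model=768, num_heads=12, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),                            # GELU non-linearity in the MLP
            nn.Dropout(dropout),
            nn.Linear(mlp_ratio * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):                         # x: (B, 1 + N, D)
        # LayerNorm before each sub-layer, residual connection after it
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

The full encoder simply stacks $L$ such blocks and applies a final LayerNorm before $z_L^0$ is read out as the image representation.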