Vision Transformer
Inductive bias
Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution. Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.
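When fine-tuning at a higher resolution, the pre-trained position embeddings can be adjusted by treating them as a 2D grid and interpolating to the new grid size. A minimal sketch, assuming the [CLS] embedding is stored first and bicubic interpolation is used (the function name and tensor layout are illustrative, not taken from the reference code):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate learned position embeddings to a new patch-grid size.

    pos_embed: (1, 1 + old_h*old_w, D), with the [CLS] embedding at index 0.
    old_grid, new_grid: (height, width) of the patch grid before/after resizing.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    (old_h, old_w), (new_h, new_w) = old_grid, new_grid
    d = patch_pe.shape[-1]
    # (1, N, D) -> (1, D, old_h, old_w) so 2D interpolation can be applied
    patch_pe = patch_pe.reshape(1, old_h, old_w, d).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # back to (1, new_h*new_w, D) and re-attach the [CLS] embedding
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_h * new_w, d)
    return torch.cat([cls_pe, patch_pe], dim=1)
```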
Embedding
Patch Embedding
The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, an image $x\in \mathbb{R}^{H\times W\times C}$ is reshaped into a sequence of flattened 2D patches $x_p\in \mathbb{R}^{N\times (P^2\cdot C)}$, where $(P, P)$ is the resolution of each patch and $N=HW/P^2$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. A trainable linear projection maps each flattened patch to $D$ dimensions, the constant latent size used throughout the Transformer.
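For example, a $224\times 224\times 3$ image with $P=16$ gives $N=224\cdot 224/16^2=196$ patches, each flattened to $16\cdot 16\cdot 3=768$ values. A minimal sketch of the patchification and projection, assuming PyTorch's NCHW image layout (tensor names are illustrative):

```python
import torch
import torch.nn as nn

B, H, W, C, P, D = 2, 224, 224, 3, 16, 768
x = torch.randn(B, C, H, W)                  # a batch of images in NCHW layout

# cut into non-overlapping P x P patches and flatten each one
patches = x.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, P^2*C)

proj = nn.Linear(C * P * P, D)               # trainable linear projection to D dimensions
tokens = proj(patches)
print(tokens.shape)                          # torch.Size([2, 196, 768])
```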
Position Embedding
Standard learnable 1D position embeddings are added to the patch embeddings to retain positional information. Dropout is applied after every dense layer except for the qkv-projections, and directly after adding the position embeddings to the patch embeddings.
[CLS] Token
Similar to BERT's [CLS] token, a learnable embedding is prepended to the sequence of embedded patches ($z_0^0=x_{cls}$), whose state at the output of the Transformer encoder ($z_L^0$) serves as the image representation $y$. Both during pre-training and fine-tuning, a classification head is attached to $z_L^0$. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
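A minimal sketch of the two head variants, assuming a tanh hidden activation as in common implementations (function and argument names are illustrative):

```python
import torch.nn as nn

def make_head(d_model, num_classes, pretraining, hidden_dim=3072):
    """Pre-training: MLP with one hidden layer; fine-tuning: a single linear layer."""
    if pretraining:
        return nn.Sequential(
            nn.Linear(d_model, hidden_dim),
            nn.Tanh(),                        # hidden activation is an assumption here
            nn.Linear(hidden_dim, num_classes),
        )
    return nn.Linear(d_model, num_classes)
```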
A PyTorch sketch of the whole embedding layer is given below; the Conv2d patch projection and the default hyperparameters are implementation choices rather than details fixed by the paper.
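```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Patch projection + [CLS] token + learnable 1D position embeddings + dropout."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768, dropout=0.1):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a Conv2d with kernel = stride = patch size is equivalent to splitting the
        # image into patches, flattening them, and applying a linear projection
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (B, C, H, W)
        B = x.shape[0]
        x = self.proj(x)                         # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)   # (B, 1, D)
        x = torch.cat([cls, x], dim=1)           # prepend [CLS]
        x = self.dropout(x + self.pos_embed)     # dropout after adding position embeddings
        return x
```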
Encoder
The Transformer encoder in ViT differs slightly from the original Transformer: LayerNorm is performed before each sub-layer (pre-norm) instead of after, and the non-linearity in the MLP is GELU.
ViT PyTorch Implementation
Code for the other modules can be adapted from this blog.
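A minimal sketch of a single pre-norm encoder block with a GELU MLP, assuming ViT-Base hyperparameters as defaults (module names and defaults are illustrative, not the reference implementation):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: pre-LayerNorm, multi-head self-attention, GELU MLP."""

    def __init__(self, d_model=768, num_heads=12, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),                            # GELU non-linearity in the MLP
            nn.Dropout(dropout),
            nn.Linear(mlp_ratio * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):                         # x: (B, 1 + N, D)
        # LayerNorm before each sub-layer, residual connection after it
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

The full encoder simply stacks $L$ such blocks and applies a final LayerNorm before $z_L^0$ is read out as the image representation.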