ViT (Vision Transformer)

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Abstract

-Limited application of Transformer architecture to computer vision tasks compared to NLP
Traditionally, when utilized for computer vision, attention mechanisms were integrated with or served as replacements for specific parts of convolutional neural networks (CNNs)
-Objective of the Paper
Challenges the necessity of coupling Transformers with CNNs for image tasks
Purely Transformer-based approach, which processes sequences of image patches directly, can effectively handle image classification tasks without the need for convolutional operations
-ViT stages of training
(1)Pre-trained on large datasets
(2)Capabilities extended to various medium and small-sized image recognition benchmarks through transfer learning (see the fine-tuning sketch below)
-Performance of ViT
Performance of ViT is noteworthy, achieving excellent results that often surpass those of advanced CNNs while being more efficient in terms of computational resources required for training
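
As a rough, hand-written illustration of this pre-train-then-transfer recipe (not the authors' code), the sketch below loads a pre-trained ViT-Base/16 from torchvision and swaps its classification head for a smaller benchmark; the specific weight enum and `heads.head` attribute are assumptions about torchvision's current API (roughly >= 0.13).

```python
# Hedged sketch: transfer learning with a pre-trained ViT (assumes torchvision >= 0.13).
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load a ViT-Base/16 backbone pre-trained on a large dataset (here ImageNet-1k weights).
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for a smaller downstream benchmark, e.g. CIFAR-100.
num_classes = 100
model.heads.head = nn.Linear(model.heads.head.in_features, num_classes)

# Fine-tune with a standard supervised loop and a small learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def training_step(images, labels):
    """One supervised fine-tuning step on a batch of 224x224 images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```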

Introduction

NLP
-In the domain of NLP, the utilization of Transformers, which are based on self-attention mechanisms, remains predominant
-Process
Pre-trained on vast corpora of text, which is then followed by fine-tuning on smaller, task-specific datasets
-Appeal of Transformers
Structural advantages offer computational efficiency and scalability
Has even enabled the training of extraordinarily large models, some with over 100 billion parameters, marking a significant advancement in the field’s capabilities

Computer Vision
-Convolutional architectures remain dominant
-Approaches to use Transformer Architecture
(1)Try combining CNN-like architectures with self-attention
(2)Replacing the convolutions entirely
Problem:
These specialized attention patterns have not yet been scaled effectively on modern hardware accelerators (such as GPUs and TPUs), which are crucial for performing large-scale computations efficiently. Unlike convolutions, which have been highly optimized over the years for parallel processing on such hardware, attention mechanisms often involve more complex, irregular computations that do not map as efficiently onto these architectures

ViT Experiment Process
Streamlined Explanation of the process
Apply a standard Transformer directly to images with the fewest possible modifications
(1)Image Patching
:Divide the input image into smaller, fixed-size segments called patches, akin to cutting a large picture into a grid of smaller squares
-Patches are analogous to tokens in NLP, serving as the basic units for processing.
-Size of these patches is a crucial hyperparameter that influences the model's performance
-Example: splitting a 224x224 pixel image into 16x16 pixel patches results in 196 (14x14) patches (see the sketch below)
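
A quick sanity check of this patch-count arithmetic (my own illustration in plain Python):

```python
# Number of non-overlapping P x P patches in an H x W image.
def num_patches(height, width, patch_size):
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)

print(num_patches(224, 224, 16))  # 14 * 14 = 196 patches, each with 16*16*3 = 768 pixel values
```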
(2)Patch Embedding and Processing:
:Provide the sequence of linear embeddings of these patches as an input to a Transformer
(2)-1 Linear embeddings

Each patch is then flattened (converted from a 2D array of pixels to a 1D array) and passed through a linear layer (also known as a fully connected layer)
This step transforms the raw pixel values into learned embedding vectors better suited as Transformer inputs, which can come to encode features such as edges, textures, or more complex patterns (see the embedding sketch after this list)
(2)-2 Sequence Formation
The embeddings are treated as a sequence
Key aspect because Transformers, originally developed for NLP tasks, are designed to process sequential data
(2)-3 Input to a Transformer
Sequence of embeddings is fed into a Transformer model
Transformer uses self-attention mechanisms to weigh the importance of different patches relative to each other for the task at hand (e.g., image classification)
Can learn to focus more on the embeddings of patches that contain critical information (like the object of interest) and less on those that are less relevant (like the background)
(2)-4 Model Training
Model is trained on image classification tasks in a supervised manner, learning to accurately identify and categorize images based on the training data provided
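
A minimal PyTorch sketch of steps (1)-(2) above (my own illustration with illustrative sizes, not the paper's reference implementation): patches are cut out, flattened, linearly projected, and fed to a standard Transformer encoder.

```python
import torch
import torch.nn as nn

# Toy patch-embedding pipeline (illustrative sizes, not the paper's exact configuration).
image = torch.randn(1, 3, 224, 224)      # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# (1) Image patching: unfold into non-overlapping 16x16 patches -> (1, 196, 16*16*3).
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)   # (1, 196, 768)

# (2)-1 Linear embedding of each flattened patch.
to_embedding = nn.Linear(patch_size * patch_size * 3, embed_dim)
tokens = to_embedding(patches)                                          # (1, 196, 768)

# (2)-2 / (2)-3 The token sequence is processed by a standard Transformer encoder,
# whose self-attention weighs patches against each other.
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
out = encoder(tokens)                                                   # (1, 196, 768)
```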

ViT trained on mid-sized datasets
When Vision Transformer (ViT) models are trained on mid-sized datasets like ImageNet without strong regularization, they tend to achieve accuracies a few percentage points lower than ResNet models of comparable size
Transformers lack some of the inductive biases inherent to CNNs
-translation equivariance
If you shift the input (e.g., an image), the output shifts in the same way
CNNs naturally possess this trait due to the way convolutions are applied across an image, ensuring that the network’s perception of objects is consistent regardless of their position in the image
-locality
CNNs process images in a way that gives priority to local features before integrating them into a more global understanding
Achieved through the use of small filters that operate on local patches of an image, capturing details like edges and textures before later layers aggregate this information to recognize more complex patterns
-Because Transformers do not inherently have these biases, they may not generalize as effectively when trained on smaller amounts of data
CNNs, with their built-in biases towards translation equivariance and locality, might require less data to learn comparable generalizations
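
A small PyTorch check of the translation-equivariance property described above (my own illustration; circular padding is used so that shifting with `torch.roll` is exactly equivariant):

```python
import torch
import torch.nn as nn

# Translation equivariance: shifting the input shifts the convolution output the same way.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 1, 32, 32)

shifted_then_conv = conv(torch.roll(x, shifts=(5, 3), dims=(2, 3)))
conv_then_shifted = torch.roll(conv(x), shifts=(5, 3), dims=(2, 3))

print(torch.allclose(shifted_then_conv, conv_then_shifted, atol=1e-6))  # True
```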

ViT trained on larger datasets (14M-300M images)
Large-scale training trumps inductive bias
-Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer data points
-Example:
Pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks
Best model reaches an accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks

Naive application of self-attention to images would require that each pixel attends to every other pixel
Self-Attention Mechanism
In the context of image processing, the self-attention mechanism enables each pixel in an image to interact with every other pixel to capture global dependencies. This means that for any given pixel, the mechanism calculates its response by considering the features of all other pixels in the image
Quadratic Computational Cost
Quadratic Relationship:
The term “quadratic cost” refers to how the computation required for this process scales with the number of pixels in the image
If an image has N pixels, the computational cost for self-attention scales as $N^2$
This is because each of the N pixels needs to be compared with every other pixel, resulting in N×N Comparisons
Example to understand scaling:
-Small image of size 100x100 pixels
100 * 100 = 10,000 pixels
In terms of comparisons: $10,000^2 = 100,000,000$ (100 million comparisons)
-Larger, realistic image of size 1000x1000 pixels
1000 * 1000 = 1,000,000 pixels
In terms of comparisons: $1,000,000^2 = 1,000,000,000,000$ (1 trillion comparisons)
Why It’s Impractical for Large Images
Massive Computation:
As the image size increases, the number of computations required increases quadratically, quickly becoming impractical for real-world applications. Modern cameras and applications often use high-resolution images, where the number of pixels can easily reach into the millions. Processing such images with traditional self-attention mechanisms would require an enormous amount of computational resources, significantly slowing down the processing time and making it infeasible for applications that require real-time or near-real-time performance
Memory Constraints:
Beyond just the processing power, the quadratic increase in computations also implies a significant increase in memory requirements, as the attention scores for each pixel pair need to be stored temporarily. This can quickly exceed the memory capacity of even the most advanced hardware accelerators
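
A rough back-of-the-envelope script (my own, assuming float32 attention scores) that contrasts pixel-level attention with ViT-style patch-level attention, for both comparison count and attention-matrix memory:

```python
# Cost of naive self-attention: every token attends to every other token (N^2 pairs).
def attention_pairs(num_tokens):
    return num_tokens ** 2

pixels = 1000 * 1000                   # 1-megapixel image, one token per pixel
patches = (224 // 16) * (224 // 16)    # ViT-style: one token per 16x16 patch of a 224x224 image

print(f"pixel tokens: {attention_pairs(pixels):.2e} pairs")    # ~1e12 comparisons
print(f"patch tokens: {attention_pairs(patches):.2e} pairs")   # 196^2 = 38,416 comparisons

# Memory for the attention matrix alone (4 bytes per score), ignoring heads and batch size.
print(f"pixel attention matrix: {attention_pairs(pixels) * 4 / 1e12:.1f} TB")
print(f"patch attention matrix: {attention_pairs(patches) * 4 / 1e3:.1f} KB")
```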

Specialized Attention Architectures

:Demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators
Local Self-Attention
Idea:
Instead of having each pixel attend to every other pixel globally, some approaches apply self-attention within local neighborhoods
This reduces the computational burden significantly
Impact:
These local self-attention mechanisms can replace conventional convolutions in neural networks, demonstrating that they can effectively process images while reducing computational cost (see the windowed-attention sketch after this list)
Sparse Transformers
Idea: Employ scalable approximations to global self-attention, making it feasible to handle images
This approach selectively focuses on parts of the image, reducing the overall computation needed
Varying Block Sizes
Idea: Apply attention in blocks of varying sizes or, in some extreme cases, only along individual dimensions of the image (like width or height)
This method further scales down the computation by limiting the scope of attention
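
To make the local-attention idea concrete, here is a hedged sketch (my own simplification, not any specific paper's method; it assumes PyTorch 2.0+ for `scaled_dot_product_attention`) that restricts attention to non-overlapping windows of tokens, reducing the cost from $N^2$ to roughly $N \times W$ for window size $W$:

```python
import torch
import torch.nn.functional as F

def windowed_self_attention(x, window_size):
    """Self-attention restricted to non-overlapping windows along the sequence.

    x: (batch, seq_len, dim), with seq_len divisible by window_size.
    Cost per window is window_size^2, so total cost is about seq_len * window_size
    instead of seq_len^2 for global attention.
    """
    batch, seq_len, dim = x.shape
    windows = x.reshape(batch, seq_len // window_size, window_size, dim)
    # Q = K = V = the window itself (single head, no projections, for clarity).
    out = F.scaled_dot_product_attention(windows, windows, windows)
    return out.reshape(batch, seq_len, dim)

tokens = torch.randn(2, 196, 64)
local = windowed_self_attention(tokens, window_size=14)  # attention within 14-token windows
```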
Cordonnier et al. (2020) Model
Most closely related to the ViT work
-Patch Extraction:
Model processes images by first extracting small patches of size 2×2 pixels from the input image
Meaning the image is divided into blocks, each containing a 2×2 grid of pixels
-Full Self-Attention on Patches:
After extracting these patches, the model applies full self-attention within each patch
This allows each pixel within a patch to attend to every other pixel in the same patch, capturing local dependencies
-Limitation:
The use of a small patch size (2×2) limits the model’s applicability to small-resolution images
This is because larger images, when broken down into such small patches, would result in a vast number of patches, making the processing complex and potentially less effective at capturing more global image features
This Work’s Contribution
-Focus on Transformer Models:
Unlike the prior works that primarily focused on ResNet-based models (a specific architecture of CNNs), this research explores training Transformer models on these large-scale datasets. Transformers, originally designed for natural language processing tasks, have been adapted for image recognition tasks (e.g., through models like Vision Transformer, ViT)
-Datasets Used:
Specifies that the research focuses on ImageNet-21k and JFT-300M datasets for training Transformers
ImageNet-21k is an extension of the standard ImageNet dataset, containing over 14 million images and 21,000 categories
JFT-300M is an even larger dataset with approximately 300 million images and 18,000 categories
These datasets provide a much richer source of visual information, potentially enabling the trained models to achieve superior performance on image recognition tasks
-This work contributes to the growing body of research on image recognition by demonstrating the benefits of training image recognition models, specifically Transformers, on datasets much larger than the traditional ImageNet.
By leveraging these vast datasets, the study aims to achieve state-of-the-art results, further pushing the boundaries of what’s possible in image recognition. The focus on Transformers represents a novel approach in the context of large-scale image recognition, moving beyond the more commonly used CNN architectures to explore how different model architectures scale with dataset size and how they perform with transfer learning strategies

Method

Followed the original Transformer architecture
Advantage of this intentionally simple setup is that scalable NLP Transformer architectures (and their efficient implementations) can be used almost out of the box

ViT Structure
Model overview
(1)Split an image into fixed-size patches
(2)Linearly embed each of them
(3)Add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder
(4)Use the standard approach of adding an extra learnable “classification token” to the sequence in order to perform classification
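
Putting the four steps together, a compact and simplified sketch of the forward pass (my own illustrative hyperparameters, much smaller than the real ViT configurations):

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch embed + [class] token + position embeddings + encoder."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 embed_dim=192, depth=4, num_heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # (1)+(2): split into patches and linearly embed them in one strided convolution.
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # (4): learnable classification token prepended to the sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # (3): learnable position embeddings for the [class] token plus all patches.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images):                                     # images: (B, C, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)    # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed            # (B, N+1, D)
        x = self.encoder(x)
        return self.head(x[:, 0])                                  # classify from the [class] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))                    # (2, 1000)
```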

Conclusion

-What the paper mainly covers:
Applying the Transformer architecture directly to image recognition
-Unlike prior works that used self-attention in computer vision, image-specific inductive biases are not introduced into the architecture, apart from the initial patch extraction step
-How Transformer architecture’s powerful capabilities for processing sequential data in NLP can be adapted for image processing:
Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP
This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets
-Performance:
Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train
-Challenges:
(1)How to apply ViT to other computer vision tasks
Example: Detection and segmentation
-Promising results expected
DETR (DEtection TRansformer) also applies the Transformer architecture to computer vision tasks, specifically object detection
(2)Performance of self-supervised pre-training
Initial experiments show improvement from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training
-Expectations:
Further scaling of ViT would likely lead to improved performance

Footnote:
Attention mechanisms:
Allow models to focus on specific parts of the input data that are most relevant for the task at hand
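
For reference, the core computation behind such attention mechanisms, scaled dot-product attention, written out as a generic sketch (not tied to the paper's code):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    The softmax weights say how much each query position should 'focus' on
    each key position, which is what lets the model emphasize the most
    relevant parts of the input.
    """
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., queries, keys)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

q = k = v = torch.randn(1, 197, 64)                 # e.g. ViT patch tokens + [class] token
out = scaled_dot_product_attention(q, k, v)         # (1, 197, 64)
```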