In a nutshell, ViTs cut the image into patches and perform stacks of self-attention on all the patches, leading to the final feature map, also of lower resolution
Search
In a nutshell, ViTs cut the image into patches and perform stacks of self-attention on all the patches, leading to the final feature map, also of lower resolution