• Post category: StudyBullet-14
  • Reading time: 7 mins read


Transformers in Computer Vision – English version

What you will learn

What are transformer networks?

State of the Art architectures for CV Apps like Image Classification, Semantic Segmentation, Object Detection and Video Processing

Practical application of SoTA architectures like ViT, DETR, SWIN in Huggingface vision transformers

Attention mechanisms as a general Deep Learning idea

Inductive Bias and the landscape of DL models in terms of modeling assumptions

Transformers application in NLP and Machine Translation

Transformers in Computer Vision

Different types of attention in Computer Vision

Description

Transformer networks are the new trend in deep learning today. Transformer models have taken the world of NLP by storm since 2017, and have since become the mainstream model in almost all NLP tasks. Transformers in CV still lag behind, but they have been taking over since 2020.

We will start by introducing attention and transformer networks. Since transformers were first introduced in NLP, they are easier to describe with an NLP example first. From there, we will cover the pros and cons of this architecture. We will also discuss the importance of unsupervised or semi-supervised pre-training for transformer architectures, briefly covering Large Language Models (LLMs) such as BERT and GPT.

This paves the way to introduce transformers in CV. Here we will extend the attention idea into the 2D spatial domain of the image. We will discuss how convolution can be generalized using self-attention within the encoder-decoder meta-architecture, and see how this generic architecture is almost the same for images as for text and NLP, which makes transformers generic function approximators. We will cover channel and spatial attention, and local vs. global attention, among other topics.
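To make the self-attention idea concrete, here is a minimal NumPy sketch (all names and sizes are illustrative, not taken from the course material): a small image feature map is flattened into a sequence of spatial tokens, and each token attends to every other token via scaled dot-product self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) pairwise similarities
    A = softmax(scores, axis=-1)      # each row is a distribution over all tokens
    return A @ V                      # every output token is a weighted mix of values

# Treat a 4x4 feature map with 8 channels as a sequence of 16 spatial tokens:
rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 8)).reshape(16, 8)
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))
Wv = rng.normal(size=(8, 8))
out = self_attention(feat, Wq, Wk, Wv)
print(out.shape)  # (16, 8)
```

Unlike a convolution, whose receptive field is fixed and local, every spatial token here can attend to every other token — which is the sense in which self-attention generalizes convolution to global context.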




In the next three modules, we will discuss the specific networks that solve the big problems in CV: classification, object detection and segmentation. We will cover the Vision Transformer (ViT) from Google, the Shifted Window Transformer (SWIN) from Microsoft, the Detection Transformer (DETR) from Facebook research, the Segmentation Transformer (SETR) and many others. Then we will discuss the application of transformers to video processing through spatio-temporal transformers, with application to moving object detection, along with a multi-task learning setup.
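As a flavor of how ViT turns an image into a token sequence, here is a hedged NumPy sketch of ViT-style patch extraction (the 16x16 patch size and 224x224 input mirror the ViT-Base defaults; the `patchify` helper itself is illustrative, not the course's code):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping flattened patches (ViT-style tokens)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    return (image
            .reshape(H // patch, patch, W // patch, patch, C)
            .transpose(0, 2, 1, 3, 4)          # group the two patch-grid axes together
            .reshape(-1, patch * patch * C))   # one row per patch "token"

img = np.zeros((224, 224, 3))
tokens = patchify(img)   # 14 * 14 = 196 tokens, each of dimension 16 * 16 * 3 = 768
print(tokens.shape)      # (196, 768)
```

Each flattened patch is then linearly projected and fed to a standard transformer encoder, exactly as word embeddings are in NLP.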

Finally, we will show how these pre-trained architectures can easily be applied in practice using the popular Huggingface library via its Pipeline interface.
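As a taste of that interface, the following minimal sketch assumes the `transformers` library is installed; `google/vit-base-patch16-224` is the public ViT-Base checkpoint on the Huggingface Hub, and the image URL is the sample image commonly used in the Huggingface docs.

```python
from transformers import pipeline

# Image classification with a pre-trained Vision Transformer from the Huggingface Hub.
# "google/vit-base-patch16-224" is the public ViT-Base checkpoint (16x16 patches, 224px input).
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# The pipeline accepts a local path, a URL, or a PIL image:
preds = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
for p in preds:
    print(p["label"], round(p["score"], 3))
```

The same one-liner pattern works for other vision tasks by changing the task name and checkpoint (e.g. `"object-detection"` with a DETR checkpoint).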

Language: English

Content

Introduction

Introduction

Overview of Transformer Networks

The Rise of Transformers
Inductive Bias in Deep Neural Network Models
Attention is a General DL idea
Attention in NLP
Attention is ALL you need
Self Attention Mechanisms
Self Attention Matrix Equations
Multihead Attention
Encoder-Decoder Attention
Transformers Pros and Cons
Unsupervised Pre-training

Transformers in Computer Vision

Module roadmap
Encoder-Decoder Design Pattern
Convolutional Encoders
Self Attention vs. Convolution
Spatial vs. Channel vs. Temporal Attention
Generalization of self attention equations
Local vs. Global Attention
Pros and Cons of Attention in CV

Transformers in Image Classification

Transformers in image classification
Vision Transformers (ViT and DeiT)
Shifted Window Transformers (SWIN)

Transformers in Object Detection

Transformers in Object detection
Object Detection methods review
Object Detection with ConvNet – YOLO
DEtection TRansformers (DETR)
DETR vs. YOLOv5 use case

Transformers in Semantic Segmentation

Module roadmap
Image Segmentation using ConvNets
Image Segmentation using Transformers

Spatio-Temporal Transformers

Spatio-Temporal Transformers – Moving Object Detection and Multi-task Learning

Huggingface Vision Transformers

Module roadmap
Huggingface Pipeline overview
Huggingface vision transformers
Huggingface Demo using Gradio

Conclusion

Course conclusion

Material

Slides