Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Tags

TLDR, Paper Review

CV

Published May 24, 2021

arXiv link, Cite.GG link

The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary?

https://arxiv.org/abs/2105.02723

https://cite.gg/#/paper?q=https%3A%2F%2Farxiv.org%2Fabs%2F2105.02723

TL;DR

Is the attention layer even necessary?

We replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension.

A ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively.

These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought.
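
To make the "feed-forward layer applied over the patch dimension" idea concrete, here is a minimal NumPy sketch of one such block. It is my own illustration, not the paper's code: the names `init_block` and `ff_only_block`, the pre-norm placement, the patch-then-feature ordering, the expansion ratio of 4, and the GELU approximation are all assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalise over the last (feature) axis; learned scale/shift omitted for brevity
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_mlp(rng, dim, ratio=4):
    # two-layer MLP: dim -> ratio * dim -> dim
    hidden = dim * ratio
    return {
        "w1": rng.normal(0.0, 0.02, (dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, (hidden, dim)), "b2": np.zeros(dim),
    }

def mlp(p, x):
    # acts on the last axis of x
    return gelu(x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def init_block(rng, num_patches, dim):
    # hypothetical helper: one MLP over patches, one over features
    return {
        "patch_mlp": init_mlp(rng, num_patches),
        "feature_mlp": init_mlp(rng, dim),
    }

def ff_only_block(params, x):
    # x has shape (batch, num_patches, dim)
    # 1) where attention used to be: an MLP that mixes information across patches
    y = np.swapaxes(layer_norm(x), 1, 2)   # (batch, dim, num_patches)
    y = mlp(params["patch_mlp"], y)        # MLP over the patch axis
    x = x + np.swapaxes(y, 1, 2)           # back to (batch, num_patches, dim)
    # 2) the usual transformer feed-forward layer over the feature axis
    x = x + mlp(params["feature_mlp"], layer_norm(x))
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = init_block(rng, num_patches=16, dim=32)
    tokens = rng.normal(size=(2, 16, 32))
    print(ff_only_block(block, tokens).shape)
```

The only step that lets patches exchange information is the transposed MLP, which is what takes the place of attention. Note that its weight matrix is tied to a fixed number of patches, unlike attention.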

For the Base-sized model, the parameter count drops by roughly 24M.
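
A rough back-of-the-envelope check of that figure, as a sketch under my own assumptions rather than numbers taken from the paper: 12 blocks, width 768, 197 tokens (196 patches plus a class token), an expansion ratio of 4 for the patch MLP, and normalization-layer parameters ignored.

```python
def linear_params(n_in, n_out):
    # weights plus bias of one fully connected layer
    return n_in * n_out + n_out

def attention_params(dim):
    # fused q/k/v projection plus the output projection
    return linear_params(dim, 3 * dim) + linear_params(dim, dim)

def patch_ff_params(num_tokens, ratio=4):
    # two-layer MLP over the patch axis: num_tokens -> ratio * num_tokens -> num_tokens
    hidden = num_tokens * ratio
    return linear_params(num_tokens, hidden) + linear_params(hidden, num_tokens)

# Assumed Base-sized configuration (not quoted from the paper)
DEPTH, DIM, NUM_TOKENS = 12, 768, 197

removed = DEPTH * attention_params(DIM)
added = DEPTH * patch_ff_params(NUM_TOKENS)

print(f"attention params removed: {removed / 1e6:.1f}M")
print(f"patch-FF params added:    {added / 1e6:.1f}M")
print(f"net reduction:            {(removed - added) / 1e6:.1f}M")
```

Under these assumptions the net reduction comes out to about 24.6M, which is consistent with the roughly 24M difference noted above.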

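That said, feed-forward layers are fairly slow to compute in the first place, so my guess is that training speed itself would not differ much.

The feed-forward layer over patches is also an O(N^2) operation in the number of patches N, just like attention.

In exchange, the model does become much simpler.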
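
The O(N^2) remark can be checked by counting multiply-accumulate operations per block. This is only a sketch: the function names are mine, the patch MLP is assumed to use an expansion ratio of 4, and softmax, normalization, biases and activations are ignored.

```python
def attention_macs(n, d):
    # n tokens, feature width d
    projections = 4 * n * d * d   # q, k, v and output projections
    scores = n * n * d            # Q @ K^T
    mixing = n * n * d            # softmax(scores) @ V
    return projections + scores + mixing

def patch_ff_macs(n, d, ratio=4):
    # the same MLP over the patch axis is applied to each of the d feature channels
    hidden = n * ratio
    return d * n * hidden + d * hidden * n

D = 768  # assumed Base-sized width
for n in (197, 394, 788):
    print(n, attention_macs(n, D), patch_ff_macs(n, D))
```

In this count the patch MLP costs 2 * ratio * N^2 * d, so doubling N quadruples it, while attention has the same kind of N^2 * d terms plus projection terms that grow only linearly in N. Operation counts alone do not settle the training-speed question, since actual throughput depends on the hardware and implementation.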