arXiv link, Cite.GG link
TL;DR
- Is the attention layer even necessary?
- The authors replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension (see the sketch after this list).
- A ViT/DeiT-Base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively.
- These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought.
- For the Base model, this removes roughly 24M parameters (see the back-of-the-envelope check after this list).
- That said, feed-forward layers are not exactly cheap either, so my guess is that training speed itself won't change much.
- And the patch-mixing FF is still an O(N^2) operation in the number of patches, just like attention.
- On the other hand, the model does become much simpler.
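
As a mental model of the swap described above (not the paper's exact code), here is a minimal PyTorch sketch: a pre-norm transformer block in which the self-attention sublayer is replaced by a feed-forward layer acting over the patch/token axis. The class names, the expansion factors, and the token count of 197 (196 patches + [CLS]) are my own illustrative choices.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Standard 2-layer MLP with GELU, applied over the last dimension."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class FFOnlyBlock(nn.Module):
    """Transformer block with the attention sublayer replaced by a
    feed-forward layer applied over the patch (token) dimension.

    The expansion factors below are illustrative, not necessarily the
    paper's exact hyperparameters."""
    def __init__(self, dim=768, num_tokens=197, token_mult=4, channel_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Mixes information *across* patches: linear layers act on the token axis.
        self.token_ff = FeedForward(num_tokens, num_tokens * token_mult)
        self.norm2 = nn.LayerNorm(dim)
        # Usual per-token MLP over the channel axis, unchanged from ViT.
        self.channel_ff = FeedForward(dim, dim * channel_mult)

    def forward(self, x):                        # x: (batch, num_tokens, dim)
        # Transpose so the token-mixing FF sees the patch dimension last.
        y = self.norm1(x).transpose(1, 2)        # (batch, dim, num_tokens)
        x = x + self.token_ff(y).transpose(1, 2) # back to (batch, num_tokens, dim)
        x = x + self.channel_ff(self.norm2(x))
        return x


if __name__ == "__main__":
    block = FFOnlyBlock()
    tokens = torch.randn(2, 197, 768)            # 196 patches + [CLS]
    print(block(tokens).shape)                   # torch.Size([2, 197, 768])
```

Note that the token-mixing weights grow with the number of tokens (roughly N x 4N per layer here), so both the parameters and the FLOPs of that sublayer scale quadratically in N, which is the O(N^2) point in the list above.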
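
And a rough back-of-the-envelope check for the ~24M figure, assuming standard ViT-Base dimensions (d=768, 12 layers) and, purely as a guess, an expansion factor of 4 for the patch-mixing FF; the paper's exact widths may differ.

```python
# Back-of-the-envelope parameter count (assumed ViT-Base dims, not taken from the paper).
d, layers, n_tokens = 768, 12, 197

# Multi-head self-attention per layer: Q/K/V projections + output projection (with biases).
attn_per_layer = 4 * d * d + 4 * d
attn_total = layers * attn_per_layer                   # ~28.3M removed

# Patch-mixing FF per layer: Linear(N -> 4N) + Linear(4N -> N), with biases.
hidden = 4 * n_tokens
token_ff_per_layer = n_tokens * hidden + hidden + hidden * n_tokens + n_tokens
token_ff_total = layers * token_ff_per_layer           # ~3.7M added

print(f"removed by dropping attention: {attn_total / 1e6:.1f}M")
print(f"added by the patch-mixing FF:  {token_ff_total / 1e6:.1f}M")
print(f"net reduction:                 {(attn_total - token_ff_total) / 1e6:.1f}M")
# Under these assumptions the net drop is ~24.6M, consistent with the ~24M noted above.
```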