Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Published May 24, 2021

arXiv link, Cite.GG link


  • Is the attention layer even necessary?
  • The paper replaces the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension.
  • A ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively.
  • These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought.
  • For the Base size, this removes about 24M parameters.
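  • That said, FF layers are fairly slow to compute in the first place, so I suspect training speed itself won't differ much.
    • And the FF over patches is an O(N^2) operation in the number of patches N, just like attention.
  • The model does become much simpler, though.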
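
To make the notes above concrete, here is a rough sketch of what "a feed-forward layer applied over the patch dimension" could look like, plus a back-of-the-envelope count of parameters and multiply-adds. Everything specific here is an assumption for illustration, not taken from these notes: the helper names, N = 197 tokens (196 patches plus a class token), a hidden width of 4N, bias terms on every linear layer, and 12 blocks of width d = 768 for the Base size.

```python
import math


def gelu(v):
    # Exact GELU, written with the Gaussian error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(vec, weight, bias):
    # weight has shape (out, in); returns weight @ vec + bias.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def patch_feed_forward(x, w1, b1, w2, b2):
    """Mix information across patches, one channel at a time.

    x is a list of N patch vectors, each with d channels.
    w1 has shape (hidden, N) and w2 has shape (N, hidden), so the
    weights act on the patch axis rather than the channel axis.
    """
    columns = list(zip(*x))  # d vectors of length N
    mixed = [linear([gelu(h) for h in linear(col, w1, b1)], w2, b2)
             for col in columns]
    return [list(row) for row in zip(*mixed)]  # back to N x d


def attention_params(d):
    # Q, K, V and output projections: four d x d matrices with biases.
    return 4 * (d * d + d)


def patch_ff_params(n, ratio=4):
    # Assumed hidden width ratio * n (hypothetical default of 4).
    hidden = ratio * n
    return (n * hidden + hidden) + (hidden * n + n)


def attention_mixing_macs(n, d):
    # Only the token-mixing part of attention: QK^T and the weighted sum of V.
    return 2 * n * n * d


def patch_ff_macs(n, d, ratio=4):
    # Two matrix multiplies over the patch axis, repeated for every channel.
    return 2 * n * (ratio * n) * d


if __name__ == "__main__":
    layers, d, n = 12, 768, 197  # assumed Base-sized configuration
    saved = layers * (attention_params(d) - patch_ff_params(n))
    print(f"parameters saved: {saved / 1e6:.1f}M")
    print("attention mixing MACs per block:", attention_mixing_macs(n, d))
    print("patch FF MACs per block:", patch_ff_macs(n, d))
```

Under these assumptions the saving works out to roughly 24.6M parameters, in line with the "about 24M" figure noted above. Both mixing costs grow with N^2, which matches the guess that swapping attention for an FF layer may not change training speed much.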