Differential Transformer


Tags
NLP
Paper Review
PLM
Published
October 10, 2024

ArXiv Link
https://arxiv.org/abs/2410.05258

TL;DR

  • Q → Q1, Q2.
  • K → K1, K2.
  • V → V.
  • SoftMax → softmax(Q1K1ᵀ/√d) − λ·softmax(Q2K2ᵀ/√d), with a learnable scalar λ (see the sketch below).
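
A minimal single-head PyTorch sketch of this differential attention (my own naming and simplifications: λ is kept as a plain learnable scalar here, while the paper re-parameterizes λ and adds per-head normalization and a multi-head structure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    """Single-head differential attention sketch.

    Q and K are projected to twice the head dimension and split into
    (Q1, Q2) and (K1, K2); the two softmax maps are subtracted with a
    learnable scalar lambda before being applied to a single V.
    """
    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        self.w_q = nn.Linear(d_model, 2 * d_head, bias=False)  # -> Q1, Q2
        self.w_k = nn.Linear(d_model, 2 * d_head, bias=False)  # -> K1, K2
        self.w_v = nn.Linear(d_model, d_head, bias=False)      # V stays single
        self.lam = nn.Parameter(torch.tensor(lambda_init))     # simplified lambda
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q1, q2 = self.w_q(x).chunk(2, dim=-1)
        k1, k2 = self.w_k(x).chunk(2, dim=-1)
        v = self.w_v(x)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        return (a1 - self.lam * a2) @ v  # differential attention map applied to V
```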

Motivation

notion image
  • SNR (Signal-to-Noise Ratio)
  • Noise Cancelling

Questions on HF

Q: If attention 1 learns S + N_1 and attention 2 learns S + N_2 (where S is the signal and N_1, N_2 are different noises), doesn't subtracting the two cancel the signal S while turning the noise into N_1 - N_2, which could be even more complicated?
A: The model knows what is signal and what is noise. Notice that attention_1 and attention_2 are both computed with learnable parameters, so they can "perceive" each other during training and adjust to each other to reach a lower loss.

Q: Exciting work! Do the authors plan to release model weights on hugging face?
A: No. ("No, we won't release model weights on hugging face." from https://huggingface.co/papers/2410.05258#67063cd7dd0bb3a03ba1f0b2)
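
A toy numeric illustration of the first answer above (my own example, not from the paper or the thread): if the second map ends up modeling mostly the shared noise component, subtraction keeps the signal rather than cancelling it.

```python
import torch

# Toy attention rows over 4 keys: the "signal" attends to key 0,
# the "noise" is a shared uniform component over all keys.
signal = torch.tensor([1.0, 0.0, 0.0, 0.0])
noise = torch.tensor([0.25, 0.25, 0.25, 0.25])

a1 = 0.6 * signal + 0.4 * noise  # map 1: signal + common noise
a2 = noise                       # map 2: learns (mostly) the common noise
lam = 0.4

print(a1 - lam * a2)  # tensor([0.6, 0., 0., 0.]): signal kept, shared noise removed
```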

My Idea

Question: Why do we need 2x the params for Q (Q1, Q2) and K (K1, K2)? → How about using LoRA?
Idea: Instead of learning separate Q1/Q2 and K1/K2, reuse the pretrained model's Q/K/V projections and add LoRA adapters on Q and K → treat Q as Q1 and Q + LoRA(Q) as Q2 (likewise for K); see the sketch below.
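
A rough sketch of this idea (my own, hypothetical and untested): keep the frozen pretrained projection as Q1 and derive Q2 = Q1 + LoRA(x), so the second query (and likewise key) projection costs only rank-r adapter parameters instead of a full duplicate matrix.

```python
import torch
import torch.nn as nn

class LoRADiffProjection(nn.Module):
    """Q1 = frozen pretrained projection; Q2 = Q1 + low-rank LoRA update."""
    def __init__(self, base_proj: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_proj
        for p in self.base.parameters():
            p.requires_grad_(False)            # keep pretrained weights frozen
        d_in, d_out = base_proj.in_features, base_proj.out_features
        self.lora_a = nn.Linear(d_in, rank, bias=False)
        self.lora_b = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start with Q2 == Q1
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor):
        q1 = self.base(x)                                     # Q1
        q2 = q1 + self.scaling * self.lora_b(self.lora_a(x))  # Q2 = Q1 + LoRA(x)
        return q1, q2
```

The same wrapper would be applied to the pretrained key projection to get K1/K2, while V keeps a single pretrained projection.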

Implementation

[repo is still private]