ArXiv Link
TL;DR
- Q → Q1, Q2.
- K → K1, K2.
- V → V.
- Softmax → Softmax(Q1K1ᵀ) - λ·Softmax(Q2K2ᵀ), applied to V, where λ is a learnable scalar.
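A minimal sketch of this differential attention, assuming a single head and a single learnable scalar λ; the paper's multi-head version with per-head GroupNorm and λ re-parameterization is omitted, and the class/parameter names here are mine, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    def __init__(self, d_model: int, lambda_init: float = 0.8):
        super().__init__()
        # Q and K each get two projections; V stays single.
        self.w_q1 = nn.Linear(d_model, d_model, bias=False)
        self.w_q2 = nn.Linear(d_model, d_model, bias=False)
        self.w_k1 = nn.Linear(d_model, d_model, bias=False)
        self.w_k2 = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.lam = nn.Parameter(torch.tensor(lambda_init))  # learnable λ
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.w_q1(x), self.w_q2(x)
        k1, k2 = self.w_k1(x), self.w_k2(x)
        v = self.w_v(x)
        attn1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        attn2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        # Differential attention map: subtract the second softmax map,
        # scaled by λ, then apply it to V.
        return (attn1 - self.lam * attn2) @ v


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    print(DiffAttention(64)(x).shape)  # torch.Size([2, 16, 64])
```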
Motivation
- SNR (Signal-to-Noise Ratio): standard attention spends non-trivial score on irrelevant context, i.e., attention noise.
- Noise cancelling: subtracting the two softmax maps cancels this common noise, analogous to noise-cancelling headphones.
Questions on HF (Hugging Face paper page)
Q: If attention 1 learns S + N_1 and attention 2 learns S + N_2 (where S is the signal and N_1, N_2 are different noises), then subtracting the two cancels the signal S, while the noise becomes N_1 - N_2, which could be even more complicated.
A: The model learns what is signal and what is noise. Note that attention_1 and attention_2 are both computed with learnable parameters, so they can "perceive" each other during training and adjust to each other to achieve a lower loss.
Q: Exciting work! Do the authors plan to release model weights on Hugging Face?
A: No.
My Idea
Question: Why do we need 2x the params for Q (Q1, Q2) and K (K1, K2)? → How about using LoRA?
Idea: Instead of training separate Q1/Q2 and K1/K2, reuse the Q/K/V projections from a pretrained model and add LoRA adapters on Q and K → treat Q as Q1 and Q + LoRA(Q) as Q2 (likewise K as K1, K + LoRA(K) as K2); see the sketch below.
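A minimal sketch of this LoRA variant, assuming the pretrained Q/K/V projections are frozen; the class names, rank r, and scaling alpha are illustrative choices of mine, not from the paper or an existing repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRA(nn.Module):
    """Low-rank adapter: delta(x) = (alpha / r) * x @ A @ B."""
    def __init__(self, d_model: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.a = nn.Parameter(torch.randn(d_model, r) * 0.01)
        self.b = nn.Parameter(torch.zeros(r, d_model))  # zero-init: starts as identity behavior
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.a @ self.b) * self.scaling


class LoRADiffAttention(nn.Module):
    def __init__(self, d_model: int, lambda_init: float = 0.8):
        super().__init__()
        # Pretrained projections (frozen); reused directly as Q1 / K1.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        for proj in (self.w_q, self.w_k, self.w_v):
            proj.weight.requires_grad_(False)
        # Trainable adapters: Q2 = Q1 + LoRA_q(x), K2 = K1 + LoRA_k(x).
        self.lora_q = LoRA(d_model)
        self.lora_k = LoRA(d_model)
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q1, k1, v = self.w_q(x), self.w_k(x), self.w_v(x)
        q2 = q1 + self.lora_q(x)
        k2 = k1 + self.lora_k(x)
        attn1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        attn2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        return (attn1 - self.lam * attn2) @ v
```

This keeps the parameter count close to the pretrained model: only the low-rank A/B matrices and λ are trained, instead of doubling the Q and K projection weights.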
Implementation
[repo is still private]