Differential Transformer

Tags
NLP
Paper Review
PLM
Published
October 10, 2024

ArXiv Link
https://arxiv.org/abs/2410.05258

TL;DR

notion image
  • Q → Q1, Q2.
  • K → K1, K2.
  • V → V.
  • SoftMax → softmax(Q1K1ᵀ/√d) − λ·softmax(Q2K2ᵀ/√d), then multiplied by V.
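
A minimal single-head sketch of the TL;DR in PyTorch (the function signature, parameter names such as W_q1, and passing λ as a plain scalar are my own assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def diff_attention(x, W_q1, W_q2, W_k1, W_k2, W_v, lam):
    """Differential attention for one head: subtract two softmax attention maps."""
    q1, q2 = x @ W_q1, x @ W_q2       # two query projections
    k1, k2 = x @ W_k1, x @ W_k2       # two key projections
    v = x @ W_v                       # single value projection
    scale = q1.size(-1) ** 0.5
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / scale, dim=-1)
    # Differential map: the second softmax acts as learned "noise" to cancel.
    return (a1 - lam * a2) @ v
```

In the paper λ is a learnable scalar (re-parameterized during training); here it is passed as a constant only to keep the sketch short.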

Motivation

notion image
  • SNR (Signal-to-Noise Ratio)
  • Noise Cancelling

Questions on HF

Q: If attention 1 learns S + N_1 and attention 2 learns S + N_2 (where S is the signal and N_1, N_2 are different noises), then subtracting the two cancels the signal S while the noise becomes N_1 - N_2, which could be even more complicated.
A: The model learns what is signal and what is noise. Since attention_1 and attention_2 are both computed with learnable parameters, they can "perceive" each other during training and adjust themselves relative to each other to reach a lower loss.
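
Writing the concern out (my own rewriting, assuming the additive signal/noise decomposition from the question and the paper's learnable λ):

```latex
A_1 = S + N_1, \qquad A_2 = S + N_2
\quad\Rightarrow\quad
A_1 - \lambda A_2 = (1 - \lambda)\, S + N_1 - \lambda N_2
```

So the signal only cancels completely when λ = 1; since λ and both attention maps are learned jointly, they can settle where the common-mode noise is suppressed without destroying S, which is the gist of the answer above.
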
"No, we won't release model weights on hugging face.” from https://huggingface.co/papers/2410.05258#67063cd7dd0bb3a03ba1f0b2
"No, we won't release model weights on hugging face.” from https://huggingface.co/papers/2410.05258#67063cd7dd0bb3a03ba1f0b2
Q: Exciting work! Do the authors plan to release model weights on hugging face?
A: No.

My Idea

Question: Why do we need 2x the params for Q (Q1, Q2)? → How about using LoRA?
Idea: Instead of learning separate Q1/Q2 and K1/K2, reuse Q/K/V from the pretrained model and add LoRA adapters on Q and K → treat Q as Q1 and Q + LoRA(Q) as Q2 (and likewise K as K1, K + LoRA(K) as K2); a rough sketch follows the figure below.
notion image
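
A rough PyTorch sketch of this idea, assuming a frozen pretrained attention layer; the class name, the rank, and the adapter layout are placeholders of mine, not code from the paper or its repo:

```python
import torch
import torch.nn as nn

class LoRADiffQK(nn.Module):
    """Derive Q1/Q2 and K1/K2 from frozen pretrained W_q/W_k plus low-rank adapters."""
    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        # Pretrained projections (would be loaded from the base model), kept frozen.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        for p in list(self.w_q.parameters()) + list(self.w_k.parameters()):
            p.requires_grad = False
        # LoRA adapters: only these low-rank matrices are trained.
        self.q_a = nn.Linear(d_model, rank, bias=False)
        self.q_b = nn.Linear(rank, d_model, bias=False)
        self.k_a = nn.Linear(d_model, rank, bias=False)
        self.k_b = nn.Linear(rank, d_model, bias=False)
        # Standard LoRA init: B starts at zero, so Q2 == Q1 and K2 == K1 at init.
        nn.init.zeros_(self.q_b.weight)
        nn.init.zeros_(self.k_b.weight)

    def forward(self, x: torch.Tensor):
        q1, k1 = self.w_q(x), self.w_k(x)      # Q1/K1: pretrained projections as-is
        q2 = q1 + self.q_b(self.q_a(x))        # Q2 = Q + LoRA(Q)
        k2 = k1 + self.k_b(self.k_a(x))        # K2 = K + LoRA(K)
        return q1, q2, k1, k2
```

These Q1/Q2/K1/K2 could then feed the differential attention sketch from the TL;DR, so that only the LoRA adapters (and λ) would need to be trained on top of the pretrained model.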

Implementation

[repo is still private]