DeBERTa: Decoding-enhanced BERT with Disentangled Attention

💡

This review is based on v5 of the DeBERTa paper revisions.

Abstract
DeBERTa
Disentangled Attention
Enhanced Mask Decoder accounts for Absolute Word Positions
Scale Invariant Fine-Tuning
Virtual Adversarial training
Experiments
Large Model: Train Config, Performance
Base Model: Train Config, Performance
Model Analysis: Why better performance?
Ablation Study
Model Scale up to 1.5B
The End.

Abstract

DeBERTa!

Decoding-Enhanced BERT with disentangled Attention

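DeBERTa has two main components (so the paper claims, but I count three):

Token Vector = (Content Embedding, (relative) Positional Embedding)

Absolute position information is additionally provided for MLM prediction

SiFT: virtual adversarial training used at the fine-tuning stage

First published in June 2020. It reaches SoTA on GLUE and SuperGLUE, and the scaled-up 1.5B model scores above the human baseline on SuperGLUE.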

DeBERTa

Disentangled Attention

A Two-vector approach to Content and Position embedding

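A token at position i in a sequence = content vector $H_i$, (relative) positional vector $P_{i|j}$ (* j is the position of another token)

Given this, the attention between two tokens (i, j) is the product of the two representations, which expands into a sum of four matrix products: $A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^\top = H_i H_j^\top + H_i P_{j|i}^\top + P_{i|j} H_j^\top + P_{i|j} P_{j|i}^\top$

The four terms are content-to-content, content-to-position, position-to-content, and position-to-position, but...

Because relative position embeddings are used, attending between position embeddings adds little information → the position-to-position term is dropped

Standard attention (Attention Is All You Need): $Q = HW_q,\ K = HW_k,\ V = HW_v,\ A = \frac{QK^\top}{\sqrt{d}},\ H_o = \mathrm{softmax}(A)V$

On top of this, add a definition of relative distance with a maximum distance k: $\delta(i,j) = 0$ if $i - j \le -k$; $\delta(i,j) = 2k - 1$ if $i - j \ge k$; $\delta(i,j) = i - j + k$ otherwise.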
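The relative distance used by the paper clips the offset i - j into a window of 2k buckets. A minimal sketch of that rule, assuming the piecewise definition from the paper (0 when i - j <= -k, 2k - 1 when i - j >= k, i - j + k otherwise); the function and argument names are my own, not from the paper's code.

```python
def relative_distance(i: int, j: int, k: int) -> int:
    """Bucketed relative distance from position i to position j.

    The result always lies in [0, 2k), so it can index a table
    of 2k relative position embeddings.
    """
    diff = i - j
    if diff <= -k:
        return 0            # j is far to the right of i: clip to the first bucket
    if diff >= k:
        return 2 * k - 1    # j is far to the left of i: clip to the last bucket
    return diff + k         # inside the window: shift so the index is non-negative


# With k = 4, a token's distance to itself falls in the middle bucket.
middle = relative_distance(3, 3, k=4)
```

Everything beyond k positions away shares a single bucket on each side, which is what keeps the embedding table small.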

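Project the content vectors and positional vectors into Qc, Kc, Vc and Qr, Kr respectively... (* No Vr is needed here, because position only enters the attention score computation. Vc is multiplied in after the softmax.)

Then compute the attention score $\tilde{A}$ for each term (c to c, c to p, p to c).

$A_{c \to c}$ is simply $Q_c K_c^\top$, as in standard attention (line 3 of the paper's Algorithm 1)

$A_{c \to p}$: for each position i, first compute $\tilde{A}_{c \to p}[i,:] = Q_c[i,:] K_r^\top$, the content query of the i-th token against every relative position key (line 4), then pick out the actual value for each pair through its relative distance, $A_{c \to p}[i,j] = \tilde{A}_{c \to p}[i, \delta(i,j)]$ (line 8)

The position-to-content attention is computed the same way in the other direction, $\tilde{A}_{p \to c}[:,j] = K_c[j,:] Q_r^\top$ and $A_{p \to c}[i,j] = \tilde{A}_{p \to c}[\delta(j,i), j]$ (lines 12 and 16)

Finally, the sum of the C-C, C-P, and P-C scores is used as the final attention score to produce the hidden vectors (the attention layer output)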
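The steps above can be sketched in NumPy for a single head. This is only an illustration under assumptions: all projection matrices are taken to be square (d x d), the scores are scaled by 1/sqrt(3d) as in the paper, and the function and variable names are mine rather than the official implementation's.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def disentangled_attention(H, P, Wq_c, Wk_c, Wv_c, Wq_r, Wk_r, k):
    """H: (N, d) content states. P: (2k, d) relative position embeddings."""
    N = H.shape[0]
    Qc, Kc, Vc = H @ Wq_c, H @ Wk_c, H @ Wv_c
    Qr, Kr = P @ Wq_r, P @ Wk_r      # no Vr: position only affects the scores
    d = Qc.shape[1]

    # delta[i, j]: relative distance of j from i, clipped into [0, 2k).
    idx = np.arange(N)
    delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

    A_c2c = Qc @ Kc.T                                     # (N, N)
    # Content-to-position: score every query against all 2k position keys,
    # then gather A_c2p[i, j] = scores[i, delta[i, j]].
    A_c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)
    # Position-to-content: A_p2c[i, j] = (Kc @ Qr.T)[j, delta[j, i]],
    # so gather along each row j and transpose.
    A_p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T

    A = (A_c2c + A_c2p + A_p2c) / np.sqrt(3 * d)
    return softmax(A, axis=-1) @ Vc                        # (N, d)
```

Note that the (N, 2k) score tables are recomputed from the shared position embeddings on every call; only the gather depends on the token pair.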

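Naively, keeping a relative position embedding for every token pair would need $O(N^2 d)$ space,

but every relative distance falls within $[0, 2k)$ → only the $2k$ relative position embeddings are stored and reused ($O(kd)$ space).

As in lines 3-5 above, the relative position scores are computed each time through the Q*K product and then looked up by index.
This recomputes the scores on every pass, but memory use drops sharply.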
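To get a feel for the saving, here is a back-of-the-envelope comparison of the two storage schemes. The sizes N, k, and d below are illustrative values I picked, not figures quoted in these notes, and the helper names are hypothetical.

```python
def naive_rel_pos_floats(n_tokens: int, d: int) -> int:
    # One d-dimensional relative position vector per (i, j) token pair.
    return n_tokens * n_tokens * d


def shared_rel_pos_floats(k: int, d: int) -> int:
    # One d-dimensional vector per relative-distance bucket in [0, 2k).
    return 2 * k * d


N, k, d = 512, 512, 1024          # illustrative sizes only
naive = naive_rel_pos_floats(N, d)
shared = shared_rel_pos_floats(k, d)
saving = naive // shared          # how many times smaller the shared table is
```

The shared table does not grow with sequence length at all, only with the clipping window k.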

Enhanced Mask Decoder accounts for Absolute Word Positions

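DeBERTa is fundamentally trained with MLM, but relying only on relative positional embeddings for MLM causes several issues.

"A new store opened beside the new mall"

With store and mall masked, both are preceded by the same local context word new, so from the model's point of view store and mall are equally plausible predictions for either slot.

But given the word order and the subject of the sentence, store is the main word, so swapping the two would be wrong.

So absolute position has to be taken into account.

There are two broad ways to give the model absolute position:

Like BERT, add it to the word embeddings from the start (added at the input layer)

In DeBERTa, add it right before the MLM softmax (added just before the output layer)

Called "Enhanced Mask Decoder(EMD)"

(Unsurprisingly) EMD performs better than the first, BERT-style approach.

The paper conjectures that one reason DeBERTa outperforms BERT may be that BERT, by injecting absolute positions early, does not learn enough relative position information.

EMD also makes it possible to pass the model other 'extra information' besides absolute position!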

Scale Invariant Fine-Tuning

A Virtual Adversarial Training algorithm

Virtual Adversarial training

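A regularization method for improving a model's generalization

Perturb the input fed to the model!

In NLP, changing the words themselves would break the tokenizer/embedding lookup → not an option

So the perturbation is applied to the embedded vectors instead!

But the embeddings do not form a clean sphere (their norms vary across words and models),
so the variance of the perturbation needed per word is too large.
The larger the model, the larger this variance becomes.

SiFT is proposed to solve this problem: apply the perturbation to normalized word embeddings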

Normalize word emb → Stochastic vectors

Perturbation to normalized vectors

Note that the paper only applies it to DeBERTa in its experiments (← perhaps because it makes no difference on smaller models?)
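The normalize-then-perturb idea can be sketched as follows. This is a simplified illustration, not the paper's procedure: I assume a layer-norm style normalization over the embedding dimension, and I use fixed-norm random noise as a stand-in for the adversarial perturbation, which in real virtual adversarial training is chosen by maximizing the change in the model's output. All names are hypothetical.

```python
import numpy as np


def layer_normalize(emb, eps=1e-6):
    # Normalize each embedding vector to zero mean and unit variance
    # (assumed normalization; the paper's exact choice may differ).
    mean = emb.mean(axis=-1, keepdims=True)
    std = emb.std(axis=-1, keepdims=True)
    return (emb - mean) / (std + eps)


def perturb_normalized(emb, epsilon=1e-2, rng=None):
    """Return (normalized embeddings, perturbed normalized embeddings)."""
    rng = np.random.default_rng() if rng is None else rng
    normed = layer_normalize(emb)
    # Random direction rescaled to L2 norm epsilon per token.
    # Stand-in for the adversarial direction, NOT adversarial itself.
    noise = rng.normal(size=normed.shape)
    noise = epsilon * noise / np.linalg.norm(noise, axis=-1, keepdims=True)
    return normed, normed + noise
```

Because every embedding has the same scale after normalization, one epsilon means the same relative disturbance for every word, which is the point of applying the perturbation after normalizing.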

Experiments

Models are trained at Base, Large, and larger scales

The vocab is trained with BPE instead of WordPiece

Datasets used

English Wikipedia: 12GB
Book Corpus: 6GB
OPENWEBTEXT: 38GB
CommonCrawl STORIES: 31GB
All combined and deduplicated: 78GB

Large Model

Train Config

The Large model uses almost the same config as BERT-Large (24 layers, 1024 hidden size)

96 V100 GPUs (6 DGX-2 machines)

2k batch size

1M step

20 days

Performance

GLUE Benchmark

(Unsurprisingly) DeBERTa scores higher across the board

Data efficiency is good too

RoBERTa, XLNet, and ELECTRA are trained on 160GB of text
DeBERTa uses 78GB (about half)

Training efficiency is good too

RoBERTa, XLNet: 500k steps * 8k samples per step = 4B training samples
DeBERTa: 1M steps * 2k samples per step = 2B training samples (half)
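The comparison is just steps times batch size. A quick check of the arithmetic, using the rounded figures from these notes (8k and 2k taken as 8,000 and 2,000); the helper name is my own.

```python
def total_training_samples(steps: int, batch_size: int) -> int:
    # Number of samples the model sees over the whole pre-training run.
    return steps * batch_size


roberta_xlnet = total_training_samples(500_000, 8_000)
deberta = total_training_samples(1_000_000, 2_000)
ratio = deberta / roberta_xlnet   # fraction of samples DeBERTa sees
```

So DeBERTa takes twice as many optimizer steps but sees half as many samples in total.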

MNLI Benchmark

It also performs well on MNLI.

Base Model

Train Config

A config similar to BERT-base (L=12, H=768, A=12)

64 V100 GPUs (4 DGX-2 machines)

2K(2048) batch size

1M steps take 10 days

Same dataset as Large

Performance

MNLI Benchmark

Evaluated on only three tasks (GLUE was not run)

On all three, it outperforms the earlier models (RoBERTa, XLNet)

Model Analysis: Why better performance?

Ablation Study

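Remove EMD

Remove C2P

Remove P2C

The RoBERTa here is a model re-trained under the same setting for a fair comparison

Removing any one component causes a small performance drop

Removing several causes (unsurprisingly?) a larger drop

So each component appears to be contributing something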

Model Scale up to 1.5B

DeBERTa also comes in a huge 1.5B-parameter version.

→ 48 Layers, 1536 Hidden, 24 Att Heads

Same dataset

Vocab expanded to 128K tokens (a newly built vocab)
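As a sanity check, the listed config does land near 1.5B parameters under a standard Transformer encoder estimate. This sketch assumes a feed-forward width of 4x the hidden size and a vocab of exactly 128,000, and it ignores biases, layer norms, and the relative position parameters, so treat the result as a rough figure. The helper name is hypothetical.

```python
def approx_encoder_params(layers: int, hidden: int, vocab: int) -> int:
    # Per layer: 4*h^2 for the Q, K, V, and output projections,
    # plus 8*h^2 for a feed-forward block with a 4h intermediate size.
    per_layer = 12 * hidden * hidden
    embeddings = vocab * hidden        # token embedding table
    return layers * per_layer + embeddings


estimate = approx_encoder_params(layers=48, hidden=1536, vocab=128_000)
estimate_in_billions = estimate / 1e9
```

The estimate comes out at roughly 1.56B, of which the enlarged vocab accounts for about 0.2B.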

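Several techniques are proposed for efficiency as the model gets larger

Share the projection matrices for the relative position embeddings → $W_{k,r}$ and $W_{q,r}$ are shared with $W_{k,c}$ and $W_{q,c}$

This reduces model size with almost no performance loss (see Table 13)

Add a convolution layer alongside the first Transformer layer → to better capture N-gram level sub-word information

The 1.5B model is the first to surpass the human score on SuperGLUE! (as a single model)

T5 is huge at 11B parameters, so the model is roughly 7x larger (11B / 1.5B ≈ 7.3), yet performance is similar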