Longformer

Table of Contents

What is the problem?
  Transformer & Self-Attention
  Related work
    Transformer-XL (ACL 2019)
    Adaptive Span (ACL 2019)
    Compressive
    Reformer
    Sparse Transformer
    Routing
    BP-Transformer
    Blockwise
How is it solved?
  Longformer = Windowed Local Attention + Global Attention
  Windowed Local Attention
    Sliding Window
    Dilated Window
  Global Attention
    Linear Projection for Global Attention
  Longformer implementations
Longformer for AutoRegressive LM
  Attention Pattern for AR LM
  Training for AR LM
  Longformer performance for AR LM
  Ablation Study
Longformer for Pretraining & Finetuning (MLM)
  Attention Pattern for MLM
  Position Embedding for MLM
  Training for MLM
  BPC performance for MLM
  Downstream task performance for MLM
Longformer-Encoder-Decoder, a.k.a. "LED"
Longformer on Transformers 🤗
  Training Longformer
  Longformer Tokenizer
Longformer for Seq Classification: hands-on with IMDB

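What is the problem?

Transformer & Self-Attention

The Transformer is what lies behind the performance gains of PLMs (pretrained language models).

Within it, Self-Attention makes computation and memory consumption grow quadratically, $O(N^2)$, with the sequence length N.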
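The quadratic growth can be made concrete with a little arithmetic. The sketch below only counts query-key scores for a single head, single layer, and batch size 1, and assumes float32 storage (4 bytes per score); `attention_score_count` is a name made up for this illustration.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of query-key scores in full self-attention for one head."""
    return seq_len * seq_len


for n in (512, 4096):
    scores = attention_score_count(n)
    # assumption: float32 scores, one head, one layer, batch size 1
    mib = scores * 4 / 2**20
    print(f"N={n}: {scores:,} scores, {mib:.0f} MiB")
```

Going from 512 to 4,096 tokens is an 8x longer input but a 64x larger score matrix.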

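BERT capped the maximum sequence length at 512 tokens → most later papers were also developed with the 512 limit

However, there are cases where a Transformer has to be applied at the document level rather than the sentence level (e.g., document QA)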

Related work

Many models were proposed in 2019~2020 to address this issue

Longformer v1 was released in April 2020

The LTR (Left-to-Right) models above are autoregressive language models

They are not models built for transfer learning!
They cannot benefit from bidirectional context

Longformer is designed as a drop-in replacement for the standard Transformer layer

Transformer-XL (ACL 2019)

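Proposes a recurrent variant of the Transformer

When modeling consecutive segments, each segment is not modeled independently (the conventional approach); instead, modeling a given segment reuses information from the previous segment (the hidden states of each layer)

Dependencies across segments can then be captured, which removes the fixed-length dependency limit and also solves the context fragmentation problem

Language modeling: given the tokens before time step t ($x_{<t}$), predict the token at time step t

During training, each segment's computed results are stored (fixed/cached) so that the next segment can use them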
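The caching idea can be sketched with a single attention layer in NumPy. This is a simplified illustration, not the Transformer-XL implementation: it has no learned projections and no relative position embeddings, and the names `attend_with_memory`, `segments`, and `memory` are made up here.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_with_memory(segment, memory):
    """Single-head causal attention whose keys/values are [memory; segment].

    segment: (seg_len, d) states of the current segment
    memory:  (mem_len, d) cached states of the previous segment (may be empty)
    """
    context = np.concatenate([memory, segment], axis=0)
    scores = segment @ context.T / np.sqrt(segment.shape[-1])
    mem_len, seg_len = memory.shape[0], segment.shape[0]
    # token i sees all of the memory plus segment tokens <= i
    future = np.triu(np.ones((seg_len, seg_len), dtype=bool), k=1)
    scores[:, mem_len:][future] = -np.inf
    return softmax(scores) @ context


rng = np.random.default_rng(0)
d, seg_len = 8, 4
segments = [rng.normal(size=(seg_len, d)) for _ in range(3)]

memory = np.zeros((0, d))  # nothing cached before the first segment
outputs = []
for seg in segments:
    outputs.append(attend_with_memory(seg, memory))
    # cache this segment's states; they are held fixed for the next segment
    memory = seg
```

Without the cache, the first token of each segment could only attend to itself; with it, the token also sees the whole previous segment.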

Adaptive Span (ACL 2019)

"Unlike the original Transformer architecture, it uses caching of previous representations and relative position embeddings to better adapt to sequential tasks."

"making it possible to increase the context size to 8k tokens without increasing computation time and memory footprint significantly"

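As layers get higher, the span that attention can cover grows, as in the figure above

With All-attention, Self-Attention and the FF (feed-forward) sublayer are merged and attention is computed jointly over both, as in the figure above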

Compressive

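Inspired by the fact that humans do not remember everything; they remember selectively

So while Transformer-XL drops information older than a certain point, Compressive keeps it in compressed form

That is, the sequence lives not only in the hidden vectors but also in a compressed memory

Performance on rare words improves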
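The "compress instead of drop" behaviour can be sketched as a FIFO memory with a second, compressed store. Mean pooling is used as the compression function purely for illustration (it is one simple choice, assumed here), and `update_memories`, `mem_size`, and `rate` are hypothetical names.

```python
import numpy as np


def update_memories(memory, compressed, new_states, mem_size, rate):
    """Push new_states into a FIFO memory; compress evicted states instead of dropping them."""
    memory = np.concatenate([memory, new_states], axis=0)
    overflow = max(memory.shape[0] - mem_size, 0)
    evicted, memory = memory[:overflow], memory[overflow:]
    # mean-pool every `rate` evicted states into one compressed slot
    # (a remainder smaller than `rate` is simply discarded in this sketch)
    usable = (evicted.shape[0] // rate) * rate
    if usable:
        pooled = evicted[:usable].reshape(-1, rate, evicted.shape[1]).mean(axis=1)
        compressed = np.concatenate([compressed, pooled], axis=0)
    return memory, compressed


rng = np.random.default_rng(0)
d = 4
segments = [rng.normal(size=(6, d)) for _ in range(3)]
memory, compressed = np.zeros((0, d)), np.zeros((0, d))
for seg in segments:
    memory, compressed = update_memories(memory, compressed, seg, mem_size=6, rate=3)

print(memory.shape, compressed.shape)
```

A Transformer-XL-style memory would correspond to discarding `evicted` entirely.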

Reformer

Presented in the previous session!

Slightly modifies the Transformer architecture for better memory efficiency 👍

Sparse Transformer

Applies attention to images

Provides strided and fixed attention patterns

See: Sparse Transformer

Routing

Efficient Content-Based Sparse Attention with Routing Transformers

"Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to $O(n^{1.5}d)$ from $O(n^2d)$ for sequence length n and hidden dimension d."

Learns dynamic sparse attention

Uses a flexible attention pattern instead of a fixed one

The goal is to avoid attending to keys that are irrelevant to the query

Queries and keys are fed into k-means clustering
Queries and keys attend to each other only when they fall in the same cluster
That is, a different attention pattern is generated for every sequence.
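The clustering rule above can be sketched as a boolean attention mask. In this illustration the centroids are fixed random vectors (the paper learns them with online k-means), and `routing_attention_mask` is a name made up here.

```python
import numpy as np


def routing_attention_mask(queries, keys, centroids):
    """mask[i, j] is True when query i and key j are assigned to the same cluster."""
    def assign(x):
        # nearest centroid by squared Euclidean distance
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

    return assign(queries)[:, None] == assign(keys)[None, :]


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))          # 8 tokens, dimension 4
centroids = rng.normal(size=(2, 4))  # assumption: 2 fixed clusters for the demo
mask = routing_attention_mask(x, x, centroids)
print(mask.astype(int))
```

Because the assignment depends on the content of `x`, a different input sequence yields a different mask.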

BP-Transformer

"BPT yields $O(k \cdot n \log(n/k))$ connections where k is a hyperparameter to control the density of attention."

A model that combines the Transformer with a GNN

Blockwise

For sparse attention, masking matrices like the ones shown on the right are built and applied differently per head

That is, each head attends to a different set of tokens.

How is it solved?

💡

The Longformer v1 paper came out in 2020.04, and many long-sequence Transformers appeared after it. This review is based on the v2 paper from 2020.12. (So it also includes comparisons with papers published after v1, such as BigBird.)

Longformer = Windowed Local Attention + Global Attention

Longformer uses two kinds of attention (both different from full attention) whose cost scales as $O(N)$ with the sequence length.

Windowed Local Attention

Longformer tests patterns b, c, and d from the figure above.

Sliding Window

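The token at position n attends only to positions n-w/2 ... n+w/2

Typically w/2 tokens to the left and w/2 tokens to the right

A fixed window size is used

Lower layers see only the neighboring words (within the window), but because the inputs to a higher layer are already the result of attending over a window, the final layer effectively sees the whole sequence. (figure above)

That is, layer L effectively attends over a span of about L*w tokens.

In the actual Longformer, lower layers use a small w → locality of attention

Higher layers use a large w → understanding of long sequences

Varying w per layer in this way improves performance further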
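The window rule and the growth of the receptive field across layers can be checked with a small mask-based sketch. It builds a dense N x N mask for readability only (a real implementation avoids materializing it), and `sliding_window_mask` and `receptive_field` are names made up for this illustration.

```python
import numpy as np


def sliding_window_mask(seq_len: int, w: int) -> np.ndarray:
    """mask[i, j] is True when token i attends to token j, i.e. |i - j| <= w // 2."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2


def receptive_field(mask: np.ndarray, num_layers: int) -> np.ndarray:
    """reach[i, j] is True when token j can influence token i after num_layers layers."""
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(num_layers):
        reach = (reach.astype(int) @ mask.astype(int)) > 0
    return reach


mask = sliding_window_mask(seq_len=64, w=4)
print(mask[32].sum())                      # tokens seen by position 32 in one layer
print(receptive_field(mask, 3)[32].sum())  # tokens reachable after 3 layers
```

With w = 4, one layer sees 5 tokens (w plus the token itself) and three layers reach 13 tokens, which matches the L*w growth described above.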

Sliding implementation

At the very beginning or end of the sequence there are fewer tokens than the window size → handled with padding


# https://github.com/allenai/longformer/blob/master/longformer/sliding_chunks.py#L40
# NOTE: _chunk, _skew and mask_invalid_locations are helpers from the same repository (allenai/longformer).
import torch

def sliding_chunks_matmul_qk(q: torch.Tensor, k: torch.Tensor, w: int, padding_value: float):
    '''Matrix multiplication of query x key tensors using a sliding window attention pattern.
    This implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer)
    with an overlap of size w'''
    bsz, seqlen, num_heads, head_dim = q.size()
    assert seqlen % (w * 2) == 0
    assert q.size() == k.size()

    chunks_count = seqlen // w - 1

    # group bsz and num_heads dimensions into one, then chunk seqlen into chunks of size w * 2
    q = q.transpose(1, 2).reshape(bsz * num_heads, seqlen, head_dim)
    k = k.transpose(1, 2).reshape(bsz * num_heads, seqlen, head_dim)

    chunk_q = _chunk(q, w)
    chunk_k = _chunk(k, w)

    # matrix multiplication
    # bcxd: bsz*num_heads x chunks x 2w x head_dim
    # bcyd: bsz*num_heads x chunks x 2w x head_dim
    # bcxy: bsz*num_heads x chunks x 2w x 2w
    chunk_attn = torch.einsum('bcxd,bcyd->bcxy', (chunk_q, chunk_k))  # multiply

    # convert diagonals into columns
    diagonal_chunk_attn = _skew(chunk_attn, direction=(0, 0, 0, 1), padding_value=padding_value)

    # allocate space for the overall attention matrix where the chunks are combined. The last dimension
    # has (w * 2 + 1) columns. The first (w) columns are the w lower triangles (attention from a word to
    # w previous words). The following column is attention score from each word to itself, then
    # followed by w columns for the upper triangle.

    diagonal_attn = diagonal_chunk_attn.new_empty((bsz * num_heads, chunks_count + 1, w, w * 2 + 1))

    # copy parts from diagonal_chunk_attn into the combined matrix of attentions
    # - copying the main diagonal and the upper triangle
    diagonal_attn[:, :-1, :, w:] = diagonal_chunk_attn[:, :, :w, :w + 1]
    diagonal_attn[:, -1, :, w:] = diagonal_chunk_attn[:, -1, w:, :w + 1]
    # - copying the lower triangle
    diagonal_attn[:, 1:, :, :w] = diagonal_chunk_attn[:, :, - (w + 1):-1, w + 1:]
    diagonal_attn[:, 0, 1:w, 1:w] = diagonal_chunk_attn[:, 0, :w - 1, 1 - w:]

    # separate bsz and num_heads dimensions again
    diagonal_attn = diagonal_attn.view(bsz, num_heads, seqlen, 2 * w + 1).transpose(2, 1)

    mask_invalid_locations(diagonal_attn, w, 1, False)
    return diagonal_attn

There is also an implementation in which the attention chunks do not overlap

Speed: 30% faster than "sliding_chunks"
Memory: 95% of the memory usage of "sliding_chunks"


# https://github.com/allenai/longformer/blob/master/longformer/sliding_chunks.py#L150
import torch
import torch.nn.functional as F

def sliding_chunks_no_overlap_matmul_qk(q: torch.Tensor, k: torch.Tensor, w: int, padding_value: float):
    bsz, seqlen, num_heads, head_dim = q.size()
    assert seqlen % w == 0
    assert q.size() == k.size()
    # chunk seqlen into non-overlapping chunks of size w
    chunk_q = q.view(bsz, seqlen // w, w, num_heads, head_dim)
    chunk_k = k.view(bsz, seqlen // w, w, num_heads, head_dim)
    chunk_k_expanded = torch.stack((
        F.pad(chunk_k[:, :-1], (0, 0, 0, 0, 0, 0, 1, 0), value=0.0),
        chunk_k,
        F.pad(chunk_k[:, 1:], (0, 0, 0, 0, 0, 0, 0, 1), value=0.0),
    ), dim=-1)
    diagonal_attn = torch.einsum('bcxhd,bcyhde->bcxhey', (chunk_q, chunk_k_expanded))  # multiply
    return diagonal_attn.reshape(bsz, seqlen, num_heads, 3 * w)

Dilated Window

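A way to enlarge the receptive field while keeping the same computational cost as the sliding window

When attending, skip a few positions instead of attending to the immediately adjacent tokens!

The skipped tokens in between cannot be attended to in this layer, but they get attended to in the next layer (or a layer above it)

If the gap between attended positions is d..

Without gaps, the model could attend over a span of L * w,
with gaps it can attend over a total span of L * d * w.

Also, giving each head of multi-head attention a different d → improves model performance further.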
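A dilated window can be sketched as a mask in the same style as the plain sliding window. The function name `dilated_window_mask` and the concrete values of `w` and `d` are illustrative assumptions.

```python
import numpy as np


def dilated_window_mask(seq_len: int, w: int, d: int) -> np.ndarray:
    """mask[i, j] is True when j is one of the w // 2 positions on either side of i,
    taken every d-th token (d = 1 is the plain sliding window)."""
    idx = np.arange(seq_len)
    offset = idx[:, None] - idx[None, :]
    return (np.abs(offset) <= (w // 2) * d) & (offset % d == 0)


mask = dilated_window_mask(seq_len=64, w=4, d=3)
print(np.flatnonzero(mask[32]))  # positions attended to by token 32

# a different dilation per head, as suggested above (values are arbitrary)
per_head_masks = [dilated_window_mask(64, 4, d) for d in (1, 2, 3)]
```

Each token still attends to w + 1 positions, so the cost is unchanged, while the covered span widens from w to d * w.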

Global Attention

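For specific tasks such as classification, there are 'special' tokens (at pre-selected positions) that attend to the entire input

For classification, this is e.g. the [CLS] token

A symmetric attention pattern is introduced

That is, the CLS token attends to all other tokens,
and all other tokens attend to the CLS token as well

Because the number of 'special tokens' is very small compared with the whole sequence, the complexity is still $O(N)$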
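The combined local + global pattern, and the claim that it stays linear, can be checked by counting the allowed query-key pairs. This sketch again uses a dense mask only for illustration; `longformer_mask` is a made-up name, and position 0 stands in for the [CLS] token.

```python
import numpy as np


def longformer_mask(seq_len: int, w: int, global_positions) -> np.ndarray:
    """Sliding window of size w plus symmetric global attention."""
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2
    g = list(global_positions)
    mask[g, :] = True  # global tokens attend to every token
    mask[:, g] = True  # every token attends to the global tokens
    return mask


for n in (64, 128):
    m = longformer_mask(n, w=4, global_positions=[0])
    print(n, m.sum(), n * n)  # allowed pairs vs. full attention
```

Doubling the sequence length roughly doubles the number of allowed pairs, whereas full attention quadruples it.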

Linear Projection for Global Attention

The original Transformer attention uses Q, K, V.

Here, Q_s, K_s, V_s are used to compute the sliding window attention, and Q_g, K_g, V_g to compute the global attention.

So there are effectively two kinds of attention

However, both are initialized with the same values


import torch.nn as nn

class LongformerSelfAttention(nn.Module):
    def __init__(self, config, layer_id):
        # [... omitted ...]
        # projections for the sliding window attention (Q_s, K_s, V_s)
        self.query = nn.Linear(config.hidden_size, self.embed_dim)
        self.key = nn.Linear(config.hidden_size, self.embed_dim)
        self.value = nn.Linear(config.hidden_size, self.embed_dim)

        # separate projections for the global attention (Q_g, K_g, V_g)
        self.query_global = nn.Linear(config.hidden_size, self.embed_dim)
        self.key_global = nn.Linear(config.hidden_size, self.embed_dim)
        self.value_global = nn.Linear(config.hidden_size, self.embed_dim)
        # [... omitted ...]
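How the two projection sets might be used together can be sketched in NumPy. This is a single-head, dense-mask illustration and not the repository's implementation; in particular, the way the two paths are combined here (window path for all rows, then global rows recomputed with the global projections) is an assumption, and all names are made up.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def masked_attention(q, k, v, mask):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v


def longformer_attention(x, local_proj, global_proj, w, global_positions):
    """x: (seq_len, d); local_proj / global_proj: dicts of (d, d) matrices under 'q', 'k', 'v'."""
    n = x.shape[0]
    idx = np.arange(n)
    g = list(global_positions)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2
    mask[:, g] = True  # every token also sees the global tokens
    # sliding window path: Q_s, K_s, V_s
    out = masked_attention(x @ local_proj["q"], x @ local_proj["k"], x @ local_proj["v"], mask)
    # global path: rows of the global tokens use Q_g, K_g, V_g over the full sequence
    full = np.ones((len(g), n), dtype=bool)
    out[g] = masked_attention(x[g] @ global_proj["q"], x @ global_proj["k"], x @ global_proj["v"], full)
    return out


rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(16, d))
local_proj = {name: rng.normal(size=(d, d)) for name in ("q", "k", "v")}
# same initial values as the local projections, but separate parameters
global_proj = {name: wt.copy() for name, wt in local_proj.items()}
out = longformer_attention(x, local_proj, global_proj, w=4, global_positions=[0])
```

Keeping `global_proj` as a separate copy means the two sets start identical but can diverge during training.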

Longformer implementations

Loop

Chunks

CUDA → the Longformer authors' implementation

Overall, time and memory grow linearly with the sequence length

💡

Currently, the chunks implementation (pure PyTorch, no custom CUDA kernel) does not support dilated windows (not needed for finetuning); the custom CUDA kernel does

Longformer for AutoRegressive LM

Attention Pattern for AR LM

Uses the dilated sliding window pattern

The window size is set differently per layer

Lower layers: small window size
Higher layers: large window size

This lets the top layers learn high-level representations while the lower layers learn local information

Lower layers do not use dilated sliding windows

With dilation it is hard to learn the local context

In the higher layers, dilation is applied only with a 'small value', and only on 2 heads

Training for AR LM

Focuses on character-level language modeling

Evaluated on the text8 and enwik8 datasets

Early on, the model needs a large number of gradient updates to learn the local context first

Staged training is introduced for this

Initial phase: short sequence length and small window size
In each later phase: window size x2, sequence length x2, lr /2

5 phases in total

Starts from a sequence length of 2,048 → finishes training at a sequence length of 23,040
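The staged schedule can be sketched as a small generator of per-phase settings. The starting window size and learning rate below are placeholder values chosen for illustration, not the paper's settings, and capping the last phase at 23,040 is an assumption made to reconcile the doubling rule with the stated final length.

```python
def staged_schedule(num_phases, start_seq_len, max_seq_len, start_window, start_lr):
    """Double the window size and sequence length and halve the lr at each phase."""
    phases = []
    seq_len, window, lr = start_seq_len, start_window, start_lr
    for phase in range(1, num_phases + 1):
        phases.append({
            "phase": phase,
            "seq_len": min(seq_len, max_seq_len),  # assumption: final phase is capped
            "window": window,
            "lr": lr,
        })
        seq_len, window, lr = seq_len * 2, window * 2, lr / 2
    return phases


schedule = staged_schedule(
    num_phases=5,
    start_seq_len=2048,
    max_seq_len=23040,
    start_window=32,   # hypothetical starting window size
    start_lr=1e-4,     # hypothetical starting learning rate
)
for p in schedule:
    print(p)
```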

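Longformer performance for AR LM

Achieves SoTA on both text8 and enwik8 among small models

Among large models its score is slightly worse, but keep in mind that the other models are about 2x larger

Also note that the other models are not suited to the pretrain-finetune paradigm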

Ablation Study

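To verify that the attention design in which higher layers see a wider span is the right one

Reversed, with lower layers seeing wider and higher layers narrower → worst performance
Using the same window size everywhere falls in between (around the average)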

As for dilation, applying it on 2 heads performs better than not applying it at all

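Longformer for Pretraining & Finetuning (MLM)

For the MLM family, e.g. BERT

A way to apply Longformer to various downstream tasks

Longformer handles a sequence length of 4,096 in total (8x BERT's 512)

Longformer is trained with MLM

Training from scratch is too expensive, so it starts from the RoBERTa checkpoint

RoBERTa is converted into Longformer based on the notebook code below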

https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb


from transformers import RobertaForMaskedLM, RobertaTokenizerFast

def create_long_model(save_model_to, attention_window, max_pos):
    model = RobertaForMaskedLM.from_pretrained('roberta-base')
    tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=max_pos)
    config = model.config

Model creation starts by loading the RoBERTa LM

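Attention Pattern for MLM

Uses the sliding window

window size = 512

Same computing cost as RoBERTa

Position Embedding for MLM

The sequence length grew from 512 → 4096, so naturally there are positions with no pretrained embedding

Instead of learning the position embeddings from scratch, RoBERTa's position embeddings are copied over repeatedly!

BERT's strength comes from its understanding of local context, so copy initialization preserves that local structure

Training converges much faster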


# extend position embeddings
tokenizer.model_max_length = max_pos
tokenizer.init_kwargs['model_max_length'] = max_pos
current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
config.max_position_embeddings = max_pos
assert max_pos > current_max_pos
# allocate a larger position embedding matrix
new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
# copy position embeddings over and over to initialize the new position embeddings
k = 2
step = current_max_pos - 2
while k < max_pos - 1:
    new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
    k += step
model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
model.roberta.embeddings.position_ids.data = torch.tensor([i for i in range(max_pos)]).reshape(1, max_pos)

As above, the max sequence length of the tokenizer and the model is extended.

Along with this, RoBERTa's position embeddings are copied repeatedly to build the new new_pos_embed.


# replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
config.attention_window = [attention_window] * config.num_hidden_layers
for i, layer in enumerate(model.roberta.encoder.layer):
    longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
    longformer_self_attn.query = layer.attention.self.query
    longformer_self_attn.key = layer.attention.self.key
    longformer_self_attn.value = layer.attention.self.value

    longformer_self_attn.query_global = copy.deepcopy(layer.attention.self.query)
    longformer_self_attn.key_global = copy.deepcopy(layer.attention.self.key)
    longformer_self_attn.value_global = copy.deepcopy(layer.attention.self.value)

    layer.attention.self = longformer_self_attn

After that, the new model is created by converting each attention layer into a Longformer attention layer.

Here the global Q, K, V are created (as copies) alongside the Query, Key, Value.

Training for MLM

Trained with FairSeq

Training resumes from the RoBERTa checkpoint

base model
large model

Both use 65K gradient updates (steps) + sequence length 4,096 + batch size 64

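BPC performance for MLM

Measuring MLM BPC during training: compared with random init, just copying the position embeddings already improves BPC dramatically

(the "copy position embeddings" row)

After gradient updates (additional MLM pretraining), performance improves further

Freezing the RoBERTa weights and training only the position embeddings → gives the 1.850 in the last row. But that is not optimal!

So for the best performance it is better to train the whole model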

Downstream task performance for MLM

→ Best among non-GNN models!

WikiHop (acc), SoTA 🌟

TriviaQA (F1), SoTA 🌟

The two above are finetuned in the same way as with BERT

Concatenate the question and the document into one long sequence
Then add a prediction layer suited to the dataset and train

WikiHop: Classification
TriviaQA: custom loss function
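The concatenation step can be sketched on plain token lists. The RoBERTa-style separators (`<s>`, `</s></s>`) and the choice to put global attention on the question tokens are assumptions made for this illustration, and `build_qa_input` is a hypothetical helper, not a library function.

```python
def build_qa_input(question_tokens, document_tokens, max_len=4096):
    """Concatenate question and document into one sequence and mark global positions."""
    # assumption: RoBERTa-style layout  <s> question </s></s> document </s>
    tokens = ["<s>"] + question_tokens + ["</s>", "</s>"] + document_tokens
    tokens = tokens[: max_len - 1] + ["</s>"]
    # assumption: <s> and the question tokens get global attention (1), the rest local (0)
    num_global = 1 + len(question_tokens)
    global_attention_mask = [1 if i < num_global else 0 for i in range(len(tokens))]
    return tokens, global_attention_mask


tokens, global_mask = build_qa_input(
    ["who", "wrote", "it", "?"],
    ["the", "book", "was", "written", "by", "x"],
)
print(tokens)
print(global_mask)
```

A task-specific prediction layer would then be applied on top of the encoder output for this sequence.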

HotpotQA (Joint F1)

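HotpotQA is a multi-hop task
Find the needed paragraphs among 10 (only 2 are relevant; the other 8 are unrelated)
After selecting the evidence, extract the answer
Longformer is not SoTA here, but its modeling structure is much simpler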

Coreference Resolution

OntoNotes (avg F1)

Trained by simply replacing RoBERTa with Longformer

Global attention is not used

Document Classification

IMDB (acc)

Hyperpartisan (F1)

Global attention is applied on the CLS token

Neither task has many 'long documents', so the gain is small

Longformer-Encoder-Decoder, a.k.a. "LED"

💡

Not in v1 (April 2020); added in v2 (December 2020)!

Unlike the original Transformer, which was seq2seq, many recent models use only the encoder

Seq2seq models such as BART and T5 have also appeared

But they are still heavily constrained by a sequence length of 512

Let's build an encoder-decoder model with Longformer!

Encoder → Local + Global Attention

Decoder → Full Self Attention (with Previous decoded locations)

This is exactly LED

BART's architecture is used as-is. Only the Transformer part is replaced with Longformer

Position embeddings are extended from 1K to 16K tokens
Again, this is done by copying and pasting the position embeddings 16 times
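The copy-and-paste extension can be sketched with NumPy. This is a simplified illustration that ignores any reserved offset positions a real checkpoint may have; `extend_position_embeddings` is a made-up name and the embedding width of 16 is arbitrary.

```python
import numpy as np


def extend_position_embeddings(pos_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Tile a (old_len, dim) position embedding matrix until it covers new_len positions."""
    old_len = pos_emb.shape[0]
    repeats = -(-new_len // old_len)  # ceiling division
    return np.tile(pos_emb, (repeats, 1))[:new_len].copy()


rng = np.random.default_rng(0)
old = rng.normal(size=(1024, 16))             # stand-in for pretrained 1K embeddings
new = extend_position_embeddings(old, 16384)  # 16 copies -> 16K positions
print(new.shape)
```

Every block of 1,024 positions starts out identical to the pretrained embeddings, so the local structure inside each block is preserved.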

Tested on the arXiv summarization task

90th percentile length = 14.5K tokens (long enough!)

Encoder: local attention (window size 1,024) + global attention on the first <s> token

Decoder: attention over the entire encoder output + attention over the tokens decoded so far

Trained with teacher forcing

Inference with beam search

SoTA on the task even without additional pretraining (simply copying BART's weights).

Better than BigBird!
(BigBird is a model further pretrained on top of Pegasus. It is also a summarization-specific model)

Just gaining the ability to process longer sequences boosts performance considerably

💡

Judging from the official Longformer GitHub, the authors appear to be experimenting with applying Longformer to T5

Longformer on Transformers 🤗

(Fortunately) Longformer has an implementation in the Huggingface Transformers library! (LED too)

Training Longformer

Like RoBERTa, Longformer can be trained with MLM.

Put MASK tokens in the input, keep the original tokens as the targets, and train through the loss.
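The input/target construction for MLM can be sketched on plain token id lists. The 15% masking rate and the -100 "ignore" label for unmasked positions are common conventions assumed here, the 80/10/10 replacement rule is left out for brevity, and `mask_for_mlm` is a hypothetical helper.

```python
import random


def mask_for_mlm(token_ids, mask_id, mask_prob=0.15, ignore_index=-100, seed=0):
    """Return (inputs, labels): masked inputs, and labels that keep the original ids."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)        # the model sees the mask token
            labels.append(tok)            # the loss is computed against the original token
        else:
            inputs.append(tok)
            labels.append(ignore_index)   # assumption: unmasked positions are excluded from the loss
    return inputs, labels


ids = list(range(10, 60))
inputs, labels = mask_for_mlm(ids, mask_id=0)
print(inputs)
print(labels)
```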

Longformer Tokenizer

Longformer uses the same vocab as RoBERTa, so the tokenizer is the same as well.

The official Longformer GitHub also uses RobertaTokenizer, as shown below.


import torch
from longformer.longformer import Longformer, LongformerConfig
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer

config = LongformerConfig.from_pretrained('longformer-base-4096/') 
# choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
# 'n2': for regular n2 attention
# 'tvm': a custom CUDA kernel implementation of our sliding window attention
# 'sliding_chunks': a PyTorch implementation of our sliding window attention
config.attention_mode = 'sliding_chunks'

model = Longformer.from_pretrained('longformer-base-4096/', config=config)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.model_max_length = model.config.max_position_embeddings

Longformer for Seq Classification: hands-on with IMDB

PyTorch 1.8.0

PyTorch-Lightning 1.3.0

Google Colaboratory

https://colab.research.google.com/drive/1RTNLBGsqI1SAANr1ytYhAcPMa3QBSoBO?usp=sharing