Prompt Tuning for Generative Multimodal Pretrained Models

Tags
Paper review
NLP
TLDR paper review
Published August 5, 2022
  • Alibaba
  • Submitted on 4 Aug 2022
    • Comments: Work in progress

Paper TL;DR

Ordinary prompt tuning, but with visual encoding thrown on top.
  • Prompt tuning for multimodal end-to-end models
  • Applies prompt tuning instead of fine-tuning
  • Rather than targeting encoder-only or decoder-only models, it targets Encoder-Decoder (unified transformer) models
    • Adding prompt embeddings to both the encoder and the decoder works better than adding them to only one side. (The gap is reportedly very large.)
    • If the prompt can go into only one of the two, the encoder is the better choice.
  • A relevant paper in the context of zero-shot, few-shot, and in-context learning
  • Only about 1% of the total weights are updated (1M params for a 100M model, roughly 9M for a 930M model)
  • Longer prompts work better
    • Prompts of 20 tokens or more give meaningful results.
    • Prompts that are too long hurt performance instead. (Beyond 128 tokens it apparently stops paying off.)
    • Empirically, a 64-token prompt reportedly worked best.
  • Reparameterization (adding extra trainable layers such as an MLP, as in ExBERT) performed worse than expected
    • In some cases it even lowered performance
  • The prompt embedding matrix is treated as a kind of prompt generator function
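The "about 1% of the weights" figure above can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the hidden size, layer count, and the choice of one prompt per layer are assumptions made for the example, not values taken from the paper.

```python
def prompt_param_count(prompt_length, hidden_dim, num_layers=1):
    """Trainable values in a prompt of `prompt_length` vectors of size
    `hidden_dim`. `num_layers` models one prompt per layer (an assumption);
    use 1 for a prompt that is added at the input only."""
    return prompt_length * hidden_dim * num_layers


def trainable_fraction(prompt_params, backbone_params):
    """Share of all parameters that receive gradient updates when the
    backbone is frozen and only the prompt is trained."""
    return prompt_params / (backbone_params + prompt_params)


# Hypothetical configuration, chosen only to make the arithmetic concrete.
backbone = 100_000_000
enc = prompt_param_count(prompt_length=64, hidden_dim=768, num_layers=12)
dec = prompt_param_count(prompt_length=64, hidden_dim=768, num_layers=12)
frac = trainable_fraction(enc + dec, backbone)
print(f"{enc + dec:,} prompt params, {frac:.2%} of the total")
```

With these made-up settings the prompt lands at a little over 1% of a 100M-parameter backbone, which is the order of magnitude the notes mention.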

Experiments & Results

  • In the comparison, fine-tuning and prompt tuning show only a small performance gap.
    • Experiments cover tasks such as VQA that involve both NLU and NLG
    • Model sizes range from 180M to 470M
  • Also compared against the similar methods BitFit and Adapter. The paper reports that prompt tuning performs better.
  • Downstream task performance as a function of prompt length
    • Performance is reportedly best at around 60 (64) tokens (the effect looks most pronounced on the SNLI-VE test set)
    • 64 is the best on average
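Picking a length that is "best on average" amounts to averaging each candidate's scores across tasks and taking the maximum. A minimal sketch follows; the task names and scores are placeholder values invented for the example, not results from the paper.

```python
from statistics import mean


def best_prompt_length(scores_by_length):
    """Return the prompt length with the highest mean score across tasks,
    together with the per-length averages."""
    averages = {
        length: mean(task_scores.values())
        for length, task_scores in scores_by_length.items()
    }
    best = max(averages, key=averages.get)
    return best, averages


# Placeholder numbers only: they illustrate the selection rule and say
# nothing about the actual measurements in the paper.
toy_scores = {
    16: {"task_a": 0.5, "task_b": 0.75},
    64: {"task_a": 0.75, "task_b": 1.0},
    128: {"task_a": 0.75, "task_b": 0.5},
}
best, averages = best_prompt_length(toy_scores)
print(best, averages)
```

Averaging treats every task equally. If the tasks use metrics on different scales, the scores would need normalizing before this comparison means much.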
  • One advantage of prompt tuning over fine-tuning: it is more robust to adversarial attacks.
    • The fine-tuned model and the prompt-tuned model are each attacked
    • The adversarial attack is gradient-based
    • The performance drop is much smaller (about half)
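The notes only say the attack is gradient-based. As one well-known example of that family, the sketch below uses FGSM (fast gradient sign method) on a toy logistic model. Both the choice of FGSM and the toy model are assumptions made for illustration and may differ from the paper's setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Logistic loss for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-margin))


def loss_grad_wrt_input(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y * sigmoid(-margin)
    return [scale * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, x, y, epsilon):
    """Move each input coordinate by epsilon in the direction that
    increases the loss."""
    grad = loss_grad_wrt_input(w, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Toy weights and input, chosen arbitrarily.
w = [1.5, -2.0, 0.5]
x = [0.2, -0.4, 0.1]
y = 1
x_adv = fgsm(w, x, y, epsilon=0.1)
print(loss(w, x, y), loss(w, x_adv, y))
```

A robustness comparison of the kind described above would apply the same perturbation budget (`epsilon` here) to both the fine-tuned and the prompt-tuned model and compare how far each one's accuracy falls.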
  • As mentioned earlier: Enc + Dec vs. Enc vs. Dec
  • Performance differs depending on where the prompt is inserted
    • The difference does look meaningful (the improvement trend is consistent)
    • Inserting into both the encoder and the decoder works best.
    • If that is not possible, use at least encoder only.
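The three placements compared above can be pictured as prepending learnable vectors to the encoder input, the decoder input, or both. The sketch below shows only that input-level idea with plain lists. The helper names are hypothetical, and the paper's actual mechanism may differ, for example by inserting prompts at every layer.

```python
import random


def init_prompt(length, hidden_dim, seed):
    """Randomly initialised prompt vectors. In real training these would be
    the only trainable values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)] for _ in range(length)]


def with_prompt(prompt, embeddings):
    """Prepend prompt vectors to a sequence of embedding vectors."""
    return prompt + embeddings if prompt else list(embeddings)


def build_inputs(enc_emb, dec_emb, enc_prompt=None, dec_prompt=None):
    """Return (encoder_input, decoder_input) for a given prompt placement."""
    return with_prompt(enc_prompt, enc_emb), with_prompt(dec_prompt, dec_emb)


hidden = 8
enc_emb = [[0.0] * hidden for _ in range(5)]  # stand-in for image + text embeddings
dec_emb = [[0.0] * hidden for _ in range(3)]  # stand-in for target token embeddings
enc_prompt = init_prompt(4, hidden, seed=0)
dec_prompt = init_prompt(4, hidden, seed=1)

placements = {
    "enc+dec": build_inputs(enc_emb, dec_emb, enc_prompt, dec_prompt),
    "enc only": build_inputs(enc_emb, dec_emb, enc_prompt=enc_prompt),
    "dec only": build_inputs(enc_emb, dec_emb, dec_prompt=dec_prompt),
}
for name, (enc_in, dec_in) in placements.items():
    print(name, len(enc_in), len(dec_in))
```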
  • Reparameterization that adds an MLP or similar layers to the model
    • The MLP sometimes looks slightly better, but there are quite a few cases where performance drops instead (SNLI-VE dev/test, COCO, VQA)
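Reparameterization here means the prompt that the model sees is the output of a small trainable network applied to a raw embedding matrix, rather than the raw matrix itself. A minimal sketch follows, assuming a two-layer tanh MLP with a bottleneck. The architecture and sizes are illustrative guesses, not the paper's configuration.

```python
import math
import random


def init_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def reparameterize(raw_prompt, w1, b1, w2, b2):
    """Map each raw prompt vector through a two-layer MLP (tanh in between).
    The result has the same shape as the raw prompt."""
    out = []
    for vec in raw_prompt:
        hidden = [math.tanh(h) for h in linear(w1, b1, vec)]
        out.append(linear(w2, b2, hidden))
    return out


rng = random.Random(0)
prompt_len, hidden_dim, bottleneck = 4, 8, 3  # toy sizes
raw_prompt = init_matrix(prompt_len, hidden_dim, rng)
w1, b1 = init_matrix(bottleneck, hidden_dim, rng), [0.0] * bottleneck
w2, b2 = init_matrix(hidden_dim, bottleneck, rng), [0.0] * hidden_dim
prompt = reparameterize(raw_prompt, w1, b1, w2, b2)
print(len(prompt), len(prompt[0]))
```

In this setup the MLP weights are trained together with the raw prompt, so the variant adds parameters and one more design choice to tune.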

Discussion

  • Prompt tuning cannot replace fine-tuning, but it does reach meaningful performance
  • Problem 1: SLOOOOOOW convergence
    • Granted, only about 1% of the params are trained, so the training cost is lower and efficiency is higher.
    • But training it properly takes 40 epochs (?!)
    • Measured in GPU-hours, the savings are not that dramatic
  • Problem 2: hyperparameter tuning is hard
    • The hyperparameter tuning practices used for fine-tuning are hard to reuse
    • Fortunately(?), tuning the hyperparameters for prompt tuning itself is not especially difficult
  • Despite these problems, the robustness to adversarial attacks is a plus
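The GPU-hours point can be made concrete with back-of-the-envelope arithmetic: a cheaper epoch does not help much if many more epochs are needed. In the sketch below only the 40-epoch figure comes from the notes above. The fine-tuning epoch count and the per-epoch times are hypothetical placeholders.

```python
def gpu_hours(epochs, hours_per_epoch, num_gpus=1):
    """Total GPU-hours for a training run."""
    return epochs * hours_per_epoch * num_gpus


def break_even_speedup(prompt_epochs, finetune_epochs):
    """How many times faster per epoch prompt tuning must be for its total
    cost to match fine-tuning."""
    return prompt_epochs / finetune_epochs


# Hypothetical numbers, except the 40 epochs mentioned in the notes.
finetune_cost = gpu_hours(epochs=5, hours_per_epoch=2.0)
prompt_cost = gpu_hours(epochs=40, hours_per_epoch=1.0)
needed = break_even_speedup(prompt_epochs=40, finetune_epochs=5)
print(finetune_cost, prompt_cost, needed)
```

Under these assumed numbers each prompt-tuning epoch would have to be 8 times cheaper than a fine-tuning epoch just to break even on total compute.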