Word-Level Fine-Grained Story Visualization
📸


Tags
NLP
CV
Paper review
Published
August 5, 2022
  • Oxford Univ.
  • [Submitted on 3 Aug 2022]

Paper TL;DR

💡
Not "one sentence to one image" but "multiple sentences (a story) to multiple images", aka "Story Visualization". The paper builds a SoTA model using a new loss and a new sentence representation.
notion image

Story Visualization?

Papers on "text → image" generation keep appearing, and a large number of text-to-image deep learning models have been released, such as the diffusion models DALL-E 2 and Imagen, alongside text-image encoders such as CLIP.
These models perform reasonably well (🤔), but because they generate "one sentence → one image", they struggle to maintain consistency across text that continues over several sentences, such as a story.
The field of 'Story Visualization', first proposed by StoryGAN, goes beyond Text-to-Image and targets Story-to-Images. It also differs slightly from Video Generation: video has to be generated frame by frame, so consecutive images must be visually continuous, whereas in story visualization each image must be "semantically continuous" with the others but need not be continuous at the pixel level.
After StoryGAN, follow-up works appeared, including CP-CSV (adds a segmentation mask: better character consistency), DUCO, and VLC (adds an auxiliary captioning network: better text-image relevance). They all still used StoryGAN as the backbone, and progress came from attaching additional model components to it.
Meanwhile, most of these previous works have issues with generated image quality; in the example image above, the pictures look smeared.
Moreover, going beyond plain "Text To Image Generation" to "Story Visualization" requires much more attention to the 'consistency' of the storyline. For example, the characters, background, and objects appearing in the current scene of the story all have to stay consistent, which clearly makes it a harder task.

๋…ผ๋ฌธ์—์„œ ๋ฌด์—‡์„ ์ œ์•ˆํ•˜๋‚˜?

The paper proposes several new methods.
  1. (New) Sentence Representation: to maintain global consistency, it proposes a sentence representation, a vector that carries information about the sentences in the story.
  2. (New) Discriminator w/ Fusion features: it proposes a new discriminator that gives the image generator more detailed feedback.
  3. Word-Level Spatial Attention: instead of attention that only links an image to the words of its own sentence one-to-one, attention is applied between each image and the sentences of the whole story (at the story level, not just the image-sentence level).
With these three methods, the model generates images of much higher quality than earlier models built on the StoryGAN backbone.
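
As a rough illustration of the third item, here is a minimal sketch of attention between image regions and every word in the story, rather than only the words of the matching sentence. The official code is not available, so the function name `word_level_spatial_attention`, the dot-product scoring, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def word_level_spatial_attention(regions, story_words):
    """Attend from each image region to every word in the story.

    regions: list of region feature vectors (hypothetical, dimension D).
    story_words: word embeddings pooled from ALL sentences (dimension D).
    Returns (contexts, weights): one context vector and one weight list
    per region.
    """
    dim = len(story_words[0])
    contexts, weights = [], []
    for region in regions:
        # Assumed scoring: plain dot product between region and word.
        w = softmax([dot(region, word) for word in story_words])
        # Context = attention-weighted sum of the word embeddings.
        ctx = [sum(wi * word[d] for wi, word in zip(w, story_words))
               for d in range(dim)]
        contexts.append(ctx)
        weights.append(w)
    return contexts, weights


# Toy example: 2 image regions, 3 words pooled from the whole story.
regions = [[1.0, 0.0], [0.0, 2.0]]
story_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, weights = word_level_spatial_attention(regions, story_words)
print(weights)
```

Because the word list spans the whole story, a region in one frame can pick up a character or object name that was only mentioned in an earlier sentence, which is the intuition behind story-level attention.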

Pororo-SV

The dataset used in this paper is Pororo-SV, which turns Pororo cartoon videos into sequences of 64x64 images and performs VQA on those images.

Is there a GitHub repo?

notion image
The paper links to a repository (Word-Level-Story-Visualization, by mrlibw, updated Aug 8, 2022), but as the image above shows, it is empty (ㅠㅠ)

Word-Level Fine-Grained Story Visualization

๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜

notion image
Like other papers in this area, this paper uses StoryGAN as the backbone for image generation.
On top of that, it adds two components:
  • Global Sentence Vectors
  • Fine-grained Word Embeddings
Unlike other papers, it does not modify the StoryGAN structure itself (for example, by attaching extra modules).
The spatial attention output is summed with the existing hidden vector to form the modified input, and an Image Discriminator and a Story Discriminator are then attached to provide the loss.
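
The summation step can be pictured as a residual-style fusion: the attention output is added element-wise to the hidden features, so the shape passed to the next stage is unchanged. This is only a sketch; `fuse_by_summation` and the toy values are hypothetical, and the real model presumably operates on feature maps inside StoryGAN rather than on small Python lists.

```python
def fuse_by_summation(hidden, attended):
    """Element-wise sum of hidden features and attention output.

    Both inputs are lists of per-region vectors with identical shapes
    (an assumption of this sketch).
    """
    if len(hidden) != len(attended):
        raise ValueError("hidden and attended need the same number of regions")
    fused = []
    for h_vec, a_vec in zip(hidden, attended):
        if len(h_vec) != len(a_vec):
            raise ValueError("feature dimensions must match")
        fused.append([h + a for h, a in zip(h_vec, a_vec)])
    return fused


# Toy values: 2 regions with 2-dimensional features each.
hidden = [[0.5, 1.0], [2.0, -1.0]]
attended = [[0.25, 0.0], [1.0, 1.0]]
fused = fuse_by_summation(hidden, attended)
print(fused)
```

Summation keeps the feature dimension fixed, which fits the post's point that the word-level information is injected without changing the StoryGAN structure.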

Sentence Representation with Word Information
