ZeRO-Infinity

Tags
Paper review
MLDL Framework
Published
May 30, 2021

Related DeepSpeed topics... (ZeRO, ZeRO-2, Megatron-LM)

DeepSpeed Inference @ 2021.05.24

Speeding up inference for huge models
notion image
  • Huge models have two big issues:
    • Inference itself is slow.
    • Even just for inference, you need GPUs with a lot of VRAM.
  • DeepSpeed Inference?
    • Make fast inference possible in multi-GPU environments!
    • Parallelism adapted for inference
    • Inference-optimized CUDA kernels
    • 8-bit quantize-aware training
    • Together, these improve latency by 2-4x and cut cost to 1/3-1/6.
  • Affordable, fast, and accurate training
    • Compressed training
      • coarse-grained sparsity in Transformer layers via Progressive Layer Dropping
    • 1-bit LAMB
      • Similar in spirit to 1-bit Adam: only the direction (up or down) of each update is communicated.
      • Since the LR is tiny and fixed anyway, they argue the magnitude doesn't need to be communicated.
    • DeepSpeed Profiler performance tool
  • Inference-adapted parallelism
    • Models trained with DeepSpeed or Huggingface can be loaded directly into DeepSpeed Inference!
    • DeepSpeed Inference automatically splits the model and runs it in parallel 😍
      • Model parallelism: currently supported
      • Pipeline parallelism: not yet supported (TBD)
  • Inference-optimized kernels
    • Deep fusion
      • Unlike other fusers that only handle element-wise operations, it fuses element-wise operations, matrix multiplications, transpositions, and reductions → fewer kernels launched on the GPU (and fewer memory accesses). A toy fusion sketch follows this list.
    • Inference-customized GeMM
      • GeMM performs poorly at small batch sizes(?)
      • About 20% faster than NVIDIA's cuBLAS (measured at batch sizes 1-10)
    • Generic and specialized Transformer kernels
      • notion image
      • Custom kernels!
      • Each computation stage of the Transformer layer was sped up (HOW? 🧐)
        • LayerNorm, Softmax, and bias-add are reimplemented specifically for DeepSpeed
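As a rough illustration of what fusion buys, here is a TorchScript toy (an assumption for illustration only, not DeepSpeed's hand-written CUDA kernels): fusing a chain of element-wise ops into one compiled function avoids materializing intermediates and cuts kernel launches.

import torch

# Toy illustration only: TorchScript can fuse this chain of element-wise ops,
# so the GPU launches fewer kernels and touches memory less often.
# DeepSpeed's deep fusion goes further (GeMMs, transpositions, reductions).
@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    y = x + bias
    return 0.5 * y * (1.0 + torch.erf(y * 0.7071067811865476))

x = torch.randn(8, 1024)
bias = torch.randn(1024)
print(fused_bias_gelu(x, bias).shape)  # torch.Size([8, 1024])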

Performance comparison

  • a single NVIDIA V100 Tensor Core GPU with generic and specialized Transformer kernels
  • PyTorch Baseline < DS-Generic < DS-Specialized
  • Lower inference latency is better!
notion image
  • DS-Inference automatically re-slices the model to the inference MP degree, regardless of the MP (model parallel) size used during training!
  • End-to-End GPT NEO 2.7B Inference
    • deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py
      # Filename: gpt-neo-2.7b-generation.py
      import os

      import deepspeed
      import torch
      from transformers import pipeline

      local_rank = int(os.getenv('LOCAL_RANK', '0'))
      world_size = int(os.getenv('WORLD_SIZE', '1'))

      generator = pipeline('text-generation',
                           model='EleutherAI/gpt-neo-2.7B',
                           device=local_rank)

      deepspeed.init_inference(generator.model,
                               mp_size=world_size,
                               dtype=torch.float,
                               replace_method='auto')

      string = generator("DeepSpeed is", do_sample=True, min_length=50)
      if torch.distributed.get_rank() == 0:
          print(string)
      # Output
      [{'generated_text': 'DeepSpeed is a blog about the future. We will consider the future of work, the future of living, and the future of society. We will focus in particular on the evolution of living conditions for humans and animals in the Anthropocene and its repercussions'}]
      notion image
    • In the code above as well, latency drops further as the custom kernels and additional GPUs are used (an fp16 variant is sketched below).
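A hedged variant of the init_inference call above (reusing the names from that script, and assuming the model tolerates fp16): passing dtype=torch.half enables the fp16 Transformer kernels, which is where the larger speedups usually come from.

# Same script as above, only the dtype changes (assumption: fp16 is acceptable).
deepspeed.init_inference(generator.model,
                         mp_size=world_size,
                         dtype=torch.half,
                         replace_method='auto')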

GPUs

  • The number of GPUs needed naturally goes down as well.
  • The 17B-param model is the most dramatic case...
    • The 1.5B model already fit on a single GPU to begin with.
notion image
  • 17B model inference time: roughly cut in half?
notion image
  • Accuracy barely drops even with 8-bit quantization. (Surprising.)
    • notion image
    • NoQAT: base fp16
    • basic qat: fp16 โ†’ int8
    • MoQ: dynamically reduces precision through a predefined schedule
      • Training starts at 16-bit and is gradually reduced to 8-bit.
      • Configurable through the DeepSpeed config JSON file (a hookup sketch follows this list).
        {
          "optimizer": {
            "type": "AdamW",
            "params": {
              "lr": 2e-5,
              "weight_decay": 0.0,
              "bias_correction": true
            }
          },
          "gradient_clipping": 1.0,
          "fp16": {
            "initial_scale_power": 16,
            "enabled": true
          },
          "quantize_training": {
            "enabled": true,
            "quantize_verbose": true,
            "quantizer_kernel": true,
            "quantize-algo": {
              "q_type": "symmetric"
            },
            "quantize_bits": {
              "start_bits": 16,
              "target_bits": 8
            },
            "quantize_schedule": {
              "quantize_period": 400,
              "schedule_offset": 0
            },
            "quantize_groups": 8
          }
        }
      • Quantization with dynamic schedule using second-order information (Eigenvalue)
        • During training, eigenvalues are computed per layer as a proxy for how sensitive each layer is.
        • After computing each layer's eigenvalue, a sufficiently long quantize_period is allowed before dropping to 8-bit.
        • Training gets much slower... eigenvalues have to be computed for every layer.
        • With fp16, eigenvalues can come out as NaN/Inf ⇒ treated as 1.
        • With a 1-bit optimizer, quantize_period can blow up early in training; keep the default small.
        • Computing eigenvalues does not necessarily improve accuracy.
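A minimal hookup sketch (the filename, batch size, and output directory are assumptions, and a model/dataset are assumed to exist elsewhere): the MoQ settings live entirely in the DeepSpeed config, so on the Huggingface side you only point TrainingArguments at that JSON.

# Sketch only: assumes the MoQ config above was saved as ds_config_moq.json
# and that a CUDA GPU is available (MoQ starts from fp16 weights).
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="moq-out",
    per_device_train_batch_size=16,
    fp16=True,
    deepspeed="ds_config_moq.json",  # the quantize_training config shown above
)
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()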

Compressed training with Progressive Layer Dropping

notion image
  • Training is 2.5x faster while accuracy stays the same? 🧐
  • Sparse training
    • "sparsely updating model weights while achieving comparable accuracy of dense training"
  • aka "Progressive Layer Dropping"
  • dynamically switches off Transformer layers during each iteration based on a progressive schedule that accounts for model sensitivity along both the temporal and depth dimensions
  • In other words, layers (and heads) are switched on and off, like a light bulb, according to their sensitivity (a toy bypass sketch follows this list).
    • Switched-off sections are simply bypassed.
    • The effect is that the network becomes shallower.
  • On average, about a 24% reduction in training time.
  • For each input, only a part of the model gets updated; in effect it works like a kind of dropout.
  • Combined with a Pre-LN Transformer architecture → an even higher LR can be used.
    • Comparable accuracy with only about half the dataset (because of the higher LR).
  • Progressive Layer Drop + Pre-LN Transformer architecture ⇒ about a 2.8x speedup.
    • notion image
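A toy sketch of the bypass mechanic under stated assumptions (the schedule constants and the stand-in module are placeholders, not DeepSpeed's implementation): each layer is skipped with a probability that grows over training steps and with depth, and a skipped layer simply passes its input through.

import math

import torch
import torch.nn as nn

class BypassableLayer(nn.Module):
    """A Transformer layer that can be switched off (bypassed) per iteration."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4)

    def forward(self, x: torch.Tensor, keep_prob: float) -> torch.Tensor:
        if self.training and torch.rand(()).item() > keep_prob:
            return x  # layer is "off" for this iteration: pure bypass
        return self.block(x)

def keep_prob(step: int, depth: int, num_layers: int,
              theta_min: float = 0.5, gamma: float = 1e-4) -> float:
    # Assumed progressive schedule: the global keep rate decays from 1.0 toward
    # theta_min over time, and deeper layers are dropped more aggressively.
    theta_t = theta_min + (1.0 - theta_min) * math.exp(-gamma * step)
    return 1.0 - (depth / num_layers) * (1.0 - theta_t)

layers = nn.ModuleList([BypassableLayer() for _ in range(12)])
x = torch.randn(16, 2, 256)  # (seq_len, batch, d_model)
for i, layer in enumerate(layers, start=1):
    x = layer(x, keep_prob(step=1000, depth=i, num_layers=12))
print(x.shape)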

1-bit LAMB

notion image
  • Much better inter-GPU communication efficiency: communication cost drops to roughly 1/5.
  • Overall training speed: about 3x faster.
  • Numbers are for 1-bit LAMB with the NCCL backend.
    • NCCL-based implementation requires PyTorch >= 1.8 (and NCCL >= 2.8.3 when you have 64 or more GPUs).
    • Currently the MPI-based implementation is not compatible with pipeline parallelism.
    • Frequent checkpoint loading could hurt 1-bit LAMBโ€™s convergence.
    • These caveats still apply!
The config looks like this (DeepSpeed config); note that train_micro_batch_size_per_gpu 64 × 1024 GPUs = train_batch_size 65536. A wiring sketch follows the config.
{
  "train_batch_size": 65536,
  "train_micro_batch_size_per_gpu": 64,
  "optimizer": {
    "type": "OneBitLamb",
    "params": {
      "lr": 11e-3,
      "max_coeff": 0.3,
      "min_coeff": 0.01,
      "freeze_step": 1000,
      "cuda_aware": false,
      "comm_backend_name": "nccl",
      "coeff_beta": 0.9,
      "factor_max": 4.0,
      "factor_min": 0.5,
      "factor_threshold": 0.1
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  }
}
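A minimal wiring sketch (the stand-in model and config filename are assumptions): 1-bit LAMB is selected entirely by the config above, so the training script just makes the usual deepspeed.initialize call.

# Sketch only: assumes the config above was saved as onebit_lamb_config.json
# and that this runs under the deepspeed launcher on fp16-capable GPUs.
import deepspeed
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for a real BERT-scale model
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="onebit_lamb_config.json",
)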

DeepSpeed, ZeRO-Infinity @ 2021.04.19

With ZeRO alone you can go from 1 GPU to thousands of GPUs, and it even pulls NVMe storage into training. *The software is integrated into ZeRO-3.
  • (In theory) it can train a 30-trillion-parameter model.
  • 25 petaflops of sustained throughput on 512 NVIDIA V100 GPUs
  • You can fine-tune GPT-3 on a single GPU (V100, 32GB)...?
  • Supported starting from PyTorch-Lightning 1.2 and Huggingface Transformers 4.2+
    • Full integration actually arrived in Huggingface Transformers 4.6.0+

THIS IS ZeRO

๐Ÿ’ก
Honestly, this one figure is probably all you need.
notion image
  • In other words, it seems they have already confirmed that 32 trillion params runs on 32 DGX-2 nodes.
  • The figure that shows up everywhere: ZeRO stages 1, 2, 3 (in that order...).
notion image
  • NVMe offload, introduced with ZeRO-3.
    • The idea is to use storage as if it were memory... (see the config sketch below this list)
notion image
  • In the end, you can think of it as roughly one layer / one head sitting on one GPU at a time.
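A config sketch for that NVMe offload path, expressed as a Python dict for consistency with the other snippets (the nvme_path and batch size are placeholders; the keys follow the ZeRO-3 offload sections of the DeepSpeed config):

# Sketch only: ZeRO-3 with optimizer states and parameters offloaded to NVMe.
import json

zero_infinity_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}

with open("ds_zero_infinity.json", "w") as f:
    json.dump(zero_infinity_config, f, indent=2)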
๐Ÿ’ก
The NVIDIA DGX-2 node consists of 16 V100-32GB GPUs along with 1.5 TB of CPU memory and 20 TB of usable NVMe storage
  • Model sizes trainable on a single DGX-2 with the distributed setups below.
  • Max model size trainable in 32GB of VRAM alone = 1.4 billion params.
  • Even a single node can train 13B with CPU offload.
notion image
  • Huggingface with DeepSpeed: training is reportedly about 3x faster (a hedged launch example follows).
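A hedged example of how that combination is usually launched (the script, model, and dataset names are placeholders; the transformers example scripts accept a --deepspeed flag pointing at a DeepSpeed config such as the one sketched above):

deepspeed --num_gpus 8 run_mlm.py --model_name_or_path roberta-base --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --output_dir out --deepspeed ds_zero_infinity.json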

Refs