DeepSpeed notes... (ZeRO, ZeRO-2, Megatron-LM)
DeepSpeed Inference @ 2021.05.24
Speeding up inference for huge models
- Huge models have two big issues:
    - Inference itself is slow
    - Inference requires GPUs with a lot of VRAM
- DeepSpeed Inference?
    - Enables fast inference in multi-GPU environments!
    - Parallelism adapted for inference
    - Inference-optimized CUDA kernels
    - 8-bit quantize-aware training
    - Together, these cut latency by 2~4x and cost to 1/3~1/6
- Affordable, fast, and accurate training
    - Compressed training
        - coarse-grained sparsity in Transformer layers via Progressive Layer Dropping
    - 1-bit LAMB
        - Like 1-bit Adam: only up-or-down (the sign) of each update is transmitted (see the sketch below)
        - Since the LR is tiny and fixed anyway, they say skipping the magnitude is fine
    - DeepSpeed Profiler performance tool
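As a rough illustration of the sign-only idea (a minimal sketch; `OneBitCompressor` is a hypothetical helper, not DeepSpeed's actual implementation): 1-bit Adam/LAMB-style compression sends one sign bit per element plus a single scale, and carries the compression error into the next step.

```python
# Minimal sketch of sign-based (1-bit) compression with error feedback,
# the core idea behind 1-bit Adam / 1-bit LAMB. Hypothetical helper class.
import torch

class OneBitCompressor:
    """Keeps the compression residual locally and re-adds it next step."""

    def __init__(self):
        self.error = None  # residual left over from the previous step

    def compress(self, tensor: torch.Tensor):
        if self.error is None:
            self.error = torch.zeros_like(tensor)
        corrected = tensor + self.error           # error feedback
        scale = corrected.abs().mean()            # one shared magnitude
        signs = corrected.sign()                  # 1 bit per element
        self.error = corrected - scale * signs    # remember what was lost
        return scale, signs                       # all that gets communicated

    @staticmethod
    def decompress(scale, signs):
        return scale * signs

comp = OneBitCompressor()
update = torch.randn(8)
scale, signs = comp.compress(update)
print(OneBitCompressor.decompress(scale, signs))
```

The error-feedback residual is what keeps convergence close to the uncompressed optimizer despite sending only signs.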
- Inference-adapted parallelism
    - Models trained with DeepSpeed or Huggingface can be loaded straight into DeepSpeed Inference!
    - DeepSpeed Inference splits the model and parallelizes it for you 🙂
        - Model parallelism: currently supported, OK
        - Pipeline parallelism: not supported yet (TBD)
- Inference-optimized kernels
    - Deep fusion
        - Unlike fusers that handle only element-wise operations, it can fuse element-wise operations, matrix multiplications, transpositions, and reductions → far fewer kernel launches on the GPU (and fewer memory accesses); see the fusion sketch after this list
    - Inference-customized GeMM
        - GeMM performance drops off at small batch sizes. ?
        - ~20% faster than NVIDIA's cuBLAS (measured at batch sizes 1~10)
    - Generic and specialized Transformer kernels
        - Custom kernels!
        - Each computation stage of the Transformer layer was sped up (HOW? 🧐)
        - LayerNorm, Softmax, and bias-add reimplemented specifically for DeepSpeed
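To see why fusion reduces kernel launches, here is a minimal sketch using TorchScript's generic fuser as a stand-in (an assumption for illustration; DeepSpeed's deep-fusion kernels are hand-written and, unlike this, also fuse matmuls, transposes, and reductions):

```python
# Eager mode launches one CUDA kernel per op; a scripted version lets the
# fuser run the same element-wise chain (bias-add + GeLU) as one kernel
# on GPU. CPU execution still works, just without fusion benefits.
import torch
import torch.nn.functional as F

def bias_gelu_eager(x, bias):
    # separate kernels: the add, then the gelu chain
    return F.gelu(x + bias)

@torch.jit.script
def bias_gelu_fused(x, bias):
    # TorchScript can fuse this element-wise chain into a single kernel
    return F.gelu(x + bias)

x = torch.randn(16, 1024)
bias = torch.randn(1024)
assert torch.allclose(bias_gelu_eager(x, bias), bias_gelu_fused(x, bias))
```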
Performance comparison
- On a single NVIDIA V100 Tensor Core GPU, comparing generic and specialized Transformer kernels:
    - PyTorch baseline < DS-Generic < DS-Specialized
    - (lower inference latency is better!)
- DS-Inference re-slices the model to match the inference-time MP degree, regardless of the MP (model parallel) size used during training!
- End-to-End GPT NEO 2.7B Inference
```bash
deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py
```

```python
# Filename: gpt-neo-2.7b-generation.py
import os

import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Build a standard Huggingface pipeline on this rank's GPU
generator = pipeline('text-generation',
                     model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Let DeepSpeed shard the model across world_size GPUs and swap in its
# inference-optimized kernels
deepspeed.init_inference(generator.model,
                         mp_size=world_size,
                         dtype=torch.float,
                         replace_method='auto')

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if torch.distributed.get_rank() == 0:
    print(string)
```

```
# Output
[{'generated_text': 'DeepSpeed is a blog about the future. We will consider the future of work, the future of living, and the future of society. We will focus in particular on the evolution of living conditions for humans and animals in the Anthropocene and its repercussions'}]
```
GPUs
- The number of GPUs needed naturally drops as well.
    - The 17B-param model shows the most dramatic gain...
    - Because 1.5B already fits on a single GPU to begin with.
- 17B model inference time: roughly cut in half?
- Even with 8-bit quantization, accuracy just doesn't drop. (Fascinating.)
    - NoQAT: fp16 baseline
    - Basic QAT: fp16 → int8
    - MoQ: dynamically reduces precision through a predefined schedule
        - Training starts at 16 bits and anneals down to 8 bits (rough schedule sketch below)
        - Quantization with a dynamic schedule using second-order information (eigenvalues)
            - During training, eigenvalues are computed as a proxy for how sensitive each layer is
            - After computing each layer's eigenvalue, a sufficiently long quantize_period is granted before dropping to 8 bits
            - Training gets a lot slower... eigenvalues have to be computed for every layer.
            - In fp16, eigenvalues can become nan/inf → treated as 1
            - With a 1-bit optimizer, the quantize_period value can blow up early in training, so keep the default small.
            - Computing eigenvalues does not 'necessarily' improve accuracy.
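A rough, hypothetical reconstruction of how such a schedule could map training steps to bit-widths (the halving rule below is an assumption, not DeepSpeed's exact decay logic; the parameter names mirror the `quantize_schedule` fields in the config that follows):

```python
# Hedged sketch of an MoQ-style precision schedule: start at start_bits and
# halve toward target_bits every quantize_period steps after schedule_offset.
# With eigenvalue-based sensitivity, the period per layer can stretch instead.
def moq_bits(step: int,
             start_bits: int = 16,
             target_bits: int = 8,
             quantize_period: int = 400,
             schedule_offset: int = 0) -> int:
    if step < schedule_offset:
        return start_bits
    halvings = (step - schedule_offset) // quantize_period
    return max(target_bits, start_bits >> halvings)

for step in (0, 399, 400, 10_000):
    print(step, moq_bits(step))  # precision decays from 16 to 8 bits
```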
This can all be configured via a DeepSpeed JSON config file:

```json
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-5,
      "weight_decay": 0.0,
      "bias_correction": true
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "initial_scale_power": 16,
    "enabled": true
  },
  "quantize_training": {
    "enabled": true,
    "quantize_verbose": true,
    "quantizer_kernel": true,
    "quantize-algo": {
      "q_type": "symmetric"
    },
    "quantize_bits": {
      "start_bits": 16,
      "target_bits": 8
    },
    "quantize_schedule": {
      "quantize_period": 400,
      "schedule_offset": 0
    },
    "quantize_groups": 8
  }
}
```
Compressed training with Progressive Layer Dropping
- Training is ~2.5x faster with comparable accuracy? 🧐
- Sparse training
    - "sparsely updating model weights while achieving comparable accuracy of dense training"
    - aka "Progressive Layer Dropping"
    - dynamically switches off Transformer layers during each iteration based on a progressive schedule that accounts for model sensitivity along both the temporal and depth dimensions
    - i.e., layers and heads get switched on and off (like a light bulb) according to sensitivity.
    - Switched-off spans are simply bypassed
        - The effect is like having fewer layers.
        - On average, roughly 24% less training time
    - Per input → only part of the model is updated *i.e., a kind of Dropout feel (see the sketch after this list)
    - Combined with a Pre-LN Transformer architecture → a higher LR becomes usable
        - Comparable accuracy with only about half the data (thanks to the higher LR)
    - Progressive Layer Dropping + Pre-LN Transformer architecture → about 2.8x faster
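A stochastic-depth-style sketch of the idea (hedged: the exponential schedule below is a plausible instance, not the paper's exact formula): each iteration keeps a layer with a probability that decays over time and shrinks with depth, and a dropped layer is bypassed through its residual connection.

```python
# Progressive Layer Dropping, sketched: keep probability decays over
# training time (temporal dimension) and is lower for deeper layers
# (depth dimension); dropped layers are bypassed via the residual path.
import math
import random

def keep_prob(layer_idx, num_layers, step, gamma=1e-4, theta_bar=0.5):
    # global keep rate decays from 1.0 toward theta_bar over time ...
    theta_t = (1 - theta_bar) * math.exp(-gamma * step) + theta_bar
    # ... and deeper layers are dropped more aggressively
    return 1 - (layer_idx / num_layers) * (1 - theta_t)

def forward_with_pld(x, layers, step):
    for i, layer in enumerate(layers, start=1):
        if random.random() < keep_prob(i, len(layers), step):
            x = x + layer(x)  # residual branch executed
        # else: layer is bypassed entirely (identity through the residual)
    return x

layers = [lambda h: 0.1 * h for _ in range(12)]  # stand-in Transformer blocks
print(forward_with_pld(1.0, layers, step=0))
```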
1-bit LAMB
- GPU communication efficiency ++ : communication cost cut to about 1/5
- Overall training speed ++ : about 3x faster
    - Both figures are for 1-bit LAMB with NCCL.
- NCCL-based implementation requires PyTorch >= 1.8 (and NCCL >= 2.8.3 when you have 64 or more GPUs).
- Currently the MPI-based implementation is not compatible with pipeline parallelism.
- Frequent checkpoint loading could hurt 1-bit LAMBโs convergence.
- ← issues like these still remain!
The config looks like this (DeepSpeed config). Note train_micro_batch_size_per_gpu 64 × 1024 GPUs = train_batch_size 65536:

```json
{
  "train_batch_size": 65536,
  "train_micro_batch_size_per_gpu": 64,
  "optimizer": {
    "type": "OneBitLamb",
    "params": {
      "lr": 11e-3,
      "max_coeff": 0.3,
      "min_coeff": 0.01,
      "freeze_step": 1000,
      "cuda_aware": false,
      "comm_backend_name": "nccl",
      "coeff_beta": 0.9,
      "factor_max": 4.0,
      "factor_min": 0.5,
      "factor_threshold": 0.1
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  }
}
```
DeepSpeed, ZeRO-Infinity @ 2021.04.19
With ZeRO alone, you can go from 1 GPU to thousands of GPUs, and it even drafts NVMe storage into training. *Software-wise, it is integrated into ZeRO-3
- Can (in theory) train a 30-trillion-param model.
- 25 petaflops of sustained throughput on 512 NVIDIA V100 GPUs
- GPT-3 can be fine-tuned on a single GPU (V100, 32GB)...?
- Supported since PyTorch-Lightning 1.2 and Huggingface Transformers 4.2+
    - The fully working integration actually landed in Huggingface Transformers 4.6.0+
THIS IS ZeRO
Honestly, this one figure probably says it all.
- In other words, it seems they have already confirmed 32-trillion-param models running on 32 DGX-2 machines.
- The figure that shows up everywhere: ZeRO stages 1, 2, 3 (in that order..)
- NVMe offload, introduced with ZeRO-3.
    - Trying to use storage as if it were memory..
    - In the end, think of it as loading roughly one layer / one head onto a GPU at a time. (see the offload config sketch at the end of this section)
The NVIDIA DGX-2 node consists of 16 V100-32GB GPUs along with 1.5 TB of CPU memory and 20 TB of usable NVMe storage
- Model sizes trainable on a single DGX-2 with the distributed setups below:
    - On 32GB of VRAM, the max trainable model size = 1.4 billion params
    - With CPU offload, even a single GPU can train 13B (config sketch below)
- Huggingface with DeepSpeed: training is reportedly about 3x faster
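To make the offload options concrete, a hedged config sketch follows; `stage`, `offload_optimizer`, and `offload_param` are real ZeRO-3/ZeRO-Infinity config fields, but the values and the NVMe path are placeholders.

```python
# Hedged sketch of a ZeRO-3 / ZeRO-Infinity style config (values and the
# NVMe path are placeholders, not from the original notes).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition optimizer state, gradients, and parameters
        # CPU offload is what lets ~13B params train on a single 32GB GPU:
        "offload_optimizer": {"device": "cpu"},
        # ZeRO-Infinity additionally spills parameters to NVMe storage:
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
# Passed the same way as a JSON file: deepspeed.initialize(..., config=ds_config)
```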
Refs
- MS blog, ZeRO-Infinity introduction: https://www.microsoft.com/en-us/research/blog/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training/
- MS blog, DeepSpeed Inference introduction: https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/