ZeRO-Infinity

Related DeepSpeed topics: ZeRO, ZeRO-2, Megatron-LM

DeepSpeed Inference @ 2021.05.24

Speeding up inference for very large models

Very large models raise two main issues.

Inference itself is slow.
Even inference alone requires GPUs with large VRAM.

DeepSpeed Inference?

Make fast inference possible in multi-GPU environments!
Parallelism tailored to inference
Inference-optimized CUDA kernels
8-bit quantization-aware training
Together these cut latency by 2-4x and bring cost down to 1/3-1/6

Affordable, fast, and accurate training

Compressed training

coarse-grained sparsity in Transformer layers via Progressive Layer Dropping

1-bit LAMB

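Similar in spirit to 1-bit Adam: only the direction (up or down) of each update is communicated.
The learning rate is very small and fixed anyway, so reportedly the magnitude does not need to be communicated as well.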
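The "send only up or down" idea can be sketched as sign compression with error feedback. This is a minimal pure-Python toy, not DeepSpeed's implementation: the function name `one_bit_compress`, the choice of mean absolute value as the shared scale, and the sample numbers are all assumptions made for illustration.

```python
def one_bit_compress(values, error):
    """Compress `values` to signs plus one shared scale, carrying the residual forward."""
    # Add back whatever was lost in the previous step (error feedback).
    corrected = [v + e for v, e in zip(values, error)]
    # One scale for the whole tensor: the mean absolute value (an assumed choice).
    scale = sum(abs(c) for c in corrected) / len(corrected)
    # Only the sign (up or down) of each element would go over the network.
    signs = [1 if c >= 0 else -1 for c in corrected]
    # What a receiver can reconstruct from the signs and the scale.
    decompressed = [s * scale for s in signs]
    # The part that was lost is remembered for the next step.
    new_error = [c - d for c, d in zip(corrected, decompressed)]
    return signs, scale, new_error


momentum = [0.30, -0.10, 0.05, -0.45]  # made-up values
error = [0.0] * len(momentum)
signs, scale, error = one_bit_compress(momentum, error)
print(signs, scale, error)
```

Each element costs one bit plus a single shared float, and the residual kept in `error` is what stops the rounding loss from accumulating across steps.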

DeepSpeed Profiler performance tool

Inference-adapted parallelism

Models trained with DeepSpeed or Hugging Face can be loaded into DeepSpeed Inference!
DeepSpeed Inference partitions the model and parallelizes it automatically 😍

Model parallelism: currently supported
Pipeline parallelism: not supported yet (TBD)
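To get a feel for what "splitting the model" means under model parallelism, here is a toy sketch of slicing one linear layer by output columns. It assumes Megatron-style tensor slicing and uses plain Python lists instead of GPUs; `matmul` and `split_columns` are hypothetical helpers, not DeepSpeed functions.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, mp_size):
    """Give each of mp_size ranks an equal, contiguous slice of the output columns."""
    cols = len(w[0])
    assert cols % mp_size == 0, "output dimension must divide evenly across ranks"
    width = cols // mp_size
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(mp_size)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)                   # what a single device would compute
shards = split_columns(w, mp_size=2)  # one weight shard per "GPU"
# Each shard is multiplied independently; concatenation stands in for an all-gather.
gathered = [y for shard in shards for y in matmul(x, shard)]
print(full, gathered)
```

Each rank holds only half of the weights, which is why a model too large for one GPU can still be served, and the two results match.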

Inference-optimized kernels

Deep fusion

Unlike other kernels that only fuse element-wise operations, deep fusion also covers matrix multiplications, transpositions, and reductions → fewer kernel launches on the GPU (and fewer memory accesses)

Inference-customized GeMM

GeMM does not perform well at small batch sizes (?)
About 20% faster than cuBLAS (NVIDIA), measured at batch sizes between 1 and 10

Generic and specialized Transformer kernels

Custom kernels!
Every operation inside the Transformer layer was sped up (HOW? 🧐)

LayerNorm, Softmax, and bias-add are implemented specifically for DeepSpeed

Performance comparison

microsoft/DeepSpeed

DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce latency for inference. To further reduce latency and cost, we introduce inference-customized kernels.

https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/inference-tutorial.md

a single NVIDIA V100 Tensor Core GPU with generic and specialized Transformer kernels

PyTorch Baseline < DS-Generic < DS-Specialized

Lower inference latency is better!

DS-Inference re-partitions the model automatically to match the inference MP degree, regardless of the MP (model parallel) degree used during training!

End-to-End GPT NEO 2.7B Inference


deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py


# Filename: gpt-neo-2.7b-generation.py
import os

import deepspeed
import torch
from transformers import pipeline

# The deepspeed launcher sets LOCAL_RANK and WORLD_SIZE for each process.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B', device=local_rank)

# Assign the returned engine back so the pipeline uses the DeepSpeed-wrapped model.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')

string = generator("DeepSpeed is", do_sample=True, min_length=50)
# Every rank runs generation; print from rank 0 only to avoid duplicate output.
if torch.distributed.get_rank() == 0:
    print(string)


# Output
[{
    'generated_text': 'DeepSpeed is a blog about the future. We will consider the future of work, the future of living, and the future of society. We will focus in particular on the evolution of living conditions for humans and animals in the Anthropocene and its repercussions'
}]

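For the code above as well, latency drops further depending on the kernels and the number of GPUs used.

GPUs

The number of GPUs required also goes down, as expected.

The 17B-parameter model shows the most dramatic change...

That is because the 1.5B model already fits on a single GPU.

17B model inference time: roughly halved?

8-bit quantization barely hurts accuracy. (Surprising.)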

NoQAT: fp16 baseline
Basic QAT: fp16 → int8
MoQ: dynamically reduces precision through a predefined schedule

Training starts at 16 bits and is reduced to 8 bits

Configurable through the DeepSpeed config JSON file


{
    "optimizer": {
      "type": "AdamW",
      "params": {
        "lr": 2e-5,
        "weight_decay": 0.0,
        "bias_correction": true
      }
    },
    "gradient_clipping": 1.0,
    "fp16": {
      "initial_scale_power": 16,
      "enabled": true
    },
    "quantize_training": {
      "enabled": true,
      "quantize_verbose": true,
      "quantizer_kernel": true,
      "quantize-algo": {
        "q_type": "symmetric"
      },
      "quantize_bits": {
        "start_bits": 16,
        "target_bits": 8
      },
      "quantize_schedule": {
        "quantize_period": 400,
        "schedule_offset": 0
      },
      "quantize_groups": 8
    }
}
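To make `"q_type": "symmetric"` and `"quantize_groups"` a bit more concrete, here is a minimal sketch of group-wise symmetric fake quantization in pure Python. It is an illustration under my own assumptions (one scale per group taken from the group's largest absolute value, round-to-nearest), not DeepSpeed's quantizer kernel; `quantize_symmetric` and the sample weights are hypothetical.

```python
def quantize_symmetric(weights, bits, groups):
    """Fake-quantize `weights` group by group and return the dequantized values."""
    assert len(weights) % groups == 0, "weights must split evenly into groups"
    size = len(weights) // groups
    q_max = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    out = []
    for g in range(groups):
        group = weights[g * size:(g + 1) * size]
        # Symmetric: one scale per group and no zero-point, so 0.0 stays exactly 0.0.
        max_abs = max(abs(w) for w in group)
        scale = max_abs / q_max if max_abs > 0 else 1.0
        for w in group:
            q = max(-q_max, min(q_max, round(w / scale)))  # integer code
            out.append(q * scale)                          # back to float
    return out


weights = [0.50, -0.25, 0.10, 0.00, 2.00, -1.00, 0.75, 0.30]  # made-up values
deq8 = quantize_symmetric(weights, bits=8, groups=2)
deq4 = quantize_symmetric(weights, bits=4, groups=2)
err8 = max(abs(a - b) for a, b in zip(weights, deq8))
err4 = max(abs(a - b) for a, b in zip(weights, deq4))
print(err8, err4)
```

Using more groups gives each slice of the weights its own scale, so a few large values in one group do not wash out the resolution of the others.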

Quantization with dynamic schedule using second-order information (Eigenvalue)

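Eigenvalues are computed as a proxy for how sensitive each layer is during training.
After the eigenvalue of each layer is computed, each layer is given a sufficiently long quantize_period before it is reduced to 8 bits.
Training gets much slower... the eigenvalue has to be computed for every layer.
In fp16 the eigenvalue can come out as NaN or inf ⇒ use 1 instead.
With a 1-bit optimizer, quantize_period can grow too large early on. Start from a small default.
Computing eigenvalues does not "necessarily" improve accuracy.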
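As a rough picture of the eigenvalue proxy, the sketch below estimates the dominant eigenvalue of a small symmetric matrix by power iteration and falls back to 1 when the result is not finite, mirroring the NaN/inf note above. The 2x2 matrix standing in for a layer's Hessian and the helper `top_eigenvalue` are assumptions for illustration; how DeepSpeed actually computes and uses the value is not shown here.

```python
import math


def top_eigenvalue(matrix, iterations=100):
    """Estimate the dominant eigenvalue of a symmetric matrix by power iteration."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n  # unit-length starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * mv[i] for i in range(n))  # Rayleigh quotient
        norm = math.sqrt(sum(x * x for x in mv))
        if not math.isfinite(norm) or norm == 0.0:
            return 1.0  # degenerate estimate (NaN, inf, or zero vector): fall back to 1
        v = [x / norm for x in mv]
    return eigenvalue if math.isfinite(eigenvalue) else 1.0


layer_hessian = [[4.0, 1.0],
                 [1.0, 3.0]]  # toy stand-in for one layer's curvature
sensitivity = top_eigenvalue(layer_hessian)
broken = top_eigenvalue([[float("nan"), 0.0], [0.0, 1.0]])
print(sensitivity, broken)
```

Each iteration needs a matrix-vector product per layer, which gives some intuition for why the notes above report that training slows down noticeably.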

Compressed training with Progressive Layer Dropping

Training is 2.5x faster, yet accuracy stays the same? 🧐

Sparse training

"sparsely updating model weights while achieving comparable accuracy of dense training"

aka "Progressive Layer Dropping"

dynamically switches off Transformer layers during each iteration based on a progressive schedule that accounts for model sensitivity along both the temporal and depth dimensions

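In other words, layers and heads are switched on and off (the light-bulb icons) according to their sensitivity.

Switched-off sections are simply bypassed.
The effect is a shallower network.

Roughly 24% less time on average

For each input → only part of the model is updated, i.e. it feels like a form of dropout

When combined with a Pre-LN Transformer architecture → a higher LR can be used

Accuracy is reached with only about half of the dataset. (Because of the higher LR.)

Progressive Layer Drop + Pre-LN Transformer architecture ⇒ about 2.8x speedup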
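The progressive schedule can be sketched as a keep ratio that decays over training steps (temporal dimension) combined with a per-layer keep probability that shrinks with depth. The formula below follows my understanding of the PLD schedule, with `theta` and `gamma` as assumed parameter names and defaults. The exact layer indexing is an assumption, and this toy skips any rescaling of the outputs of active layers.

```python
import math
import random


def keep_ratio(step, theta=0.5, gamma=0.001):
    """Global keep ratio: starts at 1.0 and decays toward theta as training proceeds."""
    return (1.0 - theta) * math.exp(-gamma * step) + theta


def layer_keep_probs(step, num_layers, theta=0.5, gamma=0.001):
    """Per-layer keep probability: deeper layers are dropped more often."""
    theta_t = keep_ratio(step, theta, gamma)
    step_size = (1.0 - theta_t) / num_layers
    return [1.0 - i * step_size for i in range(num_layers)]


def forward(x, layers, probs, rng):
    """Run the layers, bypassing each one with probability 1 - p."""
    for layer, p in zip(layers, probs):
        if rng.random() < p:
            x = layer(x)  # layer is switched on
        # otherwise the layer is switched off and x passes through unchanged
    return x


layers = [lambda x: x + 1] * 4  # toy "layers" that each add 1
probs_start = layer_keep_probs(step=0, num_layers=4)
probs_late = layer_keep_probs(step=10000, num_layers=4)
out_start = forward(0, layers, probs_start, random.Random(0))
print(probs_start, probs_late, out_start)
```

Early in training every layer runs, while later on the deeper layers are skipped more often, which is where the per-iteration time saving comes from.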

1-bit LAMB

Inter-GPU communication efficiency ++: communication cost drops to about 1/5

Overall training speed ++: about 3x faster

Measured with 1-bit LAMB over NCCL.

NCCL-based implementation requires PyTorch >= 1.8 (and NCCL >= 2.8.3 when you have 64 or more GPUs).
Currently the MPI-based implementation is not compatible with pipeline parallelism.
Frequent checkpoint loading could hurt 1-bit LAMB’s convergence.
These issues still remain!

The DeepSpeed config looks like this (a micro-batch of 64 per GPU corresponds to 1024 GPUs when there is no gradient accumulation):


{
  "train_batch_size": 65536,
  "train_micro_batch_size_per_gpu": 64,
  "optimizer": {
    "type": "OneBitLamb",
    "params": {
      "lr": 11e-3,
      "max_coeff": 0.3,
      "min_coeff": 0.01,
      "freeze_step": 1000,
      "cuda_aware": false,
      "comm_backend_name": "nccl",
      "coeff_beta": 0.9,
      "factor_max": 4.0,
      "factor_min": 0.5,
      "factor_threshold": 0.1
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  }
}
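The batch-size fields in the config are tied together by train_batch_size = train_micro_batch_size_per_gpu × gradient accumulation steps × number of GPUs. A small sketch of that arithmetic follows. The helper name and the GPU counts tried are assumptions for illustration.

```python
def gradient_accumulation_steps(train_batch_size, micro_batch_per_gpu, num_gpus):
    """Solve the batch-size relation for the number of accumulation steps."""
    per_step = micro_batch_per_gpu * num_gpus
    if train_batch_size % per_step != 0:
        raise ValueError("train_batch_size must be a multiple of micro_batch * num_gpus")
    return train_batch_size // per_step


for gpus in (64, 256, 1024):
    steps = gradient_accumulation_steps(65536, 64, gpus)
    print(f"{gpus} GPUs -> {steps} accumulation step(s)")
```

With 1024 GPUs no accumulation is needed, and with fewer GPUs the same global batch of 65536 is reached by accumulating gradients over more micro-batches.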