MLC LLM

Tags: NLP, MLDL Framework
Published: August 22, 2023

Installation

Installing the MLC-LLM package

  • A Conda virtual environment is required! ⭐
# clone the repository
git clone git@github.com:mlc-ai/mlc-llm.git --recursive
# enter the root directory of the repo
cd mlc-llm
# install mlc-llm
pip install .
python3 -m mlc_llm.build --help  # Verify

If you get a "No module named 'tvm'" error

# pip install -I mlc_ai_nightly -f https://mlc.ai/wheels
pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels
pip install -U pytest
conda install -c conda-forge gcc

(mlc-llm) w3090 :: ~/coding/mlc-llm » python -c "import tvm; print(tvm.cuda().exist)"
True
As above, the CUDA check should print True.

Model conversion

python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target cuda --quantization q4f16_1
python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target vulkan --quantization q4f16_1

If you get a "TypeError: runtime.Module is not registered via TVM_REGISTER_NODE_TYPE" error

python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target cuda --quantization q4f16_1 --use-cache=0  # run without the cache
python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target vulkan --quantization q4f16_1 --use-cache=0
python3 -m mlc_llm.build --hf-path kfkas/Llama-2-ko-7b-Chat --target vulkan --quantization q4f16_1 --use-cache=0

gcc: fatal error: cannot execute 'cc1plus': execvp: No such file or directory

  • Caused by g++ and gcc having mismatched versions
(mlc-llm) w3090 :: ~/coding/mlc-llm » g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(mlc-llm) w3090 :: ~/coding/mlc-llm » gcc --version
gcc (conda-forge gcc 13.1.0-0) 13.1.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
conda install -c conda-forge gxx  # g++ is installed via the gxx package

/usr/local/cuda-11.8/bin/../targets/x86_64-linux/include/crt/host_config.h:132:2: error: #error -- unsupported GNU version! gcc versions later than 11 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

  • GCC and G++ versions 10 or lower are required.
conda install gcc=10.4.0 gxx=10.4.0 -y
Success log
(mlc-llm) w3090 :: ~/coding/mlc-llm 2 » python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target cuda --quantization q4f16_1 --use-cache=0
Weights exist at dist/models/llama-2-ko-7b, skipping download.
Using path "dist/models/llama-2-ko-7b" for model "llama-2-ko-7b"
Target configured: cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Finish computing and quantizing weights.
Total param size: 3.5929031372070312 GB
Start storing to cache dist/llama-2-ko-7b-q4f16_1/params
[0327/0327] saving param_326
All finished, 116 total shards committed, record saved to dist/llama-2-ko-7b-q4f16_1/params/ndarray-cache.json
Finish exporting chat config to dist/llama-2-ko-7b-q4f16_1/params/mlc-chat-config.json
fuse_split_rotary_embedding -1
[05:12:01] /workspace/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
[05:12:01] /workspace/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
Save a cached module to dist/llama-2-ko-7b-q4f16_1/mod_cache_before_build.pkl.
Finish exporting to dist/llama-2-ko-7b-q4f16_1/llama-2-ko-7b-q4f16_1-cuda.so
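Once the build finishes, the compiled library and quantized weights under dist/ can be smoke-tested from Python before serving. The snippet below is a minimal sketch assuming the mlc_chat Python API of this period; the model-name lookup against dist/ and the exact ChatModule/generate argument names are assumptions, so check the mlc_chat documentation for your nightly build.

# Minimal local smoke test of the converted model (sketch; API details are assumptions).
from mlc_chat import ChatModule

# Assumption: ChatModule resolves the compiled .so and params under dist/ by model name;
# passing an explicit directory such as "dist/llama-2-ko-7b-q4f16_1/params" may also work.
cm = ChatModule(model="llama-2-ko-7b-q4f16_1")

# Generate a short completion to confirm the CUDA library loads and runs on the GPU.
print(cm.generate(prompt="안녕하세요, 간단히 자기소개를 해주세요."))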
 

Serving the model via the REST API

python3 -m mlc_chat.rest --model dist/Llama-2-ko-7b-Chat-q4f16_1/params
Success log
(mlc-llm) w3090 :: ~/coding/mlc-llm 3 » python3 -m mlc_chat.rest --model dist/Llama-2-ko-7b-Chat-q4f16_1/params
INFO: Started server process [1235069]
INFO: Waiting for application startup.
System automatically detected device: cuda
Using model folder: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/params
Using mlc chat config: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/params/mlc-chat-config.json
Using library model: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/Llama-2-ko-7b-Chat-q4f16_1-cuda.so
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
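Once Uvicorn is listening on http://127.0.0.1:8000, the server can be queried over HTTP. The sketch below uses Python requests and assumes an OpenAI-style /v1/chat/completions route and payload; the route name and request schema are assumptions, so check the MLC chat REST documentation (or the server's interactive /docs page, if it is FastAPI-based) for the actual paths.

# Query the running mlc_chat REST server (sketch; endpoint path and payload shape are assumptions).
import requests

payload = {
    "messages": [{"role": "user", "content": "한국어로 간단히 자기소개를 해주세요."}],
    "stream": False,
}
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json())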

If you get an "Argument split_rotary.cos_h.shape[0] has an unsatisfied constraint: -1 == T.Cast("int32", split_rotary_cos_h_shape[0])" error

  • Related issue:
  • Not yet resolved 😂