Installation
Installing the MLC-LLM package
- A Conda virtual environment is required! ⭐ For example:
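A minimal environment-setup sketch: the environment name matches the (mlc-llm) prompt used in the logs below, while the Python version is an assumption.

# create and activate a dedicated environment (python=3.10 is an assumption)
conda create -n mlc-llm python=3.10 -y
conda activate mlc-llm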
# clone the repository
git clone git@github.com:mlc-ai/mlc-llm.git --recursive
# enter the root directory of the repo
cd mlc-llm
# install mlc-llm
pip install .
python3 -m mlc_llm.build --help  # Verify
No module named 'tvm'
If you hit this error:
- https://mlc.ai/package/ ← pick the wheel that matches your OS and CUDA version here.
# pip install -I mlc_ai_nightly -f https://mlc.ai/wheels
pip install --pre --force-reinstall mlc-ai-nightly-cu118 mlc-chat-nightly-cu118 -f https://mlc.ai/wheels
pip install -U pytest
conda install -c conda-forge gcc
(mlc-llm) w3090 :: ~/coding/mlc-llm » python -c "import tvm; print(tvm.cuda().exist)"
True
The CUDA check must print True as shown above.
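Since the vulkan target is also used in the conversion commands below, the same kind of check can be run for Vulkan; this assumes the installed wheel was built with Vulkan support.

python -c "import tvm; print(tvm.vulkan().exist)"  # should print True if the runtime has Vulkan enabled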
Model conversion
python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target cuda --quantization q4f16_1
python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target vulkan --quantization q4f16_1
TypeError: runtime.Module is not registered via TVM_REGISTER_NODE_TYPE
If this error occurs, rebuild with the cache disabled:
python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target cuda --quantization q4f16_1 --use-cache=0  # run without the cache
python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target vulkan --quantization q4f16_1 --use-cache=0
python3 -m mlc_llm.build --hf-path kfkas/Llama-2-ko-7b-Chat --target vulkan --quantization q4f16_1 --use-cache=0
gcc: fatal error: cannot execute 'cc1plus': execvp: No such file or directory
This error happens because the g++ and gcc versions do not match:
(mlc-llm) w3090 :: ~/coding/mlc-llm » g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(mlc-llm) w3090 :: ~/coding/mlc-llm » gcc --version
gcc (conda-forge gcc 13.1.0-0) 13.1.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
conda install -c conda-forge gxx  # g++ is provided by the gxx package
/usr/local/cuda-11.8/bin/../targets/x86_64-linux/include/crt/host_config.h:132:2: error: #error -- unsupported GNU version! gcc versions later than 11 are not supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
- As the error above states, the CUDA toolchain rejects GCC/G++ newer than 11, so downgrade both compilers; 10.4 is used here:
conda install gcc=10.4.0 gxx=10.4.0 -y
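After the downgrade, both compilers should report the same 10.x version and resolve to the conda environment rather than the system toolchain.

gcc --version | head -n 1  # expect gcc ... 10.4.0
g++ --version | head -n 1  # expect g++ ... 10.4.0
which gcc g++              # both should point into the conda env, e.g. .../envs/mlc-llm/bin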
Success log
(mlc-llm) w3090 :: ~/coding/mlc-llm 2 » python3 -m mlc_llm.build --hf-path beomi/llama-2-ko-7b --target cuda --quantization q4f16_1 --use-cache=0
Weights exist at dist/models/llama-2-ko-7b, skipping download.
Using path "dist/models/llama-2-ko-7b" for model "llama-2-ko-7b"
Target configured: cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Finish computing and quantizing weights.
Total param size: 3.5929031372070312 GB
Start storing to cache dist/llama-2-ko-7b-q4f16_1/params
[0327/0327] saving param_326
All finished, 116 total shards committed, record saved to dist/llama-2-ko-7b-q4f16_1/params/ndarray-cache.json
Finish exporting chat config to dist/llama-2-ko-7b-q4f16_1/params/mlc-chat-config.json
fuse_split_rotary_embedding -1
[05:12:01] /workspace/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
[05:12:01] /workspace/tvm/include/tvm/topi/transform.h:1076: Warning: Fast mode segfaults when there are out-of-bounds indices. Make sure input indices are in bound
Save a cached module to dist/llama-2-ko-7b-q4f16_1/mod_cache_before_build.pkl.
Finish exporting to dist/llama-2-ko-7b-q4f16_1/llama-2-ko-7b-q4f16_1-cuda.so
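To confirm the build produced everything the success log mentions, the output directory can be inspected; the expected file names below are taken from that log.

ls dist/llama-2-ko-7b-q4f16_1
# expect: params/  mod_cache_before_build.pkl  llama-2-ko-7b-q4f16_1-cuda.so
ls dist/llama-2-ko-7b-q4f16_1/params | head
# expect: mlc-chat-config.json  ndarray-cache.json  and the saved param shards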
Serving the model via a REST API
python3 -m mlc_chat.rest --model dist/Llama-2-ko-7b-Chat-q4f16_1/params
Success log
(mlc-llm) w3090 :: ~/coding/mlc-llm 3 » python3 -m mlc_chat.rest --model dist/Llama-2-ko-7b-Chat-q4f16_1/params
INFO: Started server process [1235069]
INFO: Waiting for application startup.
System automatically detected device: cuda
Using model folder: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/params
Using mlc chat config: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/params/mlc-chat-config.json
Using library model: /ssd4t/coding-june/mlc-llm/dist/Llama-2-ko-7b-Chat-q4f16_1/Llama-2-ko-7b-Chat-q4f16_1-cuda.so
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
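A quick smoke test from another terminal. The /v1/chat/completions route and request body shown here are assumptions based on the OpenAI-style API this server mimics; check http://127.0.0.1:8000/docs for the routes this build actually exposes.

# hypothetical request shape; adjust to what /docs reports
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "안녕하세요"}], "stream": false}'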
Argument split_rotary.cos_h.shape[0] has an unsatisfied constraint: -1 == T.Cast("int32", split_rotary_cos_h_shape[0])
If this error occurs:
- Following Issue:
- Not resolved yet 😂