[LLAMA] window 에서 text-generation-web-ui 4bit 모델 설치

교육/데이터 관련

[LLAMA] window 에서 text-generation-web-ui 4bit 모델 설치

가이버2 2023. 3. 14. 16:43

목표 : window 환경에서 text-generation-web-ui 로 llama 4bit 모델을 돌려보자.

링크 : https://github.com/oobabooga/text-generation-webui

(현재-자동) 설치파일 & 실행방법:

원클릭 설치 다운로드
https://github.com/oobabooga/text-generation-webui/releases/download/installers/oobabooga-windows.zip
다운로드후 압축해제 한뒤 install 파일 클릭하여 설치
모델 다운로드를 원한다면 download-model 더블클릭
프로그램 실행시 start-webui 더블클릭
이후 필요시 call python server.py --cai-chat 부분을 수정하여 원하는 모델을 실행

(기존-수동) 필요한것 :

엔비디아 드라이버 - 설치 생략
miniconda (python 3.10) - 설치 생략
윈도우 2019 빌드툴 (마소 계정 필요) https://visualstudio.microsoft.com/downloads/#remote-tools-for-visual-studio-2022
cuda 117 (엔비디아 계정 필요) https://developer.nvidia.com/cuda-toolkit-archive
cudnn 8.8 (엔비디아 계정 필요) https://developer.nvidia.com/rdp/cudnn-archive
llama 4bit 모델(토렌트) https://rentry.org/llama-tard-v2
(옵션) quant_cuda 파일 https://github.com/oobabooga/text-generation-webui/files/10947842/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl.zip
nvme ssd 에서 하는걸 추천

(기존-wsl) 유의

wsl2 로 설치후 구동시 램 반환이 안되서 1회 사용후 wsl 셧다운 해야됨.
wsl2 설치가 더 쉬우니 wsl2+cuda 설치후 리눅스 환경과 같이 설치해보고 램 누수가 없으면 그대로 사용 권장.
wsl2 램 32gb 제한후 가상메모리 32gb 사용시 속도도 빠르고 메모리도 회수되는걸로 보임. 실행되는 걸 본뒤 wsl 허용램을 조금씩 늘려서 가상메모리사용하지 않을 정도만 할당하고 사용하는것을 추천.
기본 모드
win/wsl 29.23s 6.84 token / 25.05s 7.99 token
채팅 모드
win
Output generated in 12.85 seconds (0.47 tokens/s, 6 tokens)
Output generated in 13.51 seconds (2.74 tokens/s, 37 tokens)
Output generated in 31.52 seconds (4.66 tokens/s, 147 tokens)

wsl
Output generated in 15.28 seconds (0.85 tokens/s, 13 tokens)
Output generated in 14.30 seconds (3.50 tokens/s, 50 tokens)
Output generated in 19.85 seconds (4.28 tokens/s, 85 tokens)

4bit 필요 사양

30B 기준으로 3090 과 32gb 램에 가상메모리 32gb 주시면 될겁니다.

(기존-원클릭 설치시 불필요)설치

윈도우 2019 빌드툴 설치
https://visualstudio.microsoft.com/downloads/#remote-tools-for-visual-studio-2022

2019 클릭후 다운로드 클 (윈도우 계정 필요)

설치(1.69GB -> 6.7GB) - 오래걸림

CUDA 117 설치 ( 드라이버와 같이 쉽게 설치 가능)
cudnn 8.8 설치(cuda 설치폴더에 덮어 씌우기)

llama 원본 + 4int 모델 다운로드

원본(변환된) 모델 폴더 + 4bit 처리모델 다운로드 (용량 매우큰 관계로 필요한 것만 쌍으로 다운로드)

셋팅

x64 Native Tools Command Prompt for VS 2019 실행

명령어 입력(유저이름 부분 수정필)

powershell -ExecutionPolicy ByPass -NoExit -Command "& 'C:\Users\유저이름\miniconda3\shell\condabin\conda-hook.ps1' ; conda activate 'C:\Users\유저이름\miniconda3' "

폴더 이동( cd 명령어 & 유저이름 수정 필)

cd C:\Users\유저이름\Desktop

text-generation-web-ui 설치 (명령어 앞 base -> textgen으로 변경 확인)

conda create -n textgen
conda activate textgen
conda install torchvision torchaudio pytorch-cuda=11.7 git -c pytorch -c nvidia
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

4bit mode를 위한 GPTQ-for-LLaMa 설치

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt

트러블 슈팅(에러발생) - 환경변수 문제

/usr/bin/link: extra operand '/LTCG'
Try '/usr/bin/link --help' for more information.
error: command 'C:\\Users\\room\\miniconda3\\envs\\textgen\\Library\\usr\\bin\\link.exe' failed with exit code 1

해결방법 1 - 미리 컴파일된 파일 설치 - (옵션) quant_cuda 파일 설치후 확인

pip install .\quant_cuda-0.0.0-cp310-cp310-win_amd64.whl

Processing c:\users\room\desktop\text-generation-webui\repositories\gptq-for-llama\quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
Installing collected packages: quant-cuda
Successfully installed quant-cuda-0.0.0

해결방법 2 - 환경변수 세팅후 컴파일 - 생략
llama 원본(변환된) 모델 + 4int 파일 models 파일 안으로 이동(오래걸림)

실행

cd ../..
python server.py --load-in-4bit

결과 확인

Loading llama-30b-hf...
Loading model ...
Done.
Loaded the model in 25.79 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

크롬으로 http://127.0.0.1:7860 접속

실행 확인 (답변이 이상할수도 있음)

실행 명령어 최적화(명령어등을 조합하며 맞게 사용)

python server.py --chat --model llama-30b --load-in-4bit --no-stream

출처 : 트러블 슈팅시 활용

가이드 : https://rentry.org/llama-tard-v2
깃헙위키 : https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode
깃헙위키 : https://github.com/oobabooga/text-generation-webui/wiki/Installation-instructions-for-human-beings
깃헙이슈 : https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/11
특이점 갤러리 : https://gall.dcinside.com/mgallery/board/lists?id=thesingularity
아카라이브 AI 채널 : https://arca.live/b/characterai/71515797?target=all&keyword=llama&p=1
아카라이브 AI 그림채널 : https://arca.live/b/aiart