
Deploy LLM with HF's TGI

remove snap

apt purge snapd
apt autoremove --purge snapd
rm -fr /var/snap/*

install docker and nvidia-container-toolkit

apt install docker.io
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit

test with TGI

TGI stands for Text Generation Inference, from Hugging Face.

Intro: https://huggingface.co/docs/text-generation-inference/index

Code: https://github.com/huggingface/text-generation-inference

volume=$HUGGINGFACE_HUB_CACHE
model=HuggingFaceH4/zephyr-7b-beta
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id $model

Start a new terminal and submit a request
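A minimal sketch of such a request in Python with the requests package (assumed installed), calling TGI's /generate endpoint on the host port mapped above:

import requests

# TGI listens on port 80 inside the container, which was mapped to 8080 on the host
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 64},
    },
)
print(resp.json())  # {"generated_text": "..."}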

Deploy an OpenAI-compatible LLM with vLLM

The model we are going to use as a demo:

https://huggingface.co/WizardLM/WizardLM-13B-V1.2

Install packages

# vllm 0.2.2 will NOT work with fschat==0.2.33; downgrade fschat to 0.2.23

export HUGGINGFACE_HUB_CACHE=/data/models
pip install vllm==0.2.2 fschat==0.2.23

Deploy WizardLM/WizardLM-13B-V1.2 to 4 Tesla V100 GPUs

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
--model WizardLM/WizardLM-13B-V1.2 --tensor-parallel-size 4

The output will look like:

INFO 11-26 09:01:11 api_server.py:638] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, model='WizardLM/WizardLM-13B-V1.2', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2023-11-26 09:01:12,911 WARNING services.py:1996 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-11-26 09:01:13,063 INFO worker.py:1673 -- Started a local Ray instance.
INFO 11-26 09:01:13 llm_engine.py:72] Initializing an LLM engine with config: model='WizardLM/WizardLM-13B-V1.2', tokenizer='WizardLM/WizardLM-13B-V1.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=None, seed=0)
INFO 11-26 09:01:58 llm_engine.py:207] # GPU blocks: 2474, # CPU blocks: 1310
INFO:     Started server process [3863]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

The client to call the LLM

pip install -U openai==1.3.5
#import openai
#openai.api_key = "EMPTY"
#openai.base_url = 'http://<service>.<namespace>.svc.cluster.local/v1/'

import json
import openai
from openai import OpenAI

assert openai.__version__ == "1.3.5"

client = OpenAI(api_key = "EMPTY",
                base_url = 'http://wayspot-util-service.wayspot-nomination.svc.cluster.local/v1/')


models = json.loads(client.models.list().json())
print(models)
model_id = models['data'][0]['id']  # "WizardLM/WizardLM-13B-V1.2"
completion = client.completions.create(model=model_id,
                                       prompt="plan a 3 day trip to Tokyo")

print(completion.choices[0].text)
print(dict(completion).get('usage'))
print(completion.model_dump_json(indent=4))


response = client.chat.completions.create(
  model=model_id,
  response_format={ "type": "json_object" },
  messages=[
    {"role": "system", "content": "You are a helpful assistant to write python code for me."},
    {"role": "user", "content": "implement an application in python that prints 'hello,world'"}
  ]
)

print(response.choices[0].message.content)

Install LLaMA2-Accessory

Simple Demo

git clone https://github.com/Alpha-VLLM/LLaMA2-Accessory.git
cd LLaMA2-Accessory
conda create -n accessory python=3.10 -y
conda activate accessory
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
conda install -c nvidia cuda-nvcc


git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./
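A quick sanity check, assuming the installs above finished without errors (module names as published by flash-attn and NVIDIA apex):

import torch
import flash_attn  # installed via pip install flash-attn
import apex        # installed from the NVIDIA/apex source tree

print(torch.__version__, torch.cuda.is_available())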

Ray Actor Pool

Simple Demo

import ray
import torch
from ray.util.actor_pool import ActorPool
from ray.util.accelerators import NVIDIA_TESLA_V100

ray.init(ignore_reinit_error=True, include_dashboard=False)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float16 if DEVICE == "cuda" else torch.float32

NUM_GPU = 4

@ray.remote(num_cpus=1, num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
class MyWorker:
    def predict(self, payload):
        """Use the GPU here, for example, load a model onto it and run inference."""
        return "predicted"

# one actor per GPU
MODEL_POOL = ActorPool([MyWorker.remote() for _ in range(NUM_GPU)])

payloads = ["payload-1", "payload-2", "payload-3"]  # whatever needs to be processed

for payload in payloads:
    MODEL_POOL.submit(lambda actor, value: actor.predict.remote(value), payload)

results = []
for _ in range(len(payloads)):
    try:
        results.append(MODEL_POOL.get_next())
    except Exception:
        pass  # skip payloads that failed

print(results)

monitor screen remotely

1. Server

To be installed on the Windows machine whose screen will be viewed remotely.

Note that the file must be named 'xxx.pyw' because we run it with pythonw.exe; with pythonw, no terminal window pops up.
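For reference, a minimal sketch of such a server, assuming Flask and Pillow are installed (pip install flask pillow); the file name screen_server.pyw and port 8000 are arbitrary choices here. It streams the desktop as MJPEG, so any browser on the LAN can view it at http://<host>:8000/ .

import io
import time

from flask import Flask, Response
from PIL import ImageGrab  # ImageGrab.grab() captures the full screen on Windows

app = Flask(__name__)

def frames(fps=2):
    # grab the screen a few times per second and encode each frame as JPEG
    while True:
        buf = io.BytesIO()
        ImageGrab.grab().save(buf, format="JPEG", quality=70)
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + buf.getvalue() + b"\r\n")
        time.sleep(1 / fps)

@app.route("/")
def stream():
    # MJPEG stream: the browser keeps replacing the image with each new frame
    return Response(frames(), mimetype="multipart/x-mixed-replace; boundary=frame")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000, threaded=True)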

To make the app start automatically when Windows boots, follow these steps:

  1. Press Win+R to open the Run window
  2. Input the command 'shell:startup'
  3. Copy the xxx.pyw file into the Startup folder that opens

Reference: https://support.microsoft.com/en-us/windows/add-an-app-to-run-automatically-at-startup-in-windows-10-150da165-dcd9-7230-517b-cf3c295d89dd

install LightGBM-GPU on Ubuntu

1. libs

apt update && apt install -y cmake ocl-icd-opencl-dev libboost-all-dev
export LIBOPENCL=/usr/local/nvidia/lib64
mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

2. repo

git clone --recursive https://github.com/Microsoft/LightGBM 
cd LightGBM && mkdir build && cd build
cmake -DUSE_GPU=1 ..
make -j4
pip uninstall lightgbm
cd .. ; bash ./build-python.sh install --gpu

# if "import lightgbm as lgbm" fails, install a newer libstdc++:
conda install -c conda-forge libstdcxx-ng=12

3. test

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import lightgbm as lgbm

X,y = make_classification(n_samples=2000000, n_features=100, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = lgbm.LGBMClassifier(device="gpu")  # 5.3s
#model = lgbm.LGBMClassifier()  # 9.83s, 10CPU cores
model.fit(X_train, y_train)
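To sanity-check the fit, the held-out split created above can be scored with the scikit-learn API that LGBMClassifier exposes:

print(model.score(X_test, y_test))  # mean accuracy on the 25% held-out split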