
Deploy an OpenAI-compatible LLM with vLLM

The model we will use for this demo:

https://huggingface.co/WizardLM/WizardLM-13B-V1.2

Install packages

# vllm 0.2.2 will NOT work with fschat==0.2.33; downgrade fschat to 0.2.23

export HUGGINGFACE_HUB_CACHE=/data/models
pip install vllm==0.2.2 fschat==0.2.23
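Optionally, verify the pinned versions before starting the server. A minimal sketch using only the Python standard library:

# quick sanity check of the pinned versions (a sketch, not part of the deployment itself)
from importlib.metadata import version

print("vllm  ", version("vllm"))    # expect 0.2.2
print("fschat", version("fschat"))  # expect 0.2.23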

Deploy WizardLM/WizardLM-13B-V1.2 on 4 Tesla V100 GPUs

python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
    --model WizardLM/WizardLM-13B-V1.2 --tensor-parallel-size 4

The output will look like this:

INFO 11-26 09:01:11 api_server.py:638] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, model='WizardLM/WizardLM-13B-V1.2', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2023-11-26 09:01:12,911 WARNING services.py:1996 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-11-26 09:01:13,063 INFO worker.py:1673 -- Started a local Ray instance.
INFO 11-26 09:01:13 llm_engine.py:72] Initializing an LLM engine with config: model='WizardLM/WizardLM-13B-V1.2', tokenizer='WizardLM/WizardLM-13B-V1.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=None, seed=0)
INFO 11-26 09:01:58 llm_engine.py:207] # GPU blocks: 2474, # CPU blocks: 1310
INFO:     Started server process [3863]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
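Once Uvicorn is up, the OpenAI-compatible /v1/models route makes a quick liveness check. A minimal sketch using only the standard library, assuming the server started above is reachable at localhost:8000:

# liveness check against the OpenAI-compatible /v1/models route
# (assumes the server above is reachable at http://localhost:8000)
import json
from urllib.request import urlopen

with urlopen("http://localhost:8000/v1/models") as resp:
    payload = json.load(resp)

print([m["id"] for m in payload["data"]])  # expect ['WizardLM/WizardLM-13B-V1.2']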

The Python client to call the LLM

pip install -U openai==1.3.5
#import openai
#openai.api_key = "EMPTY"
#openai.base_url = 'http://<service>.<namespace>.svc.cluster.local/v1/'

import json
import openai
from openai import OpenAI

assert openai.__version__ == "1.3.5"

client = OpenAI(
    api_key="EMPTY",  # placeholder; this vLLM server does not enforce API keys
    base_url="http://wayspot-util-service.wayspot-nomination.svc.cluster.local/v1/",
)


# ask the server which models it serves and grab the id of the first (only) one
models = json.loads(client.models.list().json())
print(models)
model_id = models['data'][0]['id']  # "WizardLM/WizardLM-13B-V1.2"
completion = client.completions.create(model=model_id,
                                       prompt="plan a 3 day trip to Tokyo")

print(completion.choices[0].text)
print(dict(completion).get('usage'))
print(completion.model_dump_json(indent=4))


response = client.chat.completions.create(
  model=model_id,
  response_format={ "type": "json_object" },
  messages=[
    {"role": "system", "content": "You are a helpful assistant to write python code for me."},
    {"role": "user", "content": "implement an application in python that prints 'hello,world'"}
  ]
)

print(response.choices[0].message.content)
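
For long generations, streaming is usually more responsive than waiting for the full reply. A minimal sketch that reuses the client and model_id from above; it assumes the chat endpoint accepts stream=True, which the vLLM OpenAI-compatible server generally supports:

# streaming variant of the chat call above
stream = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "plan a 3 day trip to Tokyo"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the first and last) carry no content
        print(delta, end="", flush=True)
print()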