
Understand temperature, top-p, and top-k in LLMs

When generating text with large language models (LLMs), temperature, top-p (nucleus sampling), and top-k are parameters used to control the randomness and diversity of the generated output. Each of these parameters influences the probability distribution from which the next token (word or subword) is sampled. Here’s a breakdown of how each parameter is implemented internally:

1. Temperature

Temperature is a parameter that adjusts the probability distribution of the next token by scaling the logits (raw scores) output by the model.
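
Concretely, the scaled distribution is p_i = exp(z_i / T) / sum_j exp(z_j / T), where z are the logits: T < 1 sharpens the distribution toward the highest-scoring tokens, and T > 1 flattens it toward uniform. A minimal NumPy sketch of the idea (a simplified illustration, not any particular library's implementation):

import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Scale logits by 1/temperature, softmax, then sample one token id."""
    scaled = logits / max(temperature, 1e-8)  # guard against T == 0
    scaled -= scaled.max()                    # stabilize the softmax numerically
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])      # toy logits for a 4-token vocab
print(sample_with_temperature(logits, temperature=0.7))

As T approaches 0 this degenerates into greedy decoding (always the argmax token). Top-k and top-p then act by filtering this distribution down to a smaller candidate set before sampling.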

Serve Llama 3 with vLLM

Two example launch commands for vLLM's OpenAI-compatible API server: an 8B long-context Llama 3 variant on a single GPU, and a 70B AWQ-quantized variant sharded across four GPUs.

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
--gpu-memory-utilization 0.80 --dtype bfloat16 \
--model gradientai/Llama-3-8B-Instruct-Gradient-4194k
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.80 \
--dtype float16 \
--quantization awq \
--model casperhansen/llama-3-70b-instruct-awq
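
Once a server is up, the OpenAI-compatible endpoint can be smoke-tested over plain HTTP. A minimal sketch using requests, assuming the first (8B) server above is running locally; the model field must match the --model value passed to vLLM:

import requests

# assumes the first server above is running on localhost:8000
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "gradientai/Llama-3-8B-Instruct-Gradient-4194k",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7,
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])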

Build a Docker image with CUDA and conda support

1. environment.yaml

name: main
channels:
  - defaults
  - conda-forge
  - nvidia
  - pytorch
dependencies:
  - python==3.10
  - pip
  - numpy
  - pandas
  - pyarrow
  - grpcio
  - grpcio-tools
  - protobuf
  - pip:
    - vllm==0.3.0
    - google-cloud-bigquery==3.17.2
    - google-cloud-storage==2.14.0
    - google-cloud-aiplatform==1.41.0
    - google-auth==2.27.0
    - autoawq
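
After the env is created, a quick way to confirm the pins in this file resolved is to import the packages inside the "main" env and print their versions; a minimal sketch:

# run inside the "main" conda env; versions should match the pins above
import vllm
import google.cloud.bigquery

print("vllm:", vllm.__version__)                        # expect 0.3.0
print("bigquery:", google.cloud.bigquery.__version__)   # expect 3.17.2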

2. requirements.txt

gradio
numpy
pandas
transformers
peft
accelerate
bitsandbytes
trl
xformers
wandb
datasets
einops
sentencepiece
pyarrow
langchain
python-dotenv
ray
sentence_transformers
BeautifulSoup4
grpcio
grpcio-tools
pymilvus
protobuf
jinja2
#litellm
vllm
openai
google-cloud-bigquery
google-cloud-storage
google-auth

3. Dockerfile

# 
# reference: https://stackoverflow.com/questions/65492490/how-to-conda-install-cuda-enabled-pytorch-in-a-docker-container
#FROM nvidia/cuda:12.1.1-runtime-ubuntu20.04

FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

ENV HOME /app

RUN mkdir /app
WORKDIR /app


# set bash as current shell
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]

##################################
# install utils
##################################
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata
RUN apt install -y software-properties-common curl jq

##################################
# install gcloud
##################################
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz
RUN mkdir -p /app/gcloud \
  && tar -C /app/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
  && /app/gcloud/google-cloud-sdk/install.sh
ENV PATH $PATH:/app/gcloud/google-cloud-sdk/bin

##################################
# install conda
##################################
ARG DEFAULT_ENV=main

RUN curl -OL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh  -b -f -p ${HOME}/miniconda3/ 
RUN ln -s ${HOME}/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh
RUN echo ". ${HOME}/miniconda3/etc/profile.d/conda.sh" >> ${HOME}/.bashrc

ENV PATH ${HOME}/miniconda3/bin/:$PATH
ENV CONDA_DEFAULT_ENV ${DEFAULT_ENV}
ENV PATH ${HOME}/miniconda3/envs/${DEFAULT_ENV}/bin:$PATH

# create a conda env with name "main"
COPY environment.yaml environment.yaml
#RUN conda env update -f base.environment.yaml
RUN conda env create -f environment.yaml
RUN echo "conda activate main" >> ${HOME}/.bashrc
#RUN echo "export HUGGINGFACE_HUB_CACHE=/data/models" >> ${HOME}/.bashrc
#ENV HUGGINGFACE_HUB_CACHE
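
To verify that a container built from this image actually sees the GPU, a small check can be run inside a container started with --gpus all. This sketch assumes PyTorch is present (vllm pulls it in as a dependency):

# gpu_check.py - run inside the container,
# e.g. docker run --gpus all <image> python gpu_check.py
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))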

4. build.sh


# this script will build a base docker image for CUDA
# us-central1-docker.pkg.dev/<PROJECT_ID>/prod/cuda-python:0.0.1
set -e # fail on any errors

PROJECT="<PROJECT_ID>"
ARTIFACT_REGISTRY_ROOT="us-central1-docker.pkg.dev"
ARTIFACT_REGISTRY_REPOSITORY="prod"

APP_NAME="cuda-python"
APP_VERSION="0.0.1"

export HUGGINGFACE_HUB_CACHE="/data/models"

# cuda-12.1.1, python-3.10, transformers
export MODEL_IMAGE_ID=${ARTIFACT_REGISTRY_ROOT}/${PROJECT}/${ARTIFACT_REGISTRY_REPOSITORY}/${APP_NAME}:${APP_VERSION}

echo "Image: ${MODEL_IMAGE_ID}"

# build and push images
docker buildx build --platform linux/amd64 --push -t ${MODEL_IMAGE_ID} .

Deploy a vLLM-hosted LLM on k8s

In this post we show the steps to deploy an LLM with vLLM on a GCP GKE environment.

1. Build a base Docker image with CUDA and conda

This Docker image sets up the environment needed to host an LLM:

  1. cuda:12.1.1
  2. miniconda
  3. a conda env named “main”, with Python 3.10 and some Python libraries

1.1 The Dockerfile

FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04

ENV HOME /app
RUN mkdir /app
WORKDIR /app

# set bash as current shell
RUN chsh -s /bin/bash
SHELL ["/bin/bash", "-c"]

##################################
# install utils
##################################
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata
RUN apt install -y software-properties-common curl jq

##################################
# install gcloud
##################################
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz
RUN mkdir -p /app/gcloud \
  && tar -C /app/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
  && /app/gcloud/google-cloud-sdk/install.sh
ENV PATH $PATH:/app/gcloud/google-cloud-sdk/bin

##################################
# install conda
##################################
ARG DEFAULT_ENV=main

RUN curl -OL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN bash Miniconda3-latest-Linux-x86_64.sh  -b -f -p /app/miniconda3/ 
RUN ln -s /app/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh
RUN echo ". /app/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc

ENV PATH /app/miniconda3/bin/:$PATH
ENV CONDA_DEFAULT_ENV ${DEFAULT_ENV}
ENV PATH /app/miniconda3/envs/${DEFAULT_ENV}/bin:$PATH

# create a conda env with name "main"
COPY main.environment.yaml environment.yaml
#RUN conda env update -f base.environment.yaml
RUN conda env create -f environment.yaml
RUN echo "conda activate main" >> ~/.bashrc

1.2 The environment.yaml for the conda env “main”

name: main
channels:
  - defaults
  - conda-forge
  - nvidia
  - pytorch
dependencies:
  - python==3.10
  - pip
  - numpy
  - pandas
  - pyarrow
  - grpcio
  - grpcio-tools
  - protobuf
  - pip:
    - vllm==0.3.0
    - transformers==4.37.2
    - google-cloud-bigquery==3.17.2
    - google-cloud-storage==2.14.0
    - google-cloud-aiplatform==1.41.0
    - google-auth==2.27.0

1.3 The build script

set -e # fail on any errors

PROJECT=<GCP_PROJECT_ID>
ARTIFACT_REGISTRY_ROOT="us-central1-docker.pkg.dev"
ARTIFACT_REGISTRY_REPOSITORY="prod"

export NAMESPACE_ID=<K8S_NAMESPACE_ID>
export APP_NAME="cuda-python"
export APP_VERSION="0.0.1"
export DEPLOYMENT_NAME="${APP_NAME}-deployment"

# cuda-12.1.1, python-3.10, transformers
export MODEL_IMAGE_ID=${ARTIFACT_REGISTRY_ROOT}/${PROJECT}/${ARTIFACT_REGISTRY_REPOSITORY}/${APP_NAME}:${APP_VERSION}

echo "Image: ${MODEL_IMAGE_ID}"

# build and push images
docker buildx build --platform linux/amd64 --push -t ${MODEL_IMAGE_ID} .
echo "Done"

2. Deploy a vLLM service

Now we host an LLM, using “TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ” as an example.
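
Before writing the k8s manifests, it can be worth loading the model once with vLLM's offline Python API as a local smoke test. A sketch, assuming enough GPU memory for the AWQ weights; the sampling knobs are the temperature/top-p/top-k parameters discussed at the top of this page:

from vllm import LLM, SamplingParams

# load the AWQ-quantized Mixtral; add tensor_parallel_size=N if sharding across GPUs
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    dtype="float16",
)
params = SamplingParams(temperature=0.7, top_p=0.9, top_k=50, max_tokens=128)
outputs = llm.generate(["[INST] What is nucleus sampling? [/INST]"], params)
print(outputs[0].outputs[0].text)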