Understanding temperature, top-p and top-k in LLMs

When generating text with large language models (LLMs), temperature, top-p (nucleus sampling), and top-k are parameters used to control the randomness and diversity of the generated output. Each of these parameters influences the probability distribution from which the next token (word or subword) is sampled. Here’s a breakdown of how each parameter is implemented internally:

1. Temperature

Temperature is a parameter that adjusts the probability distribution of the next token by scaling the logits (raw scores) output by the model.

Logits Scaling:

The model generates a logit for each possible token in the vocabulary. The temperature parameter 𝑇 is used to scale these logits before the softmax. The formula for adjusting the logits is:

scaled_logits = logits / 𝑇

Softmax Application:

After scaling, the adjusted logits are passed through a softmax function to convert them into probabilities.

Example:

import torch
from torch.nn.functional import softmax

logits = torch.tensor([9.0, 5.0, 1.0])

# T = 1 (no scaling)
softmax(logits, dim=0)        # tensor([9.8169e-01, 1.7980e-02, 3.2932e-04])

# T = 3: flatter distribution
softmax(logits / 3, dim=0)    # tensor([0.7501, 0.1977, 0.0521])

# T = 0.5: peakier distribution
softmax(logits / 0.5, dim=0)  # tensor([9.9966e-01, 3.3535e-04, 1.1250e-07])

From these observations we can see:

  • When 𝑇 > 1, the logits are divided by a larger number, making the distribution more uniform (more exploration, less confidence).
  • When 𝑇 < 1, the logits are divided by a smaller number, making the distribution peakier (more exploitation, more confidence).

2. Top-k Sampling

Top-k sampling involves truncating the probability distribution to only the top 𝑘 most likely tokens, then sampling from this truncated distribution.
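
A minimal sketch of that truncation step is shown below; the helper name sample_top_k and the toy logits are illustrative assumptions, not part of any particular library.

import torch
from torch.nn.functional import softmax

def sample_top_k(logits, k):
    # Keep only the k largest logits; everything else is discarded.
    top_values, top_indices = torch.topk(logits, k)
    # Renormalize over the surviving tokens and sample one of them.
    probs = softmax(top_values, dim=0)
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice].item()

logits = torch.tensor([9.0, 5.0, 1.0, 0.5])
sample_top_k(logits, k=2)  # always returns index 0 or 1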

3. Top-p Sampling (Nucleus Sampling)

Top-p sampling dynamically selects the smallest set of tokens whose cumulative probability mass reaches a threshold 𝑝, rather than a fixed number of tokens, and samples from that set.
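
Here is a sketch of that idea; the helper name sample_top_p and the toy logits are again illustrative assumptions.

import torch
from torch.nn.functional import softmax

def sample_top_p(logits, p):
    # Sort tokens by probability, highest first.
    probs = softmax(logits, dim=0)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    # Keep the smallest prefix whose cumulative mass reaches p (the "nucleus").
    cumulative = torch.cumsum(sorted_probs, dim=0)
    cutoff = int((cumulative < p).sum().item()) + 1
    # Renormalize over the nucleus and sample from it.
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus, num_samples=1)
    return sorted_indices[choice].item()

logits = torch.tensor([9.0, 5.0, 1.0, 0.5])
sample_top_p(logits, p=0.9)  # with these logits the nucleus is just token 0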

4. Summary

  • Temperature adjusts the sharpness of the probability distribution, making it either more uniform or more peaked.
  • Top-k limits the sampling to the top 𝑘 most probable tokens, thereby narrowing the choice pool.
  • Top-p selects a subset of tokens such that their cumulative probability mass exceeds a threshold 𝑝, allowing for more dynamic adjustment based on the distribution shape.

Each of these techniques can be used alone or in combination to balance between generating coherent, high-probability text and introducing diversity and creativity in the generated output.
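
To make that interplay concrete, below is a hedged sketch that applies all three in the order many sampling loops use (temperature first, then top-k, then top-p); the function signature and its defaults are illustrative assumptions, not a reference implementation.

import torch
from torch.nn.functional import softmax

def sample(logits, temperature=1.0, top_k=0, top_p=1.0):
    # 1. Temperature: rescale the logits.
    logits = logits / temperature
    # 2. Top-k: mask everything outside the k largest logits (0 disables it).
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = softmax(logits, dim=0)
    # 3. Top-p: zero out the tail once cumulative mass reaches p (1.0 disables it).
    if top_p < 1.0:
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=0)
        keep = int((cumulative < top_p).sum().item()) + 1
        mask = torch.zeros_like(probs)
        mask[sorted_indices[:keep]] = 1.0
        probs = probs * mask
    # Renormalize after truncation and sample a single token index.
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([9.0, 5.0, 1.0, 0.5])
sample(logits, temperature=0.7, top_k=3, top_p=0.95)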