Key Ideas
- At runtime, text generation is a sequence of discrete decisions; a 'bad' but individually more likely decision early on can prevent desired, and overall more probable, sequences from ever being generated
Notes
Single token generation (Naive)
Steps:
- Get a prefix
- Pass to language model
- Softmax over the last token's logits produces a distribution over the vocabulary
- Make a decision on this distribution to select the next token
- e.g. sample, take the highest-probability token, etc. (a minimal sketch follows this list)
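A minimal sketch of a single decoding step, assuming a Hugging Face-style causal LM; the gpt2 checkpoint and the prefix string are just placeholder choices:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "The capital of France is"
input_ids = tokenizer(prefix, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits            # shape: (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)    # distribution over the vocabulary

# Decision rule on the distribution: take the argmax, or sample instead.
next_token = torch.argmax(probs)                # highest-probability token
# next_token = torch.multinomial(probs, 1)[0]   # sampled token
print(tokenizer.decode(next_token.item()))
```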
Greedy Decoding
- Repeat the naive approach until the desired postfix length is reached or an <EOS> token is generated (see the loop sketched below)
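A sketch of the greedy loop, reusing the imports and model from the single-step sketch above (the function and parameter names are my own):

```python
def greedy_decode(model, tokenizer, prefix, max_new_tokens=50):
    """Repeat the single-step argmax decision until <EOS> or the length budget."""
    input_ids = tokenizer(prefix, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits
        next_id = torch.argmax(logits[0, -1])                    # most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:             # stop at <EOS>
            break
    return tokenizer.decode(input_ids[0])
```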
Cons
- Space of outputs increases exponentially with prefix AND postfix size
- Cannot arrive at phrases that have a higher probability as a whole, but whose first few individual tokens are uncommon/unlikely
Beam Search
- Explore several hypotheses by keeping the top k sequences after each decoding step (k is ideally 5-10); a simplified implementation is sketched after this list
- Prune all other lower-probability branches
- Stop when all k paths have reached the <EOS> token
- Often, length normalisation is applied to the probability calculation to prevent an unfair advantage for shorter terminated sequences
- For sequences with minimum length requirements, the probability of <EOS> can be set to 0 if it appears prematurely
- This method still does not guarantee the most probable sequence; a discarded option might have led to a more probable one
- k = 1 is the same as greedy decoding
- Low k yields incorrect outputs
- High k yields short/generic outputs, and is expensive to run
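A simplified beam-search sketch under the same assumptions as above (no batching; log-probabilities are summed and length-normalised when pruning; all names and the length_alpha parameter are my own):

```python
import torch
import torch.nn.functional as F

def beam_search(model, tokenizer, prefix, k=5, max_new_tokens=50, length_alpha=1.0):
    eos = tokenizer.eos_token_id
    start = tokenizer(prefix, return_tensors="pt").input_ids[0]
    beams = [(start, 0.0, False)]                 # (tokens, sum of log-probs, finished?)

    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score, done in beams:
            if done:                              # finished beams are carried forward unchanged
                candidates.append((tokens, score, True))
                continue
            with torch.no_grad():
                logits = model(tokens.unsqueeze(0)).logits[0, -1]
            log_probs = F.log_softmax(logits, dim=-1)
            top_lp, top_id = log_probs.topk(k)    # expand only the k best continuations
            for lp, tok in zip(top_lp, top_id):
                new = torch.cat([tokens, tok.view(1)])
                candidates.append((new, score + lp.item(), tok.item() == eos))
        # Prune: keep the k best hypotheses, length-normalised so that short
        # terminated sequences do not get an unfair advantage.
        candidates.sort(key=lambda c: c[1] / (len(c[0]) ** length_alpha), reverse=True)
        beams = candidates[:k]
        if all(done for _, _, done in beams):     # stop when all k paths hit <EOS>
            break

    best = max(beams, key=lambda c: c[1] / (len(c[0]) ** length_alpha))
    return tokenizer.decode(best[0])
```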
Sampling based Decoding
- Ancestral sampling: randomly sample the next word from the full distribution at every time step t (all three variants are sketched after this list)
- Top-n sampling: randomly sample from the truncated probability distribution over the n most likely words
- Nucleus sampling / top-p sampling: randomly sample from the truncated probability distribution over the top words whose cumulative probability stays below p
- p = 1 (i.e. 100%) is ancestral sampling; p = 0 is greedy decoding
- Generation ends when <EOS> is sampled or the max number of tokens is generated
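A sketch of a single sampling step covering all three variants; logits is assumed to be the last-token logits from the model, and the function and parameter names are illustrative:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_n=None, top_p=None):
    """Sample one token id from the (optionally truncated) softmax(logits / temperature)."""
    probs = F.softmax(logits / temperature, dim=-1)

    if top_n is not None:                         # top-n: keep only the n most likely words
        values, indices = probs.topk(top_n)
        probs = torch.zeros_like(probs).scatter_(0, indices, values)

    if top_p is not None:                         # nucleus: keep top words until cumulative prob passes p
        sorted_probs, sorted_idx = probs.sort(descending=True)
        cumulative = sorted_probs.cumsum(dim=-1)
        keep = cumulative - sorted_probs <= top_p  # always keeps at least the top token
        probs = torch.zeros_like(probs).scatter_(0, sorted_idx[keep], sorted_probs[keep])

    probs = probs / probs.sum()                   # renormalise the truncated distribution
    return torch.multinomial(probs, num_samples=1).item()
```

With temperature=1.0 and no truncation this is ancestral sampling; top_p=0 keeps only the most likely token, i.e. greedy decoding.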
Misc
- Temperature (t) is used to flatten or sharpen the peaks in a distribution: logits are divided by t before the softmax. Higher t flattens the distribution, giving tailing entries more likelihood; lower t sharpens it (small illustration below).
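A tiny illustration of the effect of temperature (the logits here are made-up toy values, not from a real model):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 2.0, 1.0, 0.5])      # toy logits

for t in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / t, dim=-1)
    print(t, [round(p, 3) for p in probs.tolist()])
# t < 1 sharpens the peak (mass concentrates on the top token);
# t > 1 flattens it, giving tail tokens more probability.
```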
Resources
- Dr. Mohit Iyyer’s UMass CS685 S24 Lecture 13
