The mechanistic basis of data dependence and abrupt learning in an in-context classification task (Gautam Reddy, 2024)
Transformers Learn In-Context by Gradient Descent (von Oswald, 2023)
Data Distributional Properties Drive Emergent In-Context Learning in Transformers (Chan et al., NeurIPS 2022)
Burstiness: Items appear in clusters, rather than being uniformly distributed over time
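A minimal sketch of what a bursty vs. uniform context might look like in a toy classification episode (the class names, context length, and burst size are illustrative, not the paper's exact data pipeline):

```python
import random

def sample_context(classes, context_len=8, bursty=True, burst_size=3, rng=random):
    """Sample a toy context of class labels for one episode.

    bursty=True:  the query's class appears in a cluster of `burst_size` items,
                  so the context is informative about the query.
    bursty=False: context classes are drawn uniformly, so the query's class is
                  usually not over-represented in the context.
    """
    query_class = rng.choice(classes)
    if bursty:
        context = [query_class] * burst_size
        context += [rng.choice(classes) for _ in range(context_len - burst_size)]
        rng.shuffle(context)
    else:
        context = [rng.choice(classes) for _ in range(context_len)]
    return context, query_class

classes = [f"class_{i}" for i in range(16)]
print(sample_context(classes, bursty=True))   # query class repeated in context
print(sample_context(classes, bursty=False))  # query class rarely repeated
```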
Claims: ICL emerges when training data exhibits particular distributional properties
- Better in-context learning with more burstiness in training
- Burstiness in training data increases ICL, but decreases IWL
- Variation within class increases ICL, but decreases IWL: “making the generalization problem harder increases ICL”
- Also aligns with the hypothesis that when the input space isn’t well supported everywhere, the model is pushed to rely on context rather than on knowledge stored in its weights
- Over the course of training, ICL tends to decrease while IWL increases
- Training data doesn’t provide enough knowledge → forces model to learn from context
- As the model learns and generalizes to the training data, it becomes less reliant on context
- Models can simultaneously exhibit both ICL and IWL when trained on a skewed marginal distribution over classes (e.g., a Zipfian distribution)
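A runnable sketch of the skewed (Zipfian) class marginal behind this claim; the number of classes and the exponent `alpha` are illustrative. The intuition: frequent “head” classes are easy to memorize in the weights (IWL), while the long tail of rare classes is only handled well by reading the context (ICL).

```python
import numpy as np

def zipfian_class_probs(num_classes, alpha=1.0):
    """Zipf-like marginal over classes: p(class with rank k) ∝ 1 / k**alpha."""
    ranks = np.arange(1, num_classes + 1, dtype=float)
    probs = ranks ** (-alpha)
    return probs / probs.sum()

rng = np.random.default_rng(0)
probs = zipfian_class_probs(num_classes=100, alpha=1.0)
train_classes = rng.choice(100, size=10_000, p=probs)

counts = np.bincount(train_classes, minlength=100)
print("top-5 class counts:", np.sort(counts)[::-1][:5])      # a few dominant classes
print("classes seen < 10 times:", int((counts < 10).sum()))  # long tail of rare classes
```

In evaluation, ICL is typically probed with held-out classes whose labels can only be inferred from the context exemplars, while IWL is probed on trained classes given an uninformative context.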
The mechanistic basis of data dependence and abrupt learning in an in-context classification task (ICLR 2024 Oral)
Main Arguments:
- ICL emerges in transformers only under certain data distributions (burstiness, etc.), and ICL and IWL can coexist only if the class distribution is long-tailed. This suggests that, among sequence-model architectures, it is essentially the transformer that exhibits this behavior.
- However, that paper did not explore why the transformer is superior to other models here. Olsson et al. (2022) proposed the ‘induction head’ as an explanation. In ICLR 2024, this paper proposes a minimal setting that reproduces the ICL/IWL tradeoffs and provides a mechanistic explanation of how the induction head forms (a toy sketch of the induction-head pattern follows the reference below).
- In-context Learning and Induction Heads (Olsson et al., 2022)
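A toy, hand-coded version of the induction-head pattern Olsson et al. describe (prefix matching plus copying), written as a hard lookup over tokens rather than learned attention weights; the trained circuit in the paper is a two-layer attention composition, so this is only an illustrative sketch:

```python
def induction_head_predict(tokens):
    """Toy induction head: find the most recent earlier occurrence of the
    current token ("prefix matching") and predict the token that followed it
    ("copying").  Pattern: ... A B ... A  ->  predict B."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan the context backwards
        if tokens[i] == current:
            return tokens[i + 1]              # copy the successor of the match
    return None                               # no earlier match: nothing to copy

# "... the cat ... the" -> the head predicts "cat"
print(induction_head_predict(["the", "cat", "sat", "on", "the"]))
```

In the in-context classification setting, the same pattern lets the query attend to a matching exemplar in the context and copy its label.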