The mechanistic basis of data dependence and abrupt learning in an in-context classification task (Gautam Reddy, 2024)
Transformers Learn In-Context by Gradient Descent (von Oswald, 2023)
Data Distributional Properties Drive Emergent In-Context Learning in Transformers (Chan et al., NeurIPS 2022)
Burstiness: Items appear in clusters, rather than being uniformly distributed over time
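A minimal sketch of what a bursty vs. uniform context might look like in a toy classification episode (the class names, context length, and burst size are illustrative, not the paper's exact data pipeline):

```python
import random

def sample_context(classes, context_len=8, bursty=True, burst_size=3, rng=random):
    """Sample a toy context of class labels for one episode.

    bursty=True:  the query's class appears in a cluster of `burst_size` items,
                  so the context is informative about the query.
    bursty=False: context classes are drawn uniformly, so the query's class is
                  usually not over-represented in the context.
    """
    query_class = rng.choice(classes)
    if bursty:
        context = [query_class] * burst_size
        context += [rng.choice(classes) for _ in range(context_len - burst_size)]
        rng.shuffle(context)
    else:
        context = [rng.choice(classes) for _ in range(context_len)]
    return context, query_class

classes = [f"class_{i}" for i in range(16)]
print(sample_context(classes, bursty=True))   # query class repeated in context
print(sample_context(classes, bursty=False))  # query class rarely repeated
```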
Claims: ICL emerges when training data exhibits particular distributional properties
- Better in-context learning with more burstiness in training
- Burstiness in training data increases ICL, but decreases IWL
- Variation within class increases ICL, but decreases IWL: “making the generalization problem harder increases ICL”
- Also aligns with the hypothesis that when the input space isn’t well supported everywhere, the model is pushed to rely on context rather than on knowledge stored in its weights
- Over the course of training, ICL tends to decrease while IWL increases
- Training data doesn’t provide enough knowledge → forces model to learn from context
- As the model learns and generalizes to the training data, it becomes less reliant on context
- Models can simultaneously exhibit both ICL and IWL when trained on a skewed marginal distribution over classes (e.g., a Zipfian distribution)
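A runnable sketch of the skewed (Zipfian) class marginal behind this claim; the number of classes and the exponent `alpha` are illustrative. The intuition: frequent “head” classes are easy to memorize in the weights (IWL), while the long tail of rare classes is only handled well by reading the context (ICL).

```python
import numpy as np

def zipfian_class_probs(num_classes, alpha=1.0):
    """Zipf-like marginal over classes: p(class with rank k) ∝ 1 / k**alpha."""
    ranks = np.arange(1, num_classes + 1, dtype=float)
    probs = ranks ** (-alpha)
    return probs / probs.sum()

rng = np.random.default_rng(0)
probs = zipfian_class_probs(num_classes=100, alpha=1.0)
train_classes = rng.choice(100, size=10_000, p=probs)

counts = np.bincount(train_classes, minlength=100)
print("top-5 class counts:", np.sort(counts)[::-1][:5])      # a few dominant classes
print("classes seen < 10 times:", int((counts < 10).sum()))  # long tail of rare classes
```

In evaluation, ICL is typically probed with held-out classes whose labels can only be inferred from the context exemplars, while IWL is probed on trained classes given an uninformative context.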
The mechanistic basis of data dependence and abrupt learning in an in-context classification task (ICLR 2024 Oral)
Main Arguments:
- ICL emerges in transformers only under certain data distributions (burstiness, etc.), and ICL and IWL can coexist only if the class distribution is long-tailed. This suggests that, among sequence-model architectures, it is essentially the transformer that exhibits this behavior.
- However, that paper did not explore why the transformer is superior to other models here. Olsson et al. (2022) proposed the ‘induction head’ as an explanation. In ICLR 2024, this paper proposes a minimal setting that reproduces the ICL/IWL tradeoffs and provides a mechanistic explanation of how the induction head forms (a toy sketch of the induction-head pattern follows the reference below).
- In-context Learning and Induction Heads (Olsson et al., 2022)
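A toy, hand-coded version of the induction-head pattern Olsson et al. describe (prefix matching plus copying), written as a hard lookup over tokens rather than learned attention weights; the trained circuit in the paper is a two-layer attention composition, so this is only an illustrative sketch:

```python
def induction_head_predict(tokens):
    """Toy induction head: find the most recent earlier occurrence of the
    current token ("prefix matching") and predict the token that followed it
    ("copying").  Pattern: ... A B ... A  ->  predict B."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan the context backwards
        if tokens[i] == current:
            return tokens[i + 1]              # copy the successor of the match
    return None                               # no earlier match: nothing to copy

# "... the cat ... the" -> the head predicts "cat"
print(induction_head_predict(["the", "cat", "sat", "on", "the"]))
```

In the in-context classification setting, the same pattern lets the query attend to a matching exemplar in the context and copy its label.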