The mechanistic basis of data dependence and abrupt learning in an in-context classification task (Gautam Reddy, 2024)

Transformers Learn In-Context by Gradient Descent (von Oswald, 2023)

Data Distributional Properties Drive Emergent In-Context Learning in Transformers (Chan et al., NeurIPS 2022)

Burstiness: Items appear in clusters, rather than being uniformly distributed over time
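
For concreteness, here is a minimal Python sketch of what a bursty training context might look like versus a uniform control. This is my own illustration, not the paper's released code; the class count is an assumption, and the 8-item context with the query's class recurring 3 times loosely follows Chan et al.'s setup:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 1000   # assumed number of item classes (not from the paper)
CONTEXT_LEN = 8      # context items preceding the query, as in Chan et al.

def bursty_context(query_class: int) -> np.ndarray:
    """Bursty context: the query's class and one distractor class each recur 3x."""
    distractor = int(rng.integers(NUM_CLASSES))
    filler = rng.integers(NUM_CLASSES, size=CONTEXT_LEN - 6)
    items = np.concatenate([[query_class] * 3, [distractor] * 3, filler])
    rng.shuffle(items)  # burstiness is about recurrence, not ordering
    return items

def uniform_context() -> np.ndarray:
    """Non-bursty control: every context item's class drawn independently."""
    return rng.integers(NUM_CLASSES, size=CONTEXT_LEN)

query = int(rng.integers(NUM_CLASSES))
print("bursty :", bursty_context(query), "query class:", query)
print("uniform:", uniform_context(), "query class:", query)
```

In the bursty condition the model can answer the query by matching it against repeated context items (favoring ICL); in the uniform condition the context is uninformative, so only weight-based memorization helps.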

Claims: ICL emerges when the training data exhibits particular distributional properties, such as burstiness and a long-tailed class distribution

The mechanistic basis of data dependence and abrupt learning in an in-context classification task (ICLR 2024 Oral)

Main Arguments:

  1. ICL emerges in transformers only under certain data distributions (burstiness, etc.), and ICL and in-weights learning (IWL) can coexist only if the data distribution is long-tailed. Among the architectures tested, only transformers exhibited this behavior, which suggests the architecture itself matters for language modeling.
  2. However, the paper did not explore why transformers are better suited to ICL than other architectures. Olsson et al. (2022) proposed the 'induction head' to explain this; a minimal sketch of the induction-head pattern follows after this list. At ICLR 2024, Reddy proposed a minimalistic setting that reproduces the ICL/IWL tradeoffs and provides a mechanistic explanation of how the induction head forms.
    1. In-context Learning and Induction Heads (Olsson et al., 2022)
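
To make the induction-head mechanism concrete, here is a minimal Python sketch of the completion pattern it implements, [A][B] ... [A] → predict [B]. This is my own illustration; Olsson et al. describe attention heads inside a transformer, not explicit code:

```python
from typing import Optional

def induction_head_predict(tokens: list[str]) -> Optional[str]:
    """Predict the next token by matching the last token against history:
    ... [A][B] ... [A]  ->  predict [B]."""
    current = tokens[-1]
    # Emulate the two-step attention circuit with a backward scan: find an
    # earlier occurrence of the current token, then copy its successor.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed the match
    return None  # no earlier occurrence: the head has nothing to copy

# Example: the head completes the repeated bigram "Mr Dursley".
print(induction_head_predict(["Mr", "Dursley", "said", "hello", "Mr"]))  # Dursley
```

In the actual circuit described by Olsson et al., this takes two attention steps: a previous-token head writes each token's predecessor into its position, and the induction head queries on the current token to find a position whose predecessor matches, then copies the token stored there.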