Generalization in diffusion models arises from geometry-adaptive harmonic representations (Kadkhodaie et al., 2023)
- Model variance vanishes as $N$ increases → the density implicitly represented by the DNN becomes independent of the training set
- Increasing the training set size $N$ substantially improves performance on the test set while worsening performance on the train set (memorization gives way to generalization); see the sketch below
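A minimal sketch (not the paper's code) of how the vanishing-variance claim could be probed: train two denoisers on non-overlapping subsets of the same size $N$ and measure how closely their outputs agree on held-out noisy images. `model_a`, `model_b`, and the data tensors are assumed placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def denoiser_agreement(model_a, model_b, clean_images, sigma):
    """Mean cosine similarity between the denoised estimates of two models.

    model_a, model_b : denoisers mapping x_noisy -> x_hat, assumed to be
                       trained on disjoint subsets of the same size N.
    clean_images     : held-out test images, shape (B, C, H, W).
    sigma            : noise level used for the comparison.
    """
    noisy = clean_images + sigma * torch.randn_like(clean_images)
    out_a = model_a(noisy)
    out_b = model_b(noisy)
    return F.cosine_similarity(out_a.flatten(1), out_b.flatten(1), dim=1).mean()

# Expectation from the paper: agreement approaches 1 as N grows, i.e. the
# variance across training subsets vanishes and both models represent the
# same density.
```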
Questions / Thoughts:
- The subsets may still be statistically similar despite being “non-overlapping” → both models learn the same distribution
- Model trained on 5,000 vs. 10,000 samples: if the influential samples support the same mode, removing 5,000 samples can still yield near-identical generations
- The model is supported by only a small number of (influential) samples
- D-TRAK: why does removing only ~32 samples work in the counterfactual experiment? (see the sketch below)
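A hedged sketch of the counterfactual check referenced above: given per-sample attribution scores (e.g. from D-TRAK; computing them is outside this sketch), drop the top-k most influential samples and retrain. `dtrak_scores`, `train_diffusion`, and `generate` are hypothetical placeholders, not an actual API.

```python
import numpy as np

def counterfactual_keep_indices(num_samples, scores, k=32):
    """Indices that remain after removing the k highest-attribution samples."""
    top_k = np.argsort(scores)[-k:]                      # most influential samples
    return np.setdiff1d(np.arange(num_samples), top_k)   # samples to keep

# keep_idx      = counterfactual_keep_indices(len(train_set), dtrak_scores, k=32)
# model_full    = train_diffusion(train_set)              # hypothetical helper
# model_ablated = train_diffusion(train_set[keep_idx])    # retrain without top-k
# Comparing generate(model_full, z) vs. generate(model_ablated, z) for the same
# latent z shows whether those ~32 samples were counterfactually influential:
# a large change in the generated image indicates they were.
```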
Follow-up questions:
- Are the samples at later timesteps the influential ones? → Ablation studies with retraining
- If learning in diffusion models has “phases,” would it be more efficient to train first on the later timesteps, then shift to the middle and early ones? (rough sketch below)
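A rough sketch of that curriculum idea, purely as an assumption: replace the usual uniform timestep sampling in DDPM training with a window that slides from late (high-noise) to early (low-noise) steps as training progresses.

```python
import torch

def sample_timesteps(batch_size, step, total_steps, T=1000, device="cpu"):
    """Sample t from a window that slides from [2T/3, T) down to [0, T/3)."""
    frac = step / max(total_steps - 1, 1)      # training progress, 0 -> 1
    hi = int(T * (1.0 - 2.0 * frac / 3.0))     # upper bound: T -> T/3
    lo = max(hi - T // 3, 0)                   # window width roughly T/3
    return torch.randint(lo, hi, (batch_size,), device=device)

# In a standard DDPM loss, this would replace t ~ Uniform[0, T):
#   t = sample_timesteps(x.size(0), step, total_steps, device=x.device)
# The ablation in the note is the comparison of this schedule (and its
# reverse) against plain uniform timestep sampling, with retraining.
```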