Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro
Published as a conference paper at ICLR 2026
Visual Examples
After memorizing a trajectory through the maze, and only given one context frame, the model generates frames that accurately reflect the correct turns and re-entries. Using absolute position encoding, our method preserves spatial consistency and successfully reconstructs complex navigation paths.
Given 50 context frames, here highlighted with a blue border, we visualize the next 4 rollouts. Our method produces coherent forward trajectories that reflect consistent agent movement and scene structure.
At the end of the sequence, the model is evaluated on its ability to regenerate occluded tiles. After being shown a sequence of uncovering and covering actions, such that all tiles were visible at some point, our method more accurately recalls the occluded tiles, here given as 'target', demonstrating effective discrete memory recall.
If you find our work useful, please cite:
@inproceedings{stapf2026come,
title = {Composition of Memory Experts for Diffusion World Models},
author = {Stapf, Sebastian and Acuaviva Huertos, Pablo and Davtyan, Aram and Favaro, Paolo},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026}
}