Composition of Memory Experts for Diffusion World Models

Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro

Published as a conference paper at ICLR 2026

Visual Examples

Qualitative Results

MemoryMaze

After memorizing a trajectory through the maze, the model is given only a single context frame and generates frames that accurately reflect the correct turns and re-entries. With absolute position encoding, our method preserves spatial consistency and successfully reconstructs complex navigation paths.

RealEstate10K

Given 50 context frames (highlighted with a blue border), we visualize the next 4 rollouts. Our method produces coherent forward trajectories that reflect consistent agent movement and scene structure.

DMLab

Given 50 context frames (highlighted with a blue border), we visualize the next 4 rollouts. Our method produces coherent forward trajectories that reflect consistent agent movement and scene structure.

MemoryCards

The model is first shown a sequence of uncovering and covering actions in which every tile is visible at some point. At the end of the sequence, it is evaluated on its ability to regenerate the occluded tiles (shown as 'target'). Our method recalls these tiles more accurately, demonstrating effective discrete memory recall.

Citation

If you find our work useful, please cite:

@inproceedings{stapf2026come,
  title     = {Composition of Memory Experts for Diffusion World Models},
  author    = {Stapf, Sebastian and Acuaviva Huertos, Pablo and Davtyan, Aram and Favaro, Paolo},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}