Generative Powers of Ten

1University of Washington, 2UC Berkeley, 3Google Research

CVPR 2024 (Highlight)

TL;DR: Given a set of text prompts describing a scene at various scales, our method creates a seamless zooming video.


Abstract

We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. This representation allows us to render continuously zooming videos, or explore different scales of the scene interactively. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.

Method

Our method uses a pre-trained diffusion model to jointly denoise multiple images of the scene at various scales. Noisy images from each zoom level, along with their respective prompts, are fed simultaneously into the same pre-trained diffusion model, which returns estimates of the corresponding clean images. These estimates may be inconsistent in the overlapping regions observed by multiple zoom levels. We therefore employ multi-resolution blending to fuse these regions into a consistent zoom stack, and re-render the individual zoom levels from this consistent representation. The re-rendered images are then used as the clean-image estimates in the DDPM sampling step.
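The sketch below summarizes one joint sampling step. It is our own illustration, not the authors' code: denoise, blend, and render are hypothetical stand-ins for the pre-trained text-conditioned model, the multi-resolution blending operator, and the zoom-stack renderer, and we use a deterministic DDIM-style update in place of the full stochastic DDPM step for brevity.

    import torch

    def ddim_update(z_t, x0, abar_t, abar_prev):
        # Deterministic (eta = 0) update from timestep t to t-1,
        # given a clean-image estimate x0.
        eps = (z_t - abar_t.sqrt() * x0) / (1.0 - abar_t).sqrt()
        return abar_prev.sqrt() * x0 + (1.0 - abar_prev).sqrt() * eps

    def joint_multiscale_step(z_t, prompts, t, abar, denoise, blend, render):
        # z_t: list of N noisy images, one per zoom level (coarsest first).
        # 1. Denoise every zoom level independently with its own prompt.
        x0 = [denoise(z, p, t) for z, p in zip(z_t, prompts)]
        # 2. Fuse the (possibly inconsistent) overlapping regions into a
        #    single consistent zoom stack via multi-resolution blending.
        stack = blend(x0)
        # 3. Re-render each zoom level from the consistent stack and use
        #    the renders as the clean-image estimates in the update.
        x0 = [render(stack, i) for i in range(len(z_t))]
        return [ddim_update(z, x, abar[t], abar[t - 1])
                for z, x in zip(z_t, x0)]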

More Results







Zooming into a real image

We can guide one zoom level to match an input image, allowing us to zoom into a real image.
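One simple way to express such a constraint, continuing the sketch above and again only as our own illustration (not necessarily the paper's exact mechanism): before blending, overwrite the chosen level's clean-image estimate with the input photo, so the fused zoom stack, and hence every other level, is pulled toward the real image at each sampling step.

    def constrain_to_photo(x0_estimates, level, photo):
        # Anchor one zoom level to a real input image before blending;
        # multi-resolution blending then propagates its content to the
        # overlapping regions of the other levels.
        x0 = list(x0_estimates)
        x0[level] = photo
        return x0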



Diversity

By varying the seed, we can get different results for the same set of input prompts.
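Concretely, only the initial noise changes between runs. A minimal sketch, assuming Stable Diffusion-sized latents and a hypothetical sample_zoom_stack wrapper around the joint step above:

    import torch

    num_levels, latent_shape = 4, (1, 4, 64, 64)  # illustrative sizes

    for seed in (0, 1, 2):
        g = torch.Generator().manual_seed(seed)
        # A different seed gives different initial noise at every zoom
        # level, and hence a different scene for the same prompts.
        z_T = [torch.randn(latent_shape, generator=g)
               for _ in range(num_levels)]
        # video = sample_zoom_stack(z_T, prompts)  # hypothetical sampler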







Baseline Comparisons

Another way to generate a zooming video is to either (1) progressively super-resolve a zoomed-out image with a text-conditioned super-resolution model, or (2) progressively outpaint a zoomed-in image with a text-conditioned outpainting model. Here we compare against these two variants, using Stable Diffusion's super-resolution and outpainting models. We observe that such causal generation typically produces inferior results, since earlier generations are not always compatible with subsequent zoom levels. A sketch of the outpainting variant is shown after the comparisons below.

Left to right: Stable Diffusion Super-Resolution, Stable Diffusion Outpainting, Ours.

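As an illustration of baseline (2), here is a hedged sketch of progressive outpainting using Hugging Face diffusers; the model id, zoom factor, canvas size, and coarser_prompts list are illustrative assumptions, not the exact baseline configuration we evaluated.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    # Illustrative model id; any text-conditioned inpainting model works.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    def zoom_out_step(image, prompt, factor=2, size=512):
        # Shrink the current image into the center of a blank canvas...
        small = image.resize((size // factor, size // factor))
        canvas = Image.new("RGB", (size, size))
        offset = (size - size // factor) // 2
        canvas.paste(small, (offset, offset))
        # ...and repaint everything outside it (white mask = regenerate).
        mask = Image.new("L", (size, size), 255)
        mask.paste(Image.new("L", small.size, 0), (offset, offset))
        return pipe(prompt=prompt, image=canvas, mask_image=mask).images[0]

    image = Image.open("macro_shot.png")   # most zoomed-in level
    for prompt in coarser_prompts:         # prompts ordered fine to coarse
        # Each step is causal: content generated earlier cannot be revised
        # to match later, coarser zoom levels.
        image = zoom_out_step(image, prompt)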

Acknowledgements

This project is inspired by the 1977 film Powers of Ten, which originally showcased this type of continuous zoom effect. Our goal is to create similar animations automatically with a generative model, and to enable the creation of such zoom videos from our own photos.

We also would like to thank Ben Poole, Jon Barron, Luyang Zhu, Ruiqi Gao, Tong He, Grace Luo, Angjoo Kanazawa, Vickie Ye, Songwei Ge, Keunhong Park, and David Salesin for helpful discussions and feedback.

BibTeX


    @article{wang2023generativepowers,
      title={Generative Powers of Ten},
      author={Xiaojuan Wang and Janne Kontkanen and Brian Curless and Steve Seitz and Ira Kemelmacher-Shlizerman
              and Ben Mildenhall and Pratul Srinivasan and Dor Verbin and Aleksander Holynski},
      journal={arXiv preprint arXiv:2312.02149},
      year={2023}
    }