Deep generative models have matured to the point where they are transforming the way we create visual content. We explore powerful generative approaches such as Invertible Neural Networks, autoregressive Transformers, and Diffusion Models, and we investigate their specific limitations to develop novel strategies that unleash the full potential of these architectures. Among other results, this has led to latent approaches such as VQGAN and Stable Diffusion and to the disentanglement of shape and appearance in the Variational U-Net (VUNet). Our long-standing goal is to develop algorithms that make images accessible on a semantic level, simplifying our interaction with computers and democratizing the availability of this enabling technology.
Selected Publications
2022
Blattmann, Andreas; Rombach, Robin; Oktay, Kaan; Ommer, Björn
Retrieval-Augmented Diffusion Models Conference
Neural Information Processing Systems (NeurIPS), 2022.
@conference{blattmann2022retrieval,
title = {Retrieval-Augmented Diffusion Models},
author = {Andreas Blattmann and Robin Rombach and Kaan Oktay and Björn Ommer},
url = {https://arxiv.org/abs/2204.11824},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Neural Information Processing Systems (NeurIPS)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Rombach, Robin; Blattmann, Andreas; Ommer, Björn
Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models Conference
Proceedings of the European Conference on Computer Vision (ECCV) Workshop on Visart, 2022.
@conference{rombach2022textguided,
title = {Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models},
author = {Robin Rombach and Andreas Blattmann and Björn Ommer},
url = {https://arxiv.org/abs/2207.13038},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) Workshop on Visart},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn
High-Resolution Image Synthesis with Latent Diffusion Models Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
@conference{rombach2022latentdiffusion,
title = {High-Resolution Image Synthesis with Latent Diffusion Models},
author = {Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
url = {https://ommer-lab.com/research/latent-diffusion-models/
https://github.com/CompVis/latent-diffusion
https://arxiv.org/abs/2112.10752},
year = {2022},
date = {2022-01-01},
urldate = {2022-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
2021
Esser, Patrick; Rombach, Robin; Blattmann, Andreas; Ommer, Björn
ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis Conference
Neural Information Processing Systems (NeurIPS), 2021.
@conference{nokey,
title = {ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis},
author = {Patrick Esser and Robin Rombach and Andreas Blattmann and Björn Ommer},
url = {https://compvis.github.io/imagebart/
https://arxiv.org/abs/2108.08827},
year = {2021},
date = {2021-12-10},
urldate = {2021-12-10},
booktitle = {Neural Information Processing Systems (NeurIPS)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Blattmann, Andreas; Milbich, Timo; Dorkenwald, Michael; Ommer, Björn
iPOKE: Poking a Still Image for Controlled Stochastic Video Synthesis Conference
Proceedings of the International Conference on Computer Vision (ICCV), 2021.
@conference{Blattmann2021,
title = {iPOKE: Poking a Still Image for Controlled Stochastic Video Synthesis},
author = {Andreas Blattmann and Timo Milbich and Michael Dorkenwald and Björn Ommer},
url = {https://compvis.github.io/ipoke/
https://arxiv.org/abs/2107.02790},
year = {2021},
date = {2021-10-01},
urldate = {2021-10-01},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Dorkenwald, Michael; Milbich, Timo; Blattmann, Andreas; Rombach, Robin; Derpanis, Konstantinos G.; Ommer, Björn
Stochastic Image-to-Video Synthesis using cINNs Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
@conference{7053,
title = {Stochastic Image-to-Video Synthesis using cINNs},
author = {Michael Dorkenwald and Timo Milbich and Andreas Blattmann and Robin Rombach and Konstantinos G. Derpanis and Björn Ommer},
url = {https://compvis.github.io/image2video-synthesis-using-cINNs/
https://arxiv.org/abs/2105.04551},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Blattmann, Andreas; Milbich, Timo; Dorkenwald, Michael; Ommer, Björn
Behavior-Driven Synthesis of Human Dynamics Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
@conference{7044,
title = {Behavior-Driven Synthesis of Human Dynamics},
author = {Andreas Blattmann and Timo Milbich and Michael Dorkenwald and Björn Ommer},
url = {https://compvis.github.io/behavior-driven-video-synthesis/
https://arxiv.org/abs/2103.04677},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Esser, Patrick; Rombach, Robin; Ommer, Björn
Taming Transformers for High-Resolution Image Synthesis Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
@conference{7028,
title = {Taming Transformers for High-Resolution Image Synthesis},
author = {Patrick Esser and Robin Rombach and Björn Ommer},
url = {https://compvis.github.io/taming-transformers/
https://arxiv.org/abs/2012.09841},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Rombach, Robin; Esser, Patrick; Ommer, Björn
Geometry-Free View Synthesis: Transformers and no 3D Priors Conference
Proceedings of the International Conference on Computer Vision (ICCV), 2021.
@conference{7067,
title = {Geometry-Free View Synthesis: Transformers and no 3D Priors},
author = {Robin Rombach and Patrick Esser and Björn Ommer},
url = {https://compvis.github.io/geometry-free-view-synthesis/
https://arxiv.org/abs/2104.07652},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Blattmann, Andreas; Milbich, Timo; Dorkenwald, Michael; Ommer, Björn
Understanding Object Dynamics for Interactive Image-to-Video Synthesis Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
@conference{7063,
title = {Understanding Object Dynamics for Interactive Image-to-Video Synthesis},
author = {Andreas Blattmann and Timo Milbich and Michael Dorkenwald and Björn Ommer},
url = {https://compvis.github.io/interactive-image2video-synthesis/
https://arxiv.org/abs/2106.11303v1},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {What would be the effect of locally poking a static scene? We present an approach that learns naturally-looking global articulations caused by a local manipulation at a pixel level. Training requires only videos of moving objects but no information of the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns about the interrelations between different object body regions. Given a static image of an object and a local poking of a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
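The abstract above specifies the interface of interactive image-to-video synthesis: a single image plus a local poke (a pixel location and a displacement) is mapped to a short video of the resulting deformation. The following is a minimal, hypothetical sketch of a model exposing that interface; the encoder, recurrence, and decoder are illustrative stand-ins and not the architecture of the paper.

# Hypothetical sketch of a poke-conditioned video predictor; the encoder,
# recurrence and decoder below are illustrative stand-ins, not the paper's model.
import torch
import torch.nn as nn

class PokeToVideo(nn.Module):
    def __init__(self, hidden=128, n_frames=8):
        super().__init__()
        self.n_frames = n_frames
        # image encoder: RGB image -> feature vector
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        # poke = (x, y, dx, dy): where the pixel is and how it is dragged
        self.poke_mlp = nn.Linear(4, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        # decoder: hidden state -> one RGB frame (toy resolution)
        self.dec = nn.Sequential(nn.Linear(hidden, 3 * 32 * 32), nn.Tanh())

    def forward(self, image, poke):
        h = self.enc(image) + self.poke_mlp(poke)   # fuse image and poke
        state, frames = torch.zeros_like(h), []
        for _ in range(self.n_frames):
            state = self.rnn(h, state)
            frames.append(self.dec(state).view(-1, 3, 32, 32))
        return torch.stack(frames, dim=1)           # (B, T, 3, 32, 32)

model = PokeToVideo()
video = model(torch.randn(1, 3, 64, 64), torch.tensor([[0.3, 0.7, 0.1, -0.2]]))
print(video.shape)  # torch.Size([1, 8, 3, 32, 32])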
Brattoli, Biagio; Büchler, Uta; Dorkenwald, Michael; Reiser, Philipp; Filli, Linard; Helmchen, Fritjof; Wahl, Anna-Sophia; Ommer, Björn
Unsupervised behaviour analysis and magnification (uBAM) using deep learning Journal Article
In: Nature Machine Intelligence, 2021.
@article{7045,
title = {Unsupervised behaviour analysis and magnification (uBAM) using deep learning},
author = {Biagio Brattoli and Uta Büchler and Michael Dorkenwald and Philipp Reiser and Linard Filli and Fritjof Helmchen and Anna-Sophia Wahl and Björn Ommer},
url = {https://utabuechler.github.io/behaviourAnalysis/
https://rdcu.be/ch6pL},
doi = {10.1038/s42256-021-00326-x},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
journal = {Nature Machine Intelligence},
abstract = {Motor behaviour analysis is essential to biomedical research and clinical diagnostics as it provides a non-invasive strategy for identifying motor impairment and its change caused by interventions. State-of-the-art instrumented movement analysis is time- and cost-intensive, because it requires the placement of physical or virtual markers. As well as the effort required for marking the keypoints or annotations necessary for training or fine-tuning a detector, users need to know the interesting behaviour beforehand to provide meaningful keypoints. Here, we introduce unsupervised behaviour analysis and magnification (uBAM), an automatic deep learning algorithm for analysing behaviour by discovering and magnifying deviations. A central aspect is unsupervised learning of posture and behaviour representations to enable an objective comparison of movement. Besides discovering and quantifying deviations in behaviour, we also propose a generative model for visually magnifying subtle behaviour differences directly in a video without requiring a detour via keypoints or annotations. Essential for this magnification of deviations, even across different individuals, is a disentangling of appearance and behaviour. Evaluations on rodents and human patients with neurological diseases demonstrate the wide applicability of our approach. Moreover, combining optogenetic stimulation with our unsupervised behaviour analysis shows its suitability as a non-invasive diagnostic tool correlating function to brain plasticity.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Jahn, Manuel; Rombach, Robin; Ommer, Björn
High-Resolution Complex Scene Synthesis with Transformers Conference
CVPR 2021, AI for Content Creation Workshop, 2021.
@conference{7054,
title = {High-Resolution Complex Scene Synthesis with Transformers},
author = {Manuel Jahn and Robin Rombach and Björn Ommer},
url = {https://compvis.github.io/taming-transformers/
https://arxiv.org/abs/2105.06458},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {CVPR 2021, AI for Content Creation Workshop},
abstract = {The use of coarse-grained layouts for controllable synthesis of complex scene images via deep generative models has recently gained popularity. However, results of current approaches still fall short of their promise of high-resolution synthesis. We hypothesize that this is mostly due to the highly engineered nature of these approaches which often rely on auxiliary losses and intermediate steps such as mask generators. In this note, we present an orthogonal approach to this task, where the generative model is based on pure likelihood training without additional objectives. To do so, we first optimize a powerful compression model with adversarial training which learns to reconstruct its inputs via a discrete latent bottleneck and thereby effectively strips the latent representation of high-frequency details such as texture. Subsequently, we train an autoregressive transformer model to learn the distribution of the discrete image representations conditioned on a tokenized version of the layouts. Our experiments show that the resulting system is able to synthesize high-quality images consistent with the given layouts. In particular, we improve the state-of-the-art FID score on COCO-Stuff and on Visual Genome by up to 19% and 53% and demonstrate the synthesis of images up to 512 x 512 px on COCO and Open Images.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
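The abstract above outlines a two-stage pipeline: an adversarially trained compression model produces discrete image tokens, and an autoregressive transformer then models those tokens conditioned on a tokenized layout. The sketch below illustrates only the second stage under assumed vocabulary and sequence sizes; the tokenizers, dimensions, and training details are placeholders, not the published implementation.

# Sketch of stage 2: an autoregressive transformer over discrete image tokens,
# conditioned on a tokenized layout. Vocabulary sizes, sequence lengths and the
# tokenizers themselves are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

V_IMG, V_LAYOUT, L_IMG, L_LAYOUT, D = 1024, 256, 256, 32, 256

class LayoutToImageTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(V_IMG + V_LAYOUT, D)   # shared embedding table
        self.pos_emb = nn.Parameter(torch.zeros(1, L_LAYOUT + L_IMG, D))
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D, V_IMG)

    def forward(self, layout_tokens, image_tokens):
        # prepend layout tokens (offset into their own id range), predict image tokens left-to-right
        x = torch.cat([layout_tokens + V_IMG, image_tokens], dim=1)
        h = self.tok_emb(x) + self.pos_emb[:, : x.size(1)]
        n = x.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)  # causal mask
        h = self.blocks(h, mask=mask)
        # logits for each image-token position, shifted by one
        logits = self.head(h[:, L_LAYOUT - 1 : -1])
        return F.cross_entropy(logits.reshape(-1, V_IMG), image_tokens.reshape(-1))

model = LayoutToImageTransformer()
loss = model(torch.randint(0, V_LAYOUT, (2, L_LAYOUT)),
             torch.randint(0, V_IMG, (2, L_IMG)))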
Afifi, Mahmoud; Derpanis, Konstantinos G; Ommer, Björn; Brown, Michael S
Learning Multi-Scale Photo Exposure Correction Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
@conference{6992,
title = {Learning Multi-Scale Photo Exposure Correction},
author = {Mahmoud Afifi and Konstantinos G Derpanis and Björn Ommer and Michael S Brown},
url = {https://github.com/mahmoudnafifi/Exposure_Correction
https://arxiv.org/abs/2003.11596},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Kotovenko, Dmytro; Wright, Matthias; Heimbrecht, Arthur; Ommer, Björn
Rethinking Style Transfer: From Pixels to Parameterized Brushstrokes Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
@conference{7041,
title = {Rethinking Style Transfer: From Pixels to Parameterized Brushstrokes},
author = {Dmytro Kotovenko and Matthias Wright and Arthur Heimbrecht and Björn Ommer},
url = {https://compvis.github.io/brushstroke-parameterized-style-transfer/
https://arxiv.org/abs/2103.17185},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {There have been many successful implementations of neural style transfer in recent years. In most of these works, the stylization process is confined to the pixel domain. However, we argue that this representation is unnatural because paintings usually consist of brushstrokes rather than pixels. We propose a method to stylize images by optimizing parameterized brushstrokes instead of pixels and further introduce a simple differentiable rendering mechanism. Our approach significantly improves visual quality and enables additional control over the stylization process, such as controlling the flow of brushstrokes through user input. We provide qualitative and quantitative evaluations that show the efficacy of the proposed parameterized representation.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
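To make the idea of optimizing parameterized brushstrokes through a differentiable renderer concrete, here is a deliberately simplified toy sketch: strokes are soft Gaussian blobs with position, width, and color, rendered differentiably and fitted to a target image by gradient descent. The renderer and the plain MSE objective are stand-ins for the brushstroke model and the style losses used in the paper.

# Toy sketch: optimize brushstroke parameters through a differentiable renderer.
# Strokes are modeled as soft Gaussian blobs; real brushstrokes and the style
# losses of the paper are far richer -- this only illustrates the optimization loop.
import torch

H = W = 64
N_STROKES = 64
yy, xx = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")

pos = torch.rand(N_STROKES, 2, requires_grad=True)             # stroke centers in [0, 1]^2
width = torch.full((N_STROKES, 1), -3.0, requires_grad=True)   # log-width
color = torch.rand(N_STROKES, 3, requires_grad=True)           # RGB per stroke

def render():
    # soft alpha map per stroke, composited additively and clamped to [0, 1]
    d2 = (xx[None] - pos[:, 0, None, None]) ** 2 + (yy[None] - pos[:, 1, None, None]) ** 2
    alpha = torch.exp(-d2 / torch.exp(width)[:, :, None] ** 2)   # (N, H, W)
    canvas = (alpha[:, None] * color[:, :, None, None]).sum(0)   # (3, H, W)
    return canvas.clamp(0, 1)

target = torch.rand(3, H, W)   # stand-in for a stylization target
opt = torch.optim.Adam([pos, width, color], lr=5e-2)
for step in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(render(), target)
    loss.backward()
    opt.step()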
2020
Dorkenwald, Michael; Büchler, Uta; Ommer, Björn
Unsupervised Magnification of Posture Deviations Across Subjects Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
@conference{7042,
title = {Unsupervised Magnification of Posture Deviations Across Subjects},
author = {Michael Dorkenwald and Uta Büchler and Björn Ommer},
url = {https://compvis.github.io/magnify-posture-deviations/
https://openaccess.thecvf.com/content_CVPR_2020/papers/Dorkenwald_Unsupervised_Magnification_of_Posture_Deviations_Across_Subjects_CVPR_2020_paper.pdf},
year = {2020},
date = {2020-01-01},
urldate = {2020-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Esser, Patrick; Rombach, Robin; Ommer, Björn
A Note on Data Biases in Generative Models Conference
NeurIPS 2020 Workshop on Machine Learning for Creativity and Design, 2020.
@conference{7025,
title = {A Note on Data Biases in Generative Models},
author = {Patrick Esser and Robin Rombach and Björn Ommer},
url = {https://neurips2020creativity.github.io/
https://arxiv.org/abs/2012.02516},
year = {2020},
date = {2020-01-01},
urldate = {2020-01-01},
booktitle = {NeurIPS 2020 Workshop on Machine Learning for Creativity and Design},
abstract = {It is tempting to think that machines are less prone to unfairness and prejudice. However, machine learning approaches compute their outputs based on data. While biases can enter at any stage of the development pipeline, models are particularly receptive to mirror biases of the datasets they are trained on and therefore do not necessarily reflect truths about the world but, primarily, truths about the data. To raise awareness about the relationship between modern algorithms and the data that shape them, we use a conditional invertible neural network to disentangle the dataset-specific information from the information which is shared across different datasets. In this way, we can project the same image onto different datasets, thereby revealing their inherent biases. We use this methodology to (i) investigate the impact of dataset quality on the performance of generative models, (ii) show how societal biases of datasets are replicated by generative models, and (iii) present creative applications through unpaired transfer between diverse datasets such as photographs, oil portraits, and animes.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Braun, Sandro; Esser, Patrick; Ommer, Björn
Unsupervised Part Discovery by Unsupervised Disentanglement Conference
Proceedings of the German Conference on Pattern Recognition (GCPR) (Oral), Tübingen, 2020.
@conference{7004,
title = {Unsupervised Part Discovery by Unsupervised Disentanglement},
author = {Sandro Braun and Patrick Esser and Björn Ommer},
url = {https://compvis.github.io/unsupervised-part-segmentation/
https://arxiv.org/abs/2009.04264},
year = {2020},
date = {2020-01-01},
urldate = {2020-01-01},
booktitle = {Proceedings of the German Conference on Pattern Recognition (GCPR) (Oral)},
address = {Tübingen},
abstract = {We address the problem of discovering part segmentations of articulated objects without supervision. In contrast to keypoints, part segmentations provide information about part localizations on the level of individual pixels. Capturing both locations and semantics, they are an attractive target for supervised learning approaches. However, large annotation costs limit the scalability of supervised algorithms to other object categories than humans. Unsupervised approaches potentially allow to use much more data at a lower cost. Most existing unsupervised approaches focus on learning abstract representations to be refined with supervision into the final representation. Our approach leverages a generative model consisting of two disentangled representations for an object's shape and appearance and a latent variable for the part segmentation. From a single image, the trained model infers a semantic part segmentation map. In experiments, we compare our approach to previous state-of-the-art approaches and observe significant gains in segmentation accuracy and shape consistency. Our work demonstrates the feasibility to discover semantic part segmentations without supervision.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Esser, Patrick; Rombach, Robin; Ommer, Björn
A Disentangling Invertible Interpretation Network for Explaining Latent Representations Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
@conference{6932,
title = {A Disentangling Invertible Interpretation Network for Explaining Latent Representations},
author = {Patrick Esser and Robin Rombach and Björn Ommer},
url = {https://compvis.github.io/iin/
https://arxiv.org/abs/2004.13166},
year = {2020},
date = {2020-01-01},
urldate = {2020-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {Neural networks have greatly boosted performance in computer vision by learning powerful representations of input data. The drawback of end-to-end training for maximal overall performance are black-box models whose hidden representations are lacking interpretability: Since distributed coding is optimal for latent layers to improve their robustness, attributing meaning to parts of a hidden feature vector or to individual neurons is hindered. We formulate interpretation as a translation of hidden representations onto semantic concepts that are comprehensible to the user. The mapping between both domains has to be bijective so that semantic modifications in the target domain correctly alter the original representation. The proposed invertible interpretation network can be transparently applied on top of existing architectures with no need to modify or retrain them. Consequently, we translate an original representation to an equivalent yet interpretable one and backwards without affecting the expressiveness and performance of the original. The invertible interpretation network disentangles the hidden representation into separate, semantically meaningful concepts. Moreover, we present an efficient approach to define semantic concepts by only sketching two images and also an unsupervised strategy. Experimental evaluation demonstrates the wide applicability to interpretation of existing classification and image generation networks as well as to semantically guided image manipulation.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
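The bijective translation of hidden representations described above is typically built from invertible coupling blocks. The sketch below shows a single affine coupling layer of that kind; the dimensionality and the two-way split are illustrative assumptions, and stacking and factorizing such blocks into semantic concepts is where the actual method lives.

# Minimal affine coupling block: an invertible mapping z <-> z_tilde that can be
# stacked to translate a hidden representation into factorized variables.
# The dimensionality and the two-factor split are assumptions for illustration.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=128, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),   # predicts scale and shift for the second half
        )

    def forward(self, z):
        z1, z2 = z[:, : self.half], z[:, self.half :]
        log_s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, z2 * torch.exp(log_s) + t], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, : self.half], y[:, self.half :]
        log_s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=1)

block = AffineCoupling()
z = torch.randn(4, 128)
z_tilde = block(z)                                            # representation -> factors
print(torch.allclose(block.inverse(z_tilde), z, atol=1e-5))   # True: the map is bijective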
Rombach, Robin; Esser, Patrick; Ommer, Björn
Network Fusion for Content Creation with Conditional INNs Conference
CVPRW 2020 (AI for Content Creation), 2020.
@conference{7012,
title = {Network Fusion for Content Creation with Conditional INNs},
author = {Robin Rombach and Patrick Esser and Björn Ommer},
url = {https://compvis.github.io/network-fusion/
https://arxiv.org/abs/2005.13580},
year = {2020},
date = {2020-01-01},
urldate = {2020-01-01},
booktitle = {CVPRW 2020 (AI for Content Creation)},
abstract = {Artificial Intelligence for Content Creation has the potential to reduce the amount of manual content creation work significantly. While automation of laborious work is welcome, it is only useful if it allows users to control aspects of the creative process when desired. Furthermore, widespread adoption of semi-automatic content creation depends on low barriers regarding the expertise, computational budget and time required to obtain results and experiment with new techniques. With state-of-the-art approaches relying on task-specific models, multi-GPU setups and weeks of training time, we must find ways to reuse and recombine them to meet these requirements. Instead of designing and training methods for controllable content creation from scratch, we thus present a method to repurpose powerful, existing models for new tasks, even though they have never been designed for them. We formulate this problem as a translation between expert models, which includes common content creation scenarios, such as text-to-image and image-to-image translation, as a special case. As this translation is ambiguous, we learn a generative model of hidden representations of one expert conditioned on hidden representations of the other expert. Working on the level of hidden representations makes optimal use of the computational effort that went into the training of the expert model to produce these efficient, low-dimensional representations. Experiments demonstrate that our approach can translate from BERT, a state-of-the-art expert for text, to BigGAN, a state-of-the-art expert for images, to enable text-to-image generation, which neither of the experts can perform on its own. Additional experiments show the wide applicability of our approach across different conditional image synthesis tasks and improvements over existing methods for image modifications.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Rombach, Robin; Esser, Patrick; Ommer, Björn
Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs Conference
European Conference on Computer Vision (ECCV), 2020.
@conference{6997,
title = {Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs},
author = {Robin Rombach and Patrick Esser and Björn Ommer},
url = {https://compvis.github.io/invariances/
https://arxiv.org/pdf/2008.01777.pdf},
year = {2020},
date = {2020-01-01},
urldate = {2020-01-01},
booktitle = {European Conference on Computer Vision (ECCV)},
abstract = {To tackle increasingly complex tasks, it has become an essential ability of neural networks to learn abstract representations. These task-specific representations and, particularly, the invariances they capture turn neural networks into black box models that lack interpretability. To open such a black box, it is, therefore, crucial to uncover the different semantic concepts a model has learned as well as those that it has learned to be invariant to. We present an approach based on INNs that (i) recovers the task-specific, learned invariances by disentangling the remaining factor of variation in the data and that (ii) invertibly transforms these recovered invariances combined with the model representation into an equally expressive one with accessible semantic concepts. As a consequence, neural network representations become understandable by providing the means to (i) expose their semantic meaning, (ii) semantically modify a representation, and (iii) visualize individual learned semantic concepts and invariances. Our invertible approach significantly extends the abilities to understand black box models by enabling post-hoc interpretations of state-of-the-art networks without compromising their performance.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Rombach, Robin; Esser, Patrick; Ommer, Björn
Network-to-Network Translation with Conditional Invertible Neural Networks Conference
Neural Information Processing Systems (NeurIPS) (Oral), 2020.
@conference{7011,
title = {Network-to-Network Translation with Conditional Invertible Neural Networks},
author = {Robin Rombach and Patrick Esser and Björn Ommer},
url = {https://compvis.github.io/net2net/
https://arxiv.org/abs/2005.13580},
year = {2020},
date = {2020-01-01},
urldate = {2020-01-01},
booktitle = {Neural Information Processing Systems (NeurIPS) (Oral)},
abstract = {Combining stimuli from diverse modalities into a coherent perception is a striking feat of intelligence of evolved brains. This work seeks its analogy in deep learning models and aims to establish relations between existing networks by faithfully combining the representations of these different domains. Therefore, we seek a model that can relate between different existing representations by learning a conditionally invertible mapping between them. The network demonstrates this capability by (i) providing generic transfer between diverse domains, (ii) enabling controlled content synthesis by allowing modification in other domains, and (iii) facilitating diagnosis of existing representations by translating them into an easily accessible domain. Our domain transfer network can translate between fixed representations without having to learn or finetune them. This allows users to utilize various existing domain-specific expert models from the literature that had been trained with extensive computational resources. Experiments on diverse conditional image synthesis tasks, competitive image modification results and experiments on image-to-image and text-to-image generation demonstrate the generic applicability of our approach. In particular, we translate between BERT and BigGAN, state-of-the-art text and image models to provide text-to-image generation, which neither of both experts can perform on their own.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
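The conditionally invertible mapping described above can be sketched by extending an affine coupling block so that its scale and shift networks also receive a conditioning vector. Everything below (sizes, names, the single block) is an illustrative assumption; in the paper's text-to-image setting the condition would come from a frozen language expert and samples in the latent space would be decoded by a frozen image expert.

# Sketch of a conditional coupling block: the scale/shift network also sees a
# conditioning vector c (e.g. a frozen text embedding), so the invertible map
# z <-> z_tilde is bijective in z for every fixed c. Sizes are illustrative.
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    def __init__(self, dim=128, cond_dim=64, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, c):
        z1, z2 = z[:, : self.half], z[:, self.half :]
        log_s, t = self.net(torch.cat([z1, c], dim=1)).chunk(2, dim=1)
        return torch.cat([z1, z2 * torch.exp(log_s) + t], dim=1)

    def inverse(self, y, c):
        y1, y2 = y[:, : self.half], y[:, self.half :]
        log_s, t = self.net(torch.cat([y1, c], dim=1)).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=1)

block = ConditionalCoupling()
z, c = torch.randn(4, 128), torch.randn(4, 64)
print(torch.allclose(block.inverse(block(z, c), c), z, atol=1e-5))  # True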
2019
Lorenz, Dominik; Bereska, Leonard; Milbich, Timo; Ommer, Björn
Unsupervised Part-Based Disentangling of Object Shape and Appearance Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Oral + Best paper finalist: top 45 / 5160 submissions), 2019.
@conference{6301,
title = {Unsupervised Part-Based Disentangling of Object Shape and Appearance},
author = {Dominik Lorenz and Leonard Bereska and Timo Milbich and Björn Ommer},
url = {https://compvis.github.io/unsupervised-disentangling/
https://arxiv.org/abs/1903.06946},
year = {2019},
date = {2019-01-01},
urldate = {2019-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Oral + Best paper finalist: top 45 / 5160 submissions)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Esser, Patrick; Haux, Johannes; Ommer, Björn
Unsupervised Robust Disentangling of Latent Characteristics for Image Synthesis Conference
Proceedings of the International Conference on Computer Vision (ICCV), 2019.
@conference{6323,
title = {Unsupervised Robust Disentangling of Latent Characteristics for Image Synthesis},
author = {Patrick Esser and Johannes Haux and Björn Ommer},
url = {https://compvis.github.io/robust-disentangling/
https://arxiv.org/abs/1910.10223},
year = {2019},
date = {2019-01-01},
urldate = {2019-01-01},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
abstract = {Deep generative models come with the promise to learn an explainable representation for visual objects that allows image sampling, synthesis, and selective modification. The main challenge is to learn to properly model the independent latent characteristics of an object, especially its appearance and pose. We present a novel approach that learns disentangled representations of these characteristics and explains them individually. Training requires only pairs of images depicting the same object appearance, but no pose annotations. We propose an additional classifier that estimates the minimal amount of regularization required to enforce disentanglement. Thus both representations together can completely explain an image while being independent of each other. Previous methods based on adversarial approaches fail to enforce this independence, while methods based on variational approaches lead to uninformative representations. In experiments on diverse object categories, the approach successfully recombines pose and appearance to reconstruct and retarget novel synthesized images. We achieve significant improvements over state-of-the-art methods which utilize the same level of supervision, and reach performances comparable to those of pose-supervised approaches. However, we can handle the vast body of articulated object classes for which no pose models/annotations are available.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Kotovenko, Dmytro; Sanakoyeu, Artsiom; Lang, Sabine; Ommer, Björn
Content and Style Disentanglement for Artistic Style Transfer Conference
Proceedings of the International Conference on Computer Vision (ICCV), 2019.
@conference{6322,
title = {Content and Style Disentanglement for Artistic Style Transfer},
author = {Dmytro Kotovenko and Artsiom Sanakoyeu and Sabine Lang and Björn Ommer},
url = {https://compvis.github.io/content-style-disentangled-ST/
https://compvis.github.io/content-style-disentangled-ST/paper.pdf},
year = {2019},
date = {2019-01-01},
urldate = {2019-01-01},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Kotovenko, Dmytro; Sanakoyeu, A.; Lang, Sabine; Ma, P.; Ommer, Björn
Using a Transformation Content Block For Image Style Transfer Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
@conference{6300,
title = {Using a Transformation Content Block For Image Style Transfer},
author = {Dmytro Kotovenko and A. Sanakoyeu and Sabine Lang and P. Ma and Björn Ommer},
url = {https://compvis.github.io/content-targeted-style-transfer/
https://arxiv.org/abs/2003.08407},
year = {2019},
date = {2019-01-01},
urldate = {2019-01-01},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
2018
Esser, Patrick; Sutter, Ekaterina; Ommer, Björn
A Variational U-Net for Conditional Appearance and Shape Generation Conference
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (short Oral), 2018.
@conference{6249,
title = {A Variational U-Net for Conditional Appearance and Shape Generation},
author = {Patrick Esser and Ekaterina Sutter and Björn Ommer},
url = {https://compvis.github.io/vunet/
https://arxiv.org/abs/1804.04694},
year = {2018},
date = {2018-01-01},
urldate = {2018-01-02},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (short Oral)},
abstract = {Deep generative models have demonstrated great performance in image synthesis. However, results deteriorate in case of spatial deformations, since they generate images of objects directly, rather than modeling the intricate interplay of their inherent shape and appearance. We present a conditional U-Net for shape-guided image generation, conditioned on the output of a variational autoencoder for appearance. The approach is trained end-to-end on images, without requiring samples of the same object with varying pose or appearance. Experiments show that the model enables conditional image generation and transfer. Therefore, either shape or appearance can be retained from a query image, while freely altering the other. Moreover, appearance can be sampled due to its stochastic latent representation, while preserving shape. In quantitative and qualitative experiments on COCO, DeepFashion, shoes, Market-1501 and handbags, the approach demonstrates significant improvements over the state-of-the-art.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
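The abstract above describes a U-Net that is conditioned on shape while appearance enters through a variational latent code. A compact sketch of that structure follows; resolutions, channel counts, and the single skip connection are toy choices and not the published architecture.

# Compact sketch of a shape-conditioned U-Net with a variational appearance code:
# the U-Net sees a shape map (e.g. an edge or pose rendering), the appearance
# encoder produces (mu, logvar), and a sampled z is injected at the bottleneck.
# Channel counts, depths and the 64x64 resolution are illustrative assumptions.
import torch
import torch.nn as nn

class ShapeAppearanceUNet(nn.Module):
    def __init__(self, z_dim=64):
        super().__init__()
        # appearance encoder: RGB image -> (mu, logvar)
        self.app_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 2 * z_dim),
        )
        # shape branch of the U-Net (one downsampling stage, one skip connection)
        self.down = nn.Sequential(nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU())   # 64 -> 32
        self.bottleneck = nn.Sequential(nn.Conv2d(32 + z_dim, 64, 3, 1, 1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(64 + 32, 32, 4, 2, 1), nn.ReLU(),
                                nn.Conv2d(32, 3, 3, 1, 1), nn.Tanh())

    def forward(self, shape_map, appearance_img):
        mu, logvar = self.app_enc(appearance_img).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        d = self.down(shape_map)                                  # (B, 32, 32, 32)
        z_map = z[:, :, None, None].expand(-1, -1, d.size(2), d.size(3))
        b = self.bottleneck(torch.cat([d, z_map], dim=1))
        out = self.up(torch.cat([b, d], dim=1))                   # skip connection
        return out, mu, logvar

model = ShapeAppearanceUNet()
img, shape = torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64)
recon, mu, logvar = model(shape, img)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = torch.nn.functional.mse_loss(recon, img) + 1e-3 * kl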
Sanakoyeu, A.; Kotovenko, Dmytro; Lang, Sabine; Ommer, Björn
A Style-Aware Content Loss for Real-time HD Style Transfer Conference
Proceedings of the European Conference on Computer Vision (ECCV) (Oral), 2018.
@conference{style_aware_content_loss_eccv18,
title = {A Style-Aware Content Loss for Real-time HD Style Transfer},
author = {A. Sanakoyeu and Dmytro Kotovenko and Sabine Lang and Björn Ommer},
url = {https://compvis.github.io/adaptive-style-transfer/
https://arxiv.org/abs/1807.10201},
year = {2018},
date = {2018-01-01},
urldate = {2018-01-01},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) (Oral)},
abstract = {Recently, style transfer has received a lot of attention. While much of this research has aimed at speeding up processing, the approaches are still lacking from a principled, art historical standpoint: a style is more than just a single image or an artist, but previous work is limited to only a single instance of a style or shows no benefit from more images. Moreover, previous work has relied on a direct comparison of art in the domain of RGB images or on CNNs pre-trained on ImageNet, which requires millions of labeled object bounding boxes and can introduce an extra bias, since it has been assembled without artistic consideration. To circumvent these issues, we propose a style-aware content loss, which is trained jointly with a deep encoder-decoder network for real-time, high-resolution stylization of images and videos. We propose a quantitative measure for evaluating the quality of a stylized image and also have art historians rank patches from our approach against those from previous work. These and our qualitative results ranging from small image patches to megapixel stylistic images and videos show that our approach better captures the subtle nature in which a style affects content.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
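The core of the approach above is that content preservation is measured in the feature space of an encoder trained jointly with the stylization network, rather than in RGB or in an ImageNet-pretrained network. The sketch below isolates that loss term with placeholder encoder/decoder modules; on its own it would collapse, so the full method pairs it with an adversarial style loss.

# Sketch of a style-aware content loss: content is compared in the latent space
# of an encoder E that is trained jointly with the decoder/stylizer D, instead of
# in pixel space or in a fixed ImageNet network. E and D below are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

E = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(), nn.Conv2d(32, 64, 4, 2, 1))
D = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                  nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh())

def style_aware_content_loss(x):
    """Compare the input and its stylization in E's own feature space."""
    stylized = D(E(x))
    return F.mse_loss(E(stylized), E(x))

opt = torch.optim.Adam(list(E.parameters()) + list(D.parameters()), lr=1e-4)
x = torch.rand(2, 3, 64, 64)
loss = style_aware_content_loss(x)   # the full method adds an adversarial style loss
loss.backward()
opt.step()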
Blum, O.; Brattoli, Biagio; Ommer, Björn
X-GAN: Improving Generative Adversarial Networks with ConveX Combinations Conference
German Conference on Pattern Recognition (GCPR) (Oral), Stuttgart, Germany, 2018.
@conference{blum:GCPR:2018,
title = {X-GAN: Improving Generative Adversarial Networks with ConveX Combinations},
author = {O. Blum and Biagio Brattoli and Björn Ommer},
url = {https://ommer-lab.com/wp-content/uploads/2021/10/X-Gan_Improving-Generative-Adversarial-Networks-with-ConveX-Combinations.pdf
https://ommer-lab.com/wp-content/uploads/2021/10/xgan_supplementary.pdf},
year = {2018},
date = {2018-01-01},
urldate = {2018-01-01},
booktitle = {German Conference on Pattern Recognition (GCPR) (Oral)},
address = {Stuttgart, Germany},
abstract = {Even though recent neural architectures for image generation are capable of producing photo-realistic results, the overall distributions of real and faked images still differ a lot. While the lack of a structured latent representation for GANs often results in mode collapse, VAEs enforce a prior to the latent space that leads to an unnatural representation of the underlying real distribution. We introduce a method that preserves the natural structure of the latent manifold. By utilizing neighboring relations within the set of discrete real samples, we reproduce the full continuous latent manifold. We propose a novel image generation network X-GAN that creates latent input vectors from random convex combinations of adjacent real samples. This way we ensure a structured and natural latent space by not requiring prior assumptions. In our experiments, we show that our model outperforms recent approaches in terms of the missing mode problem while maintaining a high image quality.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
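The key mechanism described above is feeding the generator latent vectors formed as random convex combinations of embeddings of neighboring real samples instead of Gaussian noise. The sketch below shows only that sampling step; the embeddings, the neighborhood size, and the Dirichlet weights are illustrative assumptions.

# Sketch of X-GAN-style latent sampling: draw a real sample's embedding together
# with its k nearest neighbours and mix them with random convex weights.
# The embeddings and the Dirichlet concentration are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128))   # stand-in embeddings of real samples
k = 5

def knn_indices(e, k):
    sq = (e ** 2).sum(axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * e @ e.T   # pairwise squared distances
    return np.argsort(d, axis=1)[:, 1 : k + 1]      # skip self at position 0

nbrs = knn_indices(emb, k)

def sample_latents(batch_size):
    anchors = rng.integers(0, emb.shape[0], size=batch_size)
    group = np.concatenate([anchors[:, None], nbrs[anchors]], axis=1)   # (B, k+1)
    w = rng.dirichlet(np.ones(k + 1), size=batch_size)                  # convex weights
    return (w[:, :, None] * emb[group]).sum(axis=1)                     # (B, 128)

z = sample_latents(16)   # would replace Gaussian noise as the generator input
print(z.shape)           # (16, 128)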