World-Consistent Video-to-Video Synthesis
Video-to-video synthesis (vid2vid) is a powerful tool for converting high-level semantic inputs into photorealistic videos. However, while existing vid2vid methods can maintain short-term temporal consistency, they fail to ensure long-term consistency in their outputs. This is because they generate each frame based only on the past few frames and lack knowledge of the 3D world being generated. In this work, we propose a framework that utilizes all past generated frames when synthesizing each frame. This is achieved by condensing the 3D world generated so far into a physically-grounded estimate of the current frame, which we call the guidance image. We also propose a novel module that takes advantage of the information stored in the guidance images. Extensive experimental results on several challenging datasets verify the effectiveness of our method in achieving world consistency: the output video is consistent within the entire generated 3D world.
Using Adobe Acrobat Reader is highly recommended for playing the videos embedded in this submission.
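To make the idea of a guidance image more concrete, the sketch below shows one way a "physically-grounded estimate of the current frame" could be formed. It assumes the world generated so far is stored as a set of colored 3D points and that the current camera follows a pinhole model. The function name `render_guidance_image` and the parameters `K`, `R`, and `t` are hypothetical choices made for illustration; the abstract does not specify the representation or the projection procedure, so this should not be read as the method itself.

```python
import numpy as np

def render_guidance_image(points_world, colors, K, R, t, height, width):
    """Project colored 3D points into the current camera view.

    Illustrative assumption: the 3D world generated so far is stored as
    points_world (N, 3) with per-point colors (N, 3). K is a 3x3 pinhole
    intrinsic matrix; R (3x3) and t (3,) map world to camera coordinates.
    Returns the guidance image and a mask of pixels that received a color.
    """
    guidance = np.zeros((height, width, 3), dtype=np.float32)
    mask = np.zeros((height, width), dtype=bool)
    depth = np.full((height, width), np.inf, dtype=np.float32)

    # World -> camera coordinates: X_cam = R @ X_world + t.
    cam = points_world @ R.T + t

    # Discard points behind (or numerically on) the camera plane.
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]

    # Pinhole projection to pixel coordinates.
    proj = cam @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    # Simple z-buffer: the nearest point wins at each pixel.
    for ui, vi, zi, ci in zip(u[inside], v[inside], cam[inside, 2], colors[inside]):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
            guidance[vi, ui] = ci
            mask[vi, ui] = True
    return guidance, mask
```

Under these assumptions, pixels where the mask is true carry colors already committed in earlier frames, while the remaining pixels correspond to regions not yet seen, which a generator would have to synthesize from scratch.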