Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis
Long Zhuo, Guangcong Wang, Shikai Li, Wayne Wu, Ziwei Liu
"Video-to-Video synthesis (Vid2Vid) has achieved remarkable results on generating a photo-realistic video from a sequence of semantic maps. However, this pipeline suffers from high computational cost and long inference latency, which largely depends on two essential factors: 1) network architecture parameters, 2) sequential data stream. Recently, the parameters of image-based generative models have been significantly compressed via more efficient network architectures. Nevertheless, existing methods mainly focus on slimming network architectures and ignore the size of sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for the compression of video synthesis with a dynamic time series. In this paper, we present a spatial-temporal compression framework, Fast-Vid2Vid, which focuses on data aspects of generative models. It makes the first attempt at time dimension to reduce computational resources and accelerate inference. Specifically, we compress the input data stream spatially and reduce the temporal redundancy. After the proposed spatial-knowledge distillation, our model can synthesize key-frames using low-resolution data stream. Finally, Fast-Vid2Vid interpolates inter frames by motion compensation with negligible latency. On standard benchmarks, Fast-Vid2Vid achieves around real-time performance as 20 FPS and saves around 8× computational cost on a single V100 GPU."