MVSNet: Depth Inference for Unstructured Multi-view Stereo

Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, Long Quan; The European Conference on Computer Vision (ECCV), 2018, pp. 767-783


We present an end-to-end deep learning architecture for depth map inference from multi-view images. In the network, we first extract deep visual image features and then build the 3D cost volume upon the reference camera frustum via differentiable homography warping. Next, we apply 3D convolutions to regularize the cost volume and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts to arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature. The proposed method is demonstrated on the large-scale DTU dataset. With simple post-processing, MVSNet not only significantly outperforms previous state-of-the-art methods but is also several times faster at runtime. Finally, we demonstrate the generalization ability of MVSNet on the complex outdoor Tanks and Temples dataset, which was not used to train the network.
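The two cost-volume steps named in the abstract, the variance-based cost metric and depth regression, can be illustrated at a single pixel. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and it skips feature extraction, homography warping, and the 3D CNN regularization by assuming the warped features are already given. The cost follows the variance form C = (1/N) Σ (V_i − V̄)², so any number of views N collapses into one cost vector. Depth is regressed as the softmax-weighted expectation over depth hypotheses (a soft argmin). Applying softmax to the negated raw cost and averaging channels to a scalar are simplifications assumed here in place of the learned regularization network.

```python
import math
from typing import List, Sequence


def variance_cost(features: Sequence[Sequence[float]]) -> List[float]:
    """Per-channel variance across N warped views (one pixel, one depth).

    `features` holds N feature vectors of length C that are assumed to be
    already warped into the reference frame. Works for any N >= 1.
    """
    n = len(features)
    if n == 0:
        raise ValueError("need at least one view")
    cost = []
    for c in range(len(features[0])):
        mean = sum(f[c] for f in features) / n
        cost.append(sum((f[c] - mean) ** 2 for f in features) / n)
    return cost


def soft_argmin_depth(depths: Sequence[float], costs: Sequence[float]) -> float:
    """Expected depth under softmax(-cost) along the depth dimension.

    Assumption: lower cost means more likely; in the network the softmax
    is applied to the output of the learned 3D CNN instead of raw cost.
    """
    logits = [-c for c in costs]
    m = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return sum(d * w / total for d, w in zip(depths, weights))


# Usage: three views with 2-channel features at three depth hypotheses.
depths = [1.0, 2.0, 3.0]
views_per_depth = [
    [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]],  # views disagree -> high cost
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # views agree -> zero cost
    [[0.0, 0.0], [1.0, 3.0], [2.0, 0.0]],  # views disagree -> high cost
]
# Simplification: average the C channels into one scalar cost per depth.
costs = [sum(variance_cost(v)) / 2 for v in views_per_depth]
estimated_depth = soft_argmin_depth(depths, costs)
```

Here both outer hypotheses receive the same cost (4/3), so the expectation lands on the zero-cost hypothesis, 2.0, up to floating-point rounding. Because the regressed depth is a weighted average rather than a hard argmin, it is differentiable and can fall between sampled depth hypotheses.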

BibTeX

author = {Yao, Yao and Luo, Zixin and Li, Shiwei and Fang, Tian and Quan, Long},
title = {MVSNet: Depth Inference for Unstructured Multi-view Stereo},
booktitle = {The European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}