S³Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data
Solving depth estimation with monocular cameras enables the possibility of widespread use of cameras as low-cost depth estimation sensors in applications such as autonomous driving and robotics. In order to learn such a scalable depth estimation model, we require a ton of data and labels which are targeted towards specific use-cases. Acquiring these labels is expensive and often requires a calibrated research platform to collect data, which can be unfeasible especially in situations where the terrain is unknown. There are two popular approaches that do not require annotated depth maps:(i) using labelled synthetic and unlabeled real data in an adversarial framework to predict more accurate depth, and (ii) unsupervised models which exploit geometric structure across space and time in monocular video frames. Ideally, we would like to leverage features provided by both approaches as they complement each other; however, existing methods do not adequately exploit these additive benefits. We present a self-supervised framework which combines these complementary features: We use synthetic as well as real images for training while exploiting geometric and temporal constraints. Our novel consolidated architecture performs better than existing state-of-the-art literature. We present a unique way to train this self-supervised framework, and achieve over 15% improvement over the previous supervised approaches with domain adaption and 10% over the previous self-supervised approaches."