Spike Transformer: Monocular Depth Estimation for Spiking Camera
"Spiking camera is a bio-inspired vision sensor that mimics the sampling mechanism of the primate fovea, which has shown great potential for capturing high-speed dynamic scenes with a sampling rate of 40,000 Hz. Unlike conventional digital cameras, the spiking camera continuously captures photons and outputs asynchronous binary spikes that encode time, location, and light intensity. Because of the different sampling mechanisms, the off-the-shelf image-based algorithms for digital cameras are unsuitable for spike streams generated by the spiking camera. Therefore, it is of particular interest to develop novel, spike-aware algorithms for common computer vision tasks. In this paper, we focus on the depth estimation task, which is challenging due to the natural properties of spike streams, such as irregularity, continuity, and spatial-temporal correlation, and has not been explored for the spiking camera. We present Spike Transformer (Spike-T), a novel paradigm for learning spike data and estimating monocular depth from continuous spike streams. To fit spike data to Transformer, we present an input spike embedding equipped with a spatio-temporal patch partition module to maintain features from both spatial and temporal domains. Furthermore, we build two spike-based depth datasets. One is synthetic, and the other is captured by a real spiking camera. Experimental results demonstrate that the proposed Spike-T can favorably predict the scene’s depth and consistently outperform its direct competitors. More importantly, the representation learned by Spike-T transfers well to the unseen real data, indicating the generalization of Spike-T to real-world scenarios. To our best knowledge, this is the first time that directly depth estimation from spike streams becomes possible. Code and Datasets are available at https://github.com/Leozhangjiyuan/MDE-SpikingCamera."