Computer Vision and Image Understanding

768 Citations · 2022
Wael Saideni, Fabien Courrèges, David Helbert

Abstract

In this work, we study a complete framework of Video Compressive Sensing (VCS), from capturing a sequence of video frames in a single compressed measurement to reconstructing the original frames. To the best of our knowledge, we present the first end-to-end sampling and recovery network built upon video Transformers, which are widely explored in vision-related tasks, to capture long-range spatio-temporal relations. Our proposed Video Transformer for Snapshot Compressive Imaging recovery (ViT-SCI) is based on Spatio-temporal Convolutional Multi-Head Attention (ST-ConvMHA), an extended version of fully-connected attention adapted for vision problems. Our comprehensive qualitative and quantitative experiments on several datasets demonstrate that ViT-SCI outperforms previous state-of-the-art methods while reconstructing much faster, which paves the way for applying our algorithm in real-time applications. Indeed, ViT-SCI achieves high-quality reconstruction on 64 × 64 video frames at the unprecedented rate of 1 frame per ms. In addition, an ablation study on the Transformer network is provided to inspire future research on the abilities of Transformers in vision tasks.
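
As a rough illustration of what "capturing a sequence of video frames in a single compressed measurement" can mean, the sketch below uses the mask-and-sum forward model commonly used in snapshot compressive imaging: each frame is multiplied element-wise by its own mask and the results are summed into one 2D snapshot. This is not the paper's learned sampling network. The function name `sci_measure`, the random binary masks, and the choice of 8 frames are assumptions made for illustration; only the 64 × 64 frame size comes from the abstract.

```python
import numpy as np

def sci_measure(frames, masks):
    """Collapse T frames into one snapshot: Y = sum_t masks[t] * frames[t].

    Hypothetical helper for illustration; frames and masks have shape (T, H, W).
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must share shape (T, H, W)")
    # Element-wise modulation of every frame, then a sum over the time axis.
    return (masks * frames).sum(axis=0)

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64                          # T = 8 is an assumed frame count
frames = rng.random((T, H, W))               # stand-in video clip
masks = rng.integers(0, 2, size=(T, H, W))   # assumed random binary masks
snapshot = sci_measure(frames, masks)
print(snapshot.shape)  # (64, 64)
```

Recovery is the inverse problem: estimating all T frames from the single `snapshot` and the known masks, which is the step the abstract assigns to the Transformer-based network.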