Computer Vision and Image Understanding

768 Citations, 2022
Wael Saideni, Fabien Courrèges, David Helbert

Abstract

In this work, we study a complete framework for Video Compressive Sensing (VCS), from capturing a sequence of video frames in a single compressed measurement to reconstructing the original frames. To the best of our knowledge, we present the first end-to-end sampling and recovery network built upon video Transformers, which are widely explored in vision-related tasks, to capture long-range spatio-temporal relations. Our proposed Video Transformer for Snapshot Compressive Imaging recovery (ViT-SCI) is based on Spatio-temporal Convolutional Multi-Head Attention (ST-ConvMHA), an extension of fully-connected attention adapted to vision problems. Our comprehensive qualitative and quantitative experiments on several datasets demonstrate that ViT-SCI outperforms previous state-of-the-art methods while reconstructing much faster, which paves the way for applying our algorithm in real-time applications. Indeed, ViT-SCI achieves high-quality reconstruction of 64 × 64 video frames at the unprecedented rate of 1 frame per ms. In addition, we provide an ablation study of the Transformer network to inspire future research on the abilities of Transformers in vision tasks.
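To make "a sequence of video frames in one single compressed measurement" concrete, the sketch below shows the forward model commonly used in snapshot compressive imaging, in which each frame is multiplied element-wise by its own mask and the results are summed into one 2D snapshot. This is an illustration only, not the sampling network described in the abstract: the function name `sci_measure`, the random binary masks, and the choice of 8 frames per snapshot are assumptions, and only the 64 × 64 frame size comes from the text.

```python
import numpy as np


def sci_measure(frames, masks):
    """Collapse T frames into one snapshot: Y = sum_t masks[t] * frames[t].

    Hypothetical helper for illustration; both inputs have shape (T, H, W).
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.ndim != 3 or frames.shape != masks.shape:
        raise ValueError("frames and masks must both have shape (T, H, W)")
    # Element-wise modulation of each frame, then summation over time.
    return (masks * frames).sum(axis=0)


rng = np.random.default_rng(0)
T, H, W = 8, 64, 64                          # T = 8 is an assumed value
frames = rng.random((T, H, W))               # stand-in for real video frames
masks = rng.integers(0, 2, size=(T, H, W))   # assumed random binary masks
snapshot = sci_measure(frames, masks)
print(snapshot.shape)
```

Recovery is the inverse problem: given `snapshot` and `masks`, estimate all `T` frames. The abstract's ViT-SCI learns both the sampling and the recovery end to end, which this fixed-mask sketch does not attempt to reproduce.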
