High-speed CMOS video sensors play an increasingly important role in capturing rapid motion for manufacturing, research, and entertainment purposes. Though highly refined commercial systems have become more readily available, they remain costly due to the intensive data storage and throughput requirements of high-speed capture. Moreover, hardware limitations impose upper limits on capture rates regardless of cost. Recent developments in compressive imaging have opened up new avenues for bypassing some of these challenges by encoding motion optically at capture time. We explore strategies for encoding motion in a single capture by translating a binary physical mask during the exposure. By comparing, in simulation, a variety of mask patterns and reconstruction algorithms, we determine a strategy that provides high-quality results relative to ground truth as well as computationally efficient reconstruction. We demonstrate this strategy in a custom-built hardware prototype.
"High Spatio-Temporal Resolution Video with Compressed Sensing"
R. Koller, L. Schmid, N. Matsuda, T. Niederberger, L. Spinoulas, O. Cossairt, G. Schuster, and A. K. Katsaggelos
Opt. Express 23, 15992-16007 (2015)
During the course of image accumulation by the sensor, an occluding mask optically imposed on the sensor plane modulates scene motion. The mask contains a predetermined, fixed pattern created through an etching process. Modulation is achieved by translating the mask laterally using a piezoelectric stage.
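The modulation described above can be written as a simple forward model: each time slice of the scene is multiplied by the mask window currently in front of the sensor, and the sensor sums the products over the exposure. The following is a minimal simulation sketch, not the authors' code. The function names, the `step` parameter (mask shift per frame, in pixels), and the assumption that the mask is wider than the sensor by the total travel are illustrative choices.

```python
import numpy as np

def shifted_masks(mask, num_frames, step=1):
    """Return the mask windows seen by the sensor as the mask slides horizontally.

    mask: 2D array of shape (H, W + step * (num_frames - 1)).
    Result: array of shape (num_frames, H, W), one window per time slice.
    """
    height, width = mask.shape
    sensor_width = width - step * (num_frames - 1)
    if sensor_width <= 0:
        raise ValueError("mask is too narrow for the requested travel")
    windows = [mask[:, t * step : t * step + sensor_width] for t in range(num_frames)]
    return np.stack(windows, axis=0)

def coded_exposure(video, mask, step=1):
    """Simulate one coded capture: sum over time of (mask window * video frame).

    video: array of shape (T, H, W) holding the scene at T time slices.
    """
    codes = shifted_masks(mask, video.shape[0], step)
    if codes.shape != video.shape:
        raise ValueError("mask windows and video must have the same shape")
    return (codes * video).sum(axis=0)
```

With an all-ones mask the model reduces to an ordinary long exposure (plain motion blur), which is a convenient sanity check.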
Four different mask patterns considered in this work
(a) and (b) show a set of T different masks to be displayed on a spatial light modulator (SLM); (a) shows a set of thresholded Gaussian masks; (b) shows a set of masks that, when summed across the t direction, result in the same number of samples per pixel. The foreground masks in (c) and (d) each depict a single mask pattern that is placed on a translating stage and, when translated horizontally in the x direction, produces the datacube shown; (c) corresponds to a thresholded Gaussian mask; (d) corresponds to the optimized mask proposed in this work. The proposed mask, when translated, produces an average value that is identical across all pixels of the sensor.
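To illustrate the property stated for the proposed mask, the sketch below builds a translating binary mask for which every sensor pixel receives the same number of samples over the exposure. It uses one simple construction, chosen here for illustration and not claimed to be the authors' optimization: each mask row is periodic in x with a period equal to the number of frames, so any pixel sees exactly one full period as the mask moves one pixel per frame. All function and parameter names are hypothetical.

```python
import numpy as np

def periodic_normalized_mask(height, sensor_width, num_frames, ones_per_period, rng):
    """Binary mask whose rows repeat with period num_frames along x.

    Assumes the mask shifts by one pixel per frame, so the mask must be
    num_frames - 1 pixels wider than the sensor.
    """
    width = sensor_width + num_frames - 1
    repeats = -(-width // num_frames)  # ceiling division
    mask = np.zeros((height, width), dtype=np.uint8)
    for row in range(height):
        period = np.zeros(num_frames, dtype=np.uint8)
        open_positions = rng.choice(num_frames, size=ones_per_period, replace=False)
        period[open_positions] = 1
        mask[row] = np.tile(period, repeats)[:width]
    return mask

def samples_per_pixel(mask, num_frames):
    """Count how many time slices each sensor pixel is exposed to."""
    sensor_width = mask.shape[1] - num_frames + 1
    windows = [mask[:, t : t + sensor_width].astype(int) for t in range(num_frames)]
    return np.stack(windows, axis=0).sum(axis=0)
```

Calling `samples_per_pixel` on a mask from `periodic_normalized_mask` returns a constant array equal to `ones_per_period`, whereas an independently thresholded random mask generally gives counts that vary from pixel to pixel.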
Simulated Reconstruction Results
(a) The Monster and Road test scenes used for simulation; (b) Average reconstruction PSNRs for three different algorithms and the four studied masks. The SLM-based approach performs 1-3 dB better than the translating mask approach, and the Normalized masks provide an increase of 1-3 dB in reconstruction quality compared to their Random counterparts; (c) PSNR for each of the 36 reconstructed video frames of the Monster scene using our proposed Shifted-Normalized mask. Reconstruction quality varies due to varying motion between subsequent frames in the video sequence. OMP and GAP perform best, with CLS performing slightly worse. Note that the 2x2x36 patch size for the GAP algorithm was selected because the code required the time dimension of the patch to be a multiple of the spatial patch size. The code was further tested with a 7x7x35 patch, which gave performance comparable to that of the CLS/HP algorithm (not reported here, to keep the total number of reconstructed frames consistent). FISTA and L1_LS lead to the worst reconstruction quality; (d) PSNR vs. runtime comparison of four algorithms for the reconstruction of the Monster scene. CLS provides the best balance between reconstruction quality and speed.
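For reference, the PSNR values quoted above follow the standard definition, 10 log10(peak^2 / MSE), evaluated against the ground-truth frames. The sketch below computes it per frame and averaged over a video; the function names and the assumption that pixel values are scaled to [0, 1] (`peak=1.0`) are illustrative, not taken from the original work.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(peak ** 2 / mse))

def per_frame_psnr(reference_video, estimated_video, peak=1.0):
    """PSNR of each frame of a (T, H, W) video, as a list of length T."""
    return [psnr(ref, est, peak) for ref, est in zip(reference_video, estimated_video)]

def average_psnr(reference_video, estimated_video, peak=1.0):
    """Mean of the per-frame PSNR values."""
    return float(np.mean(per_frame_psnr(reference_video, estimated_video, peak)))
```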
Hardware System Overview
A moving scene (1) is imaged through an objective lens (2) onto an etched glass mask (4). A piezoelectric stage (3) translates the mask laterally during the exposure. A relay lens (5) images the mask and scene onto a CMOS sensor (6), which accumulates the motion of the scene modulated by the moving mask into a 2D image (7).
Example Captured Image
Mask modulation is visible in moving areas of the raw captured image. In this example, a playing card affixed to a metronome moves through dozens of pixels in image space during the time of exposure, which would result in motion blur using a conventional camera. With the moving mask in place, the card motion is instead spatially encoded such that different pixel subsets correspond to different time slices within the overall exposure. This spatiotemporal encoding forms the input to the reconstruction algorithm.
Coded Mask Detail
(a) Real capture of a static mask where fiducial calibration lines are visible; (b) Real capture while the mask moves by 10 pixels, or 45 μm, horizontally (full duration of acquisition); (c) Microscope images of two different coded masks showing imperfections of the fabrication process.
Real Reconstruction 1
Reconstruction of the Metronome scene using the Constrained Least-Squares method with a high-pass filter. The scene consists of small amounts of translating motion. Parts (a)-(b) show the captured image; Parts (c)-(h) present 3 of the 10 reconstructed frames; Part (i) depicts closeups of the translating "Jack". Please see the complete video in Media 1 below.
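As a rough illustration of a constrained least-squares reconstruction with a high-pass penalty, the sketch below solves min ||y - A x||^2 + weight * ||D x||^2 in closed form for a very small video. It is a sketch under stated assumptions, not the authors' implementation: the high-pass operator `D` is taken to be first differences along t, y, and x, the `weight` value is arbitrary, and dense matrices are used, so this only scales to tiny examples. All names are hypothetical.

```python
import numpy as np

def measurement_matrix(codes):
    """Matrix A such that A @ video.ravel() is the coded capture.

    codes: (T, H, W) array of per-frame mask windows.
    """
    # One diagonal block per frame, placed side by side.
    blocks = [np.diag(codes[t].ravel().astype(float)) for t in range(codes.shape[0])]
    return np.hstack(blocks)

def difference_operator(shape):
    """First differences along every axis of a volume (a simple high-pass filter)."""
    size = int(np.prod(shape))
    index = np.arange(size).reshape(shape)
    blocks = []
    for axis, length in enumerate(shape):
        first = np.take(index, np.arange(length - 1), axis=axis).ravel()
        second = np.take(index, np.arange(1, length), axis=axis).ravel()
        block = np.zeros((first.size, size))
        rows = np.arange(first.size)
        block[rows, first] = -1.0
        block[rows, second] = 1.0
        blocks.append(block)
    return np.vstack(blocks)

def cls_reconstruct(measurement, codes, weight=0.1):
    """Solve the regularized normal equations (A^T A + weight D^T D) x = A^T y."""
    a = measurement_matrix(codes)
    d = difference_operator(codes.shape)
    lhs = a.T @ a + weight * (d.T @ d)
    rhs = a.T @ measurement.ravel()
    return np.linalg.solve(lhs, rhs).reshape(codes.shape)
```

Because the penalty only suppresses differences, a scene that is constant in space and time is recovered exactly as long as the mask passes light somewhere; real scenes trade data fidelity against smoothness through `weight`.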
Real Reconstruction 2
Reconstruction of the Ball scene using the Constrained Least-Squares method with a high-pass filter. The scene consists of large amounts of translating motion. Parts (a)-(b) show the captured image; Parts (c)-(h) present 3 of the 10 reconstructed frames; Part (i) depicts closeups of the falling soccer ball. Please see the complete video in Media 2 below and the corresponding Orthogonal Matching Pursuit/Dictionary reconstruction in Media 3.
Real Reconstruction 3
Reconstruction of the Deck of Cards scene using the Constrained Least-Squares method with a high-pass filter. The scene consists of large amounts of arbitrary motion. Parts (a)-(b) show the captured image; Parts (c)-(h) present 3 of the 10 reconstructed frames; Part (i) depicts closeups of the rotating "Jack". Please see the complete video in Media 4.