
SLIDE 1

Bidirectional Recurrent Convolutional Networks for Video Super-Resolution

Qi Zhang & Yan Huang

May 10, 2017

Center for Research on Intelligent Perception and Computing (CRIPAC)
National Laboratory of Pattern Recognition (NLPR)
Institute of Automation, Chinese Academy of Sciences (CASIA)

SLIDE 2

CRIPAC

CRIPAC mainly focuses on the following research topics related to national public security.

Biometrics
Image and Video Analysis
Big Data and Multi-modal Computing
Content Security and Authentication
Sensing and Information Acquisition

CRIPAC receives regular funding from various government departments or agencies. It is also supported by funds of R&D projects from many other national and international sources. CRIPAC members publish widely in leading national and international journals and conferences such as IEEE Transactions on PAMI, IEEE Transactions on Image Processing, International Journal of Computer Vision, Pattern Recognition, Pattern Recognition Letters, ICCV, ECCV, CVPR, ACCV, ICPR, ICIP, etc.

http://cripac.ia.ac.cn/en/EN/volumn/home.shtml

SLIDE 3

NVAIL

Artificial Intelligence Laboratory

Research on artificial intelligence and deep learning

SLIDE 4

Outline

1 Deep Learning
2 Recurrent Convolutional Networks
3 Application to Video Super-Resolution
4 Future Work

SLIDE 5

Outline

1 Deep Learning
2 Recurrent Convolutional Networks
3 Application to Video Super-Resolution
4 Future Work

SLIDE 6

Deep Neural Networks (DNN)

Originate from:
1962 – simple/complex cell, Hubel and Wiesel
1970 – efficient error backpropagation, Linnainmaa
1979 – deep neocognitron, convolution, Fukushima
1987 – autoencoder, Ballard
1989 – backpropagation for CNN, LeCun
1991 – fundamental deep learning problem, Hochreiter
1991 – deep recurrent neural network, Schmidhuber
1997 – supervised LSTM RNN, Schmidhuber

Two drawbacks:
Large numbers of parameters → High computational cost
Small training set → Over-fitting problem

SLIDE 7

Two Recent Developments

Big Data

Video surveillance data size (PB), by year:
2009: 10000
2010: 13000
2011: 18200
2012: 27300
2013: 54600
2014: 87360

Cheap Computation

DNN can thus be fitted efficiently

SLIDE 8

Deep Learning: The Resurgence of DNN

Timeline: 2006, 2012, 2014

Breakthrough in 2006
ImageNet: 74% vs. 85%

CNN for visual tasks: DeepFace, CVPR2014; RCNN for detection, CVPR2014
RNN for sequence analysis: Activity recognition, CVPR2015; Video caption, CVPR2015
Representation learning
∙∙∙∙∙∙

Deep Learning promotes the fast development of various visual computing areas

SLIDE 9

Outline

1 Deep Learning
2 Recurrent Convolutional Networks
3 Application to Video Super-Resolution
4 Future Work

SLIDE 10

Deep Neural Networks (DNN)

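[Figure: network diagram with nodes $\mathbf{x}$, $\mathbf{h}$, $\mathbf{y}$ and weight $\mathbf{W}$]

$\mathbf{x} \in \mathbb{R}^{d}$, $\mathbf{h} \in \mathbb{R}^{n}$, $\mathbf{W} \in \mathbb{R}^{d \times n}$

$\mathbf{h} = \sigma(\mathbf{x}\mathbf{W})$

Sigmoid function: $\sigma(t) = \dfrac{1}{1 + e^{-t}}$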
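The layer equation on this slide, a sigmoid applied to a vector-matrix product, can be sketched in plain Python. This is a minimal illustration rather than code from the talk: the function names `sigmoid` and `dense_layer` and the toy values are assumptions made for the example.

```python
import math

def sigmoid(t):
    """Logistic sigmoid: 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def dense_layer(x, W):
    """Return h = sigmoid(x W) for a length-d row vector x and a d-by-n matrix W."""
    d, n = len(W), len(W[0])
    if len(x) != d:
        raise ValueError("x must have one entry per row of W")
    return [sigmoid(sum(x[i] * W[i][j] for i in range(d))) for j in range(n)]

# Toy values (hypothetical): d = 3 inputs, n = 2 hidden units.
x = [1.0, -2.0, 0.5]
W = [[0.1, 0.0],
     [0.2, 0.0],
     [0.3, 0.0]]
h = dense_layer(x, W)  # second column of W is zero, so h[1] = sigmoid(0) = 0.5
```

Every hidden unit has its own column of weights, which is why the parameter count grows with the product of input and hidden sizes.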

SLIDE 11

Recurrent Neural Networks (RNN)

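[Figure: a DNN ($\mathbf{x}$, $\mathbf{h}$, $\mathbf{y}$, $\mathbf{W}$) beside an RNN unrolled over three time steps ($\mathbf{x}_1, \mathbf{h}_1$; $\mathbf{x}_2, \mathbf{h}_2$; $\mathbf{x}_3, \mathbf{h}_3$), with $\mathbf{W}$ shared across steps and $\mathbf{U}$ linking consecutive hidden states]

$\mathbf{h}_t = \sigma(\mathbf{x}_t \mathbf{W} + \mathbf{h}_{t-1} \mathbf{U})$

$\mathbf{x}_t \in \mathbb{R}^{d}$, $\mathbf{h}_t \in \mathbb{R}^{n}$, $\mathbf{W} \in \mathbb{R}^{d \times n}$, $\mathbf{U} \in \mathbb{R}^{n \times n}$

Temporal dependency modeling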
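The recurrence on this slide can be sketched as a loop in which each hidden state depends on the current input and the previous hidden state. This is a minimal sketch under stated assumptions: the name `rnn_forward`, the all-zero initial state, and the toy values are not from the slides.

```python
import math

def sigmoid(t):
    """Logistic sigmoid: 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def rnn_forward(xs, W, U):
    """Apply h_t = sigmoid(x_t W + h_{t-1} U) over the sequence xs.

    W is d-by-n and U is n-by-n. The initial state is all zeros, which is
    an assumption; the slide does not specify it.
    """
    n = len(U)
    h = [0.0] * n
    states = []
    for x in xs:
        h = [
            sigmoid(
                sum(x[i] * W[i][j] for i in range(len(x)))
                + sum(h[k] * U[k][j] for k in range(n))
            )
            for j in range(n)
        ]
        states.append(h)
    return states

# Toy run (hypothetical values): scalar input and state with W = 0 and U = 1,
# so each state is driven only by the previous state.
states = rnn_forward([[1.0], [1.0], [1.0]], W=[[0.0]], U=[[1.0]])
```

The same two matrices are reused at every step, so the model handles sequences of any length with a fixed number of parameters.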

SLIDE 12

Recurrent Convolutional Networks (RCN)

DNN → (Convolutional) → CNN
DNN → (Sequential) → RNN
CNN → (Sequential) → RCN
RNN → (Convolutional) → RCN

DNN: Deep Neural Networks
RNN: Recurrent Neural Networks
CNN: Convolutional Neural Networks

SLIDE 13

Applications of RCN

Scene Labeling, NIPS15
Action Recognition, ICLR15
Video SR, NIPS15 & TPAMI17
Weather Nowcasting, NIPS15
Person ReID, CVPR16
Object Recognition, CVPR15

SLIDE 14

Outline

1 Deep Learning
2 Recurrent Convolutional Networks
3 Application to Video Super-Resolution
4 Future Work

SLIDE 15

Video Super-Resolution

A great need for super-resolving low-resolution videos

[Diagram: low-resolution videos → super-resolution (denoising, deblurring, upscaling) → high-resolution videos → display on high-resolution devices]

SLIDE 16

Two Main Approaches (1/2)

1. Single-Image super-resolution [1-6]

One-to-One scheme, super-resolve each video frame independently
Ignore the intrinsic temporal dependency relation of video frames
Low computational complexity, fast

[1] Dong et al., Learning a deep convolutional network for image super-resolution. ECCV, 2014.
[2] Timofte et al., Anchored neighborhood regression for fast example-based super-resolution. ICCV, 2013.
[3] Zeyde et al., On single image scale-up using sparse-representations. Curves and Surfaces, 2012.
[4] Yang et al., Image super-resolution via sparse representation. IEEE TIP, 2010.
[5] Bevilacqua et al., Low-complexity single-image super-resolution. BMVC, 2012.
[6] Chang et al., Super-resolution through neighbor embedding. CVPR, 2004.

SLIDE 17

Two Main Approaches (2/2)

2. Multi-Frame super-resolution [7-11]

Model the temporal dependency relation by motion estimation
Many-to-One scheme, use multiple adjacent frames to super-resolve a frame
High computational complexity, slow

[7] Liu and Sun, On Bayesian adaptive video super-resolution. IEEE TPAMI, 2014.
[8] Takeda et al., Super-resolution without explicit subpixel motion estimation. IEEE TIP, 2009.
[9] Mitzel et al., Video super-resolution using duality based TV-L1 optical flow. PR, 2009.
[10] Protter et al., Generalizing the nonlocal-means to super-resolution reconstruction. IEEE TIP, 2009.
[11] Fransens et al., Optical flow based super-resolution: A probabilistic approach. CVIU, 2007.

SLIDE 18

Motivation

RNN can model long-term contextual information of temporal sequences well

Convolutional operation can scale to full videos of any spatial size and temporal step

➢ Propose bidirectional recurrent convolutional networks, different from vanilla RNN:

1. Commonly-used full connections are replaced with weight-sharing convolutions
2. Conditional convolutions are added for learning visual-temporal dependency relation

RNN: Recurrent Neural Networks
SR: Super-Resolution
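The two changes to the vanilla RNN listed on this slide can be sketched as below. This is an illustrative sketch, not the authors' implementation. The following are assumptions made for the example: single-channel frames, a ReLU activation, zero initial frames, the same kernels for both directions, summing the forward and backward passes, and reading "conditional convolution" as a convolution from the previous input frame to the current hidden state. All function names are hypothetical.

```python
def conv2d_same(img, kernel):
    """Zero-padded 2D cross-correlation; the output has the same size as img."""
    rows, cols = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - kh // 2, c + j - kw // 2
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += img[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def recurrent_conv_pass(frames, k_ff, k_rec, k_cond):
    """One directional pass over a list of 2D frames.

    h_t = relu(k_ff * x_t + k_rec * h_{t-1} + k_cond * x_{t-1}),
    where * is conv2d_same and the frames before the first one are zero
    (both the activation and the zero start are assumptions).
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    h_prev = [[0.0] * cols for _ in range(rows)]
    x_prev = [[0.0] * cols for _ in range(rows)]
    hidden = []
    for x in frames:
        a = conv2d_same(x, k_ff)         # feedforward convolution
        b = conv2d_same(h_prev, k_rec)   # recurrent convolution
        c = conv2d_same(x_prev, k_cond)  # conditional convolution
        h = [[max(0.0, a[r][q] + b[r][q] + c[r][q]) for q in range(cols)]
             for r in range(rows)]
        hidden.append(h)
        h_prev, x_prev = h, x
    return hidden


def bidirectional_pass(frames, k_ff, k_rec, k_cond):
    """Run the pass forward and backward in time and add the two results."""
    fwd = recurrent_conv_pass(frames, k_ff, k_rec, k_cond)
    bwd = recurrent_conv_pass(frames[::-1], k_ff, k_rec, k_cond)[::-1]
    return [[[f + g for f, g in zip(f_row, b_row)]
             for f_row, b_row in zip(f_frame, b_frame)]
            for f_frame, b_frame in zip(fwd, bwd)]


# Toy video: three 1x1 frames and 1x1 kernels (hypothetical values).
frames = [[[1.0]], [[2.0]], [[3.0]]]
one, zero = [[1.0]], [[0.0]]
fwd_cond = recurrent_conv_pass(frames, one, zero, one)  # x_t + x_{t-1}
both = bidirectional_pass(frames, one, one, zero)
```

Because the kernels slide over the frame instead of connecting every pixel to every hidden unit, the same weights apply to frames of any spatial size and to sequences of any length.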

SLIDE 19

Bidirectional Recurrent Convolutional Networks

– learn spatial dependency between a low-resolution frame and its high-resolution result
– model long-term temporal dependency relation across video frames
– enhance visual-temporal dependency relation modeling

SLIDE 20

Learning

Define an end-to-end mapping $O(\cdot)$ from low-resolution frames $\mathcal{X}$ to high-resolution frames $\mathcal{Y}$

Learning proceeds by optimizing the Mean Square Error (MSE) between predicted frames $O(\mathcal{X})$ and $\mathcal{Y}$:

$L = \| O(\mathcal{X}) - \mathcal{Y} \|^{2}$

– stochastic gradient descent
– small learning rate in the output layer: 1e-4
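The squared-error objective and a gradient step with the 1e-4 learning rate from this slide can be illustrated on a toy problem. This is a sketch under stated assumptions: the one-parameter model `pred = scale * x` stands in for the network, the step uses the whole toy batch where SGD would sample a mini-batch, and all names and values are hypothetical.

```python
def squared_error(pred, target):
    """Sum of squared differences between predicted and true pixel values."""
    return sum((p - y) ** 2 for p, y in zip(pred, target))

def gradient_step(scale, low_res, high_res, learning_rate):
    """One gradient step for the toy model pred = scale * x.

    The gradient of the squared error with respect to scale is
    sum(2 * (scale * x - y) * x).
    """
    grad = sum(2.0 * (scale * x - y) * x for x, y in zip(low_res, high_res))
    return scale - learning_rate * grad

# Toy data (hypothetical): the ideal scale is 2.
low_res = [1.0, 2.0, 3.0]
high_res = [2.0, 4.0, 6.0]

scale = 0.0
loss_before = squared_error([scale * x for x in low_res], high_res)
scale = gradient_step(scale, low_res, high_res, learning_rate=1e-4)
loss_after = squared_error([scale * x for x in low_res], high_res)
```

Dividing the sum by the number of pixels gives the mean square error; the minimizer is the same either way.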

SLIDE 21

Experiments

Train the model on 25 YUV format video sequences
– volume-based training
– number of volumes: roughly 41,000
– volume size: 32 × 32 × 10

Test on a variety of real-world videos
– severe motion blur
– motion aliasing
– complex motions

[Figures: training videos; testing videos]
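Volume-based training, as described on this slide, cuts each video into small spatio-temporal blocks. The sketch below shows one way to extract 32 × 32 × 10 volumes. The strides are not given in the slides, so the non-overlapping defaults here are assumptions, as are the function name and the frame-list layout (time, rows, columns).

```python
def extract_volumes(video, size=32, depth=10, spatial_stride=32, temporal_stride=10):
    """Cut a video (list of 2D frames) into size x size x depth volumes.

    Each returned volume is a list of `depth` frames, each frame being a
    size-by-size crop. The stride defaults are hypothetical.
    """
    n_frames, rows, cols = len(video), len(video[0]), len(video[0][0])
    volumes = []
    for t in range(0, n_frames - depth + 1, temporal_stride):
        for r in range(0, rows - size + 1, spatial_stride):
            for c in range(0, cols - size + 1, spatial_stride):
                volumes.append([
                    [row[c:c + size] for row in frame[r:r + size]]
                    for frame in video[t:t + depth]
                ])
    return volumes

# Toy video (hypothetical): 20 frames of 64 x 96 pixels.
video = [[[0.0] * 96 for _ in range(64)] for _ in range(20)]
volumes = extract_volumes(video)
```

Smaller strides produce overlapping volumes and therefore more training samples from the same footage.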

SLIDE 22

PSNR Comparison

Table 1: The results of PSNR (dB) and test time (sec) on the test video sequences.

PSNR: peak signal-to-noise ratio

[1] Video enhancer. http://www.infognition.com/videoenhancer/, version 1.9.10. 2014.
[4] Bevilacqua et al., Low-complexity single-image super-resolution. BMVC, 2012.
[5] Chang et al., Super-resolution through neighbor embedding. CVPR, 2004.
[6] Dong et al., Learning a deep convolutional network for image super-resolution. ECCV, 2014.
[20] Takeda et al., Super-resolution without explicit subpixel motion estimation. IEEE TIP, 2009.
[22] Timofte et al., Anchored neighborhood regression for fast example-based super-resolution. ICCV, 2013.
[24] Yang et al., Image super-resolution via sparse representation. IEEE TIP, 2010.
[25] Zeyde et al., On single image scale-up using sparse-representations. Curves and Surfaces, 2012.

Surpass state-of-the-art methods in PSNR, due to the effective temporal dependency modelling
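For readers unfamiliar with the metric, PSNR can be computed from the mean squared error between a reference frame and its reconstruction. This is a minimal sketch of the standard definition; the peak value of 255 assumes 8-bit pixel values, and the slides do not say which channels or peak the authors used.

```python
import math

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D frames.

    PSNR = 10 * log10(peak^2 / MSE). Returns infinity for identical frames.
    """
    squared_sum = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for a, b in zip(ref_row, est_row):
            squared_sum += (a - b) ** 2
            count += 1
    mse = squared_sum / count
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)

# Toy frames (hypothetical): one pixel differs by 10, so MSE = 100 / 4 = 25.
reference = [[100.0, 100.0], [100.0, 100.0]]
estimate = [[110.0, 100.0], [100.0, 100.0]]
value = psnr(reference, estimate)
```

Higher values mean the reconstruction is closer to the reference. The scale is logarithmic, so small dB gains can reflect a substantial drop in error.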

SLIDE 23

Model Architecture

Investigate the impact of our model architecture on the performance

Take a simplified network containing only feedforward ($v$) convolution as a benchmark

Study its variants by successively adding the bidirectional ($b$), recurrent ($r$) and conditional ($t$) schemes

Table 1: The results of PSNR (dB) by variants of BRCN on the testing video sequences.

SLIDE 24

Running Time

Figure: Speed vs. PSNR for all the comparison methods.

Outperform both single-image and multi-frame SR methods
Achieve comparable speed with the fastest single-image SR methods

SLIDE 25

Closeup Comparison

Figure: Comparison among original frames (2nd, 3rd and 4th frames, from the top row to the bottom) of the Dancing video and super-resolved results by Bicubic, 3DSKR, ANR and BRCN, respectively.

Our method is able to recover more image details than others, under severe motion conditions

SLIDE 26

Example

Upscaling factor: 4

87 × 157 → 348 × 628

Comparison:

Bicubic (top) Ours (bottom)

SLIDE 27

Conclusion

Bidirectional Recurrent Convolutional Networks
– bidirectional recurrent and conditional convolutions
– an end-to-end framework, without pre/post-processing
– good performance and fast speed

For more details, please refer to the following papers:

1. Yan Huang, Wei Wang, and Liang Wang, Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution. Advances in Neural Information Processing Systems (NIPS), pp. 235-243, 2015.
2. Yan Huang, Wei Wang, and Liang Wang, Video Super-Resolution via Bidirectional Recurrent Convolutional Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017, Accepted.

SLIDE 28

Outline

1 Deep Learning
2 Recurrent Convolutional Networks
3 Application to Video Super-Resolution
4 Future Work

SLIDE 29

Future Work

➢ For performance improvement
– extend our model to have a deeper architecture, e.g., based on the 19-layer VGG net
– incorporate some effective strategies, e.g., motion ensemble and residual connection

➢ For speed acceleration
– replace the current pre-upsampling step by learning diverse upsampling filters with deconvolution layers

➢ Others
– collect a large-scale high-resolution video dataset, and try to learn our model directly from raw videos

SLIDE 30

Acknowledgement

NVAIL

Artificial Intelligence Laboratory

Sponsor excellent hardware resources

SLIDE 31