[PPT] - Semantic Segmentation Dr. Eyal Gruss Director of AI, Flatspace PowerPoint Presentation

SLIDE 1

Semantic Segmentation

Dr. Eyal Gruss

Director of AI, Flatspace

SLIDE 2

Talpiyot PhD Physics Machine Learning

Researcher
Consultant
Entrepreneur

Digital Artist

Eyal Gruss

SLIDE 3

For photorealistic VR experience

3D Model

Using deep neural networks

Architectural Interpretation Bitmap Floorplan

An AI-powered service that creates a VR model from a simple floorplan.

Flatspace

Demo video: http://flatspace.xyz

SLIDE 4

[Chart: top-5 classification error by year, 2010-2017 (y-axis 0%-30%): 28.19%, 25.77%, 16.42%, 11.74%, 6.66%, 3.57%, 2.99%, 2.25%. Human level: 5.10%. Annotation: move to deep neural networks (AlexNet).]

Image Recognition (ImageNet ILSVRC)

GoogLeNet Microsoft Residual Net

1.2M train images, 100k test images, 1000 categories

Trimps-Soushen (Ministry of Public Security, China)
Karpathy
Momenta/Oxford

SLIDE 5

Object Detection and Recognition (ImageNet)

googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html (Szegedy et al., GoogLeNet)

Live:

VGG
YOLO
YOLO v2
LeCun

Concurrence, Localization Occlusion Out of context Counting Tracking

SLIDE 6

Multi Instance Semantic Segmentation

Li et al., arxiv.org/abs/1611.07709

Won the COCO 2016 Detection Challenge (for segmentation)

SLIDE 7

Adversarial Perturbations Against Semantic Segmentation

Fischer et al., arxiv.org/abs/1703.01101
Xie et al., arxiv.org/abs/1703.08603
Metzen et al., arxiv.org/abs/1704.05712
Cisse et al., arxiv.org/abs/1707.05373

SLIDE 8

Other related tasks

Edge detection
Semantic edge detection
Surface normals
Matting / objectness (foreground/background)
Saliency / memorability
Pose estimation
Depth estimation
Optical flow interpolation and estimation
Motion prediction
E.g. Eigen and Fergus, UberNet, PixelNet combine several of the above

SLIDE 9

This talk: Semantic Segmentation

aka: scene labeling / scene parsing / dense prediction / dense labeling / pixel-level classification

[Figure panels: (d) input, (e) semantic segmentation, (f) naive instance segmentation, (g) instance segmentation]

SLIDE 10

Datasets and use cases

General
Pascal VOC 2012
MS COCO (evaluation only for instance segmentation)
ADE20K / SceneParse150K (all pixels annotated)
DAVIS 2017 (video; review)
Urban (e.g. for autonomous vehicles)
Cityscapes (all pixels annotated)
CMP Facades (strong priors)
KITTI road/lane
CamVid (all pixels annotated, video)
Aerial / Satellite
ISPRS Potsdam and Vaihingen
DSTL Kaggle (multi-modal)
Human parsing (LIP, MHP)
Medical imaging (can be 2.5D/multi-view)
More: riemenschneider.hayko.at/vision/dataset

SLIDE 11

Pascal VOC 2012 (Train+Validation): 11,530 images, 6,929 segmentations, 20 classes + background
github.com/nightrome/really-awesome-semantic-segmentation

SLIDE 12

Evaluation metrics

Pixel accuracy (dominated by background class)
Mean accuracy over classes (individual class recall does not penalize false positives; must include the background class)

Jaccard index = Intersection over Union (IoU) = |GT ∩ Pred| / |GT ∪ Pred| = TP / (TP + FN + FP)
IoU <= Recall = TP / |GT| and IoU <= Precision = TP / |Pred|
Usually: mean over classes, on the whole dataset
Can include or exclude the background class
Can be mean over images instead of whole dataset
Can be frequency weighted (unbalanced, similar to pixel accuracy)
Can be weighted by inverse instance size (cityscapes, important in traffic use cases)
Can be averaged with e.g. pixel accuracy (ADE20K)
Dice index = F1 score = 2|GT ∩ Pred| / (|GT| + |Pred|) = 2TP / (2TP + FN + FP)
= Harmonic mean of Recall and Precision
= 2IoU / (1 + IoU), Monotonic with IoU
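
A minimal NumPy sketch of the pixel-level metrics above, computed from a confusion matrix accumulated over the whole dataset. The function names and the toy 3x3 label maps are illustrative assumptions, not taken from any benchmark toolkit. A class that is absent from both GT and prediction would divide by zero and needs special handling in practice.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp          # GT pixels of the class that were missed
    fp = cm.sum(axis=0) - tp          # pixels wrongly assigned to the class
    iou = tp / (tp + fn + fp)
    dice = 2 * tp / (2 * tp + fn + fp)
    freq = cm.sum(axis=1) / cm.sum()  # share of GT pixels per class
    return {
        "pixel_acc": tp.sum() / cm.sum(),
        "mean_acc": np.mean(tp / (tp + fn)),   # mean per-class recall
        "iou": iou,
        "mean_iou": iou.mean(),
        "fw_iou": (freq * iou).sum(),          # frequency-weighted IoU
        "dice": dice,
    }

# Toy example with 3 classes (class 0 plays the role of background).
gt = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 2]])
pred = np.array([[0, 1, 1], [0, 1, 1], [2, 2, 0]])
m = segmentation_metrics(confusion_matrix(gt, pred, 3))
print(m["iou"], m["mean_iou"], m["pixel_acc"])
```

On this toy input the per-class IoU is 0.5, 0.75 and 2/3, and the Dice values satisfy Dice = 2IoU / (1 + IoU) class by class.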

SLIDE 13

Evaluation metrics

Trimap IoU around boundaries 4/8px (Krähenbühl and Koltun, Kohli et al.)
Boundary F1 (BF) - Nearest boundary pixel distance (Csurka et al.)
For some distance error tolerance = e.g. 0.75% of the image diagonal
Can be averaged with IoU (Davis)
Average precision (AP) = Area under the precision-recall curve (MS COCO)
Here precision, recall are instance-level given some IoU threshold e.g. 0.5
Can be additionally averaged over different thresholds (e.g. 0.5 - 0.95 in steps of 0.05)
Multiple detections (instance fragmentation) are counted as false positives beyond the best

Primary metric for instance segmentation (pixel-level metrics can be ambiguous)
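
The instance-level precision and recall above can be sketched as follows, assuming one binary mask per instance and a greedy match of each prediction (most confident first) to the best still-unmatched GT instance. This is a simplified, non-interpolated AP; the official MS COCO evaluation differs in details such as interpolation of the curve and handling of crowd regions. All names here are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def average_precision(gt_masks, pred_masks, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]      # most confident prediction first
    matched = set()                       # GT instances already claimed
    hits = []
    for i in order:
        best_j, best_iou = -1, iou_thresh
        for j, g in enumerate(gt_masks):
            if j in matched:
                continue                  # a GT instance can be matched once
            iou = mask_iou(pred_masks[i], g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            matched.add(best_j)
            hits.append(1)                # true positive
        else:
            hits.append(0)                # false positive (incl. duplicates)
    hits = np.array(hits)
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Area under the PR curve: precision summed at each recall step of 1/|GT|.
    return float(np.sum(precision * hits) / len(gt_masks))

def row_mask(start, stop, length=8):
    m = np.zeros(length, dtype=bool)
    m[start:stop] = True
    return m

gt_masks = [row_mask(0, 4), row_mask(4, 8)]
# Second prediction is a fragment of the first GT instance: a duplicate.
pred_masks = [row_mask(0, 4), row_mask(0, 2), row_mask(4, 7)]
scores = [0.9, 0.8, 0.7]
ap = average_precision(gt_masks, pred_masks, scores)
print(ap)
```

Here the fragment is counted as a false positive because its only overlapping GT instance is already taken, so the AP is (1 + 2/3) / 2 = 5/6. Averaging over thresholds 0.5 to 0.95 would just repeat the call with different `iou_thresh` values.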

SLIDE 14

Loss

Cross entropy = - sum_classes sum_pixels p * log(q)
p = targets; q = output probabilities
Can be weighted by inverse class size
Can be weighted to emphasize areas around edges (U-Net, Meyer)
IoU approximated with probabilities = sum_classes [(sum_pixels p*q) / (sum_pixels (p + q - p*q))]
Approximation is needed since IoU is discrete
Makes sense since this is our evaluation metric
Multiclass formulation is balanced over class sizes
Rediscovered in literature from time to time [1-16]
Visualead reported mixed results
Loss =
IoU [1 2 3 4 5 6]
Dice [7 8 9 10]
Tversky generalization [11]
0.1 * CE + 0.9 * (1 - Dice) [12]
CE - log(IoU) [13]
Other approximations [14 15 16 (TBD in TF)]
Total variation smoothing = sum_classes sum_{x,y} (|q_{x+1,y} - q_{x,y}| + |q_{x,y+1} - q_{x,y}|)
Adversarial (later)
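
A hedged NumPy sketch of the loss terms on this slide, for one image with one-hot targets p and probabilities q shaped (classes, H, W). Turning the soft IoU and Dice scores into losses as 1 minus the mean over classes is an assumption (the slide only gives the scores), as are the function names and the small EPS constant.

```python
import numpy as np

EPS = 1e-7  # numerical guard, an assumption of this sketch

def cross_entropy(p, q, class_weights=None):
    """-sum_classes sum_pixels p * log(q), optionally weighted per class."""
    per_class = -(p * np.log(q + EPS)).sum(axis=(1, 2))
    if class_weights is not None:
        per_class = per_class * class_weights
    return per_class.sum()

def soft_iou_loss(p, q):
    inter = (p * q).sum(axis=(1, 2))
    union = (p + q - p * q).sum(axis=(1, 2))
    return 1.0 - np.mean(inter / (union + EPS))

def soft_dice_loss(p, q):
    inter = (p * q).sum(axis=(1, 2))
    total = p.sum(axis=(1, 2)) + q.sum(axis=(1, 2))
    return 1.0 - np.mean(2 * inter / (total + EPS))

def total_variation(q):
    dx = np.abs(q[:, :, 1:] - q[:, :, :-1]).sum()   # horizontal neighbours
    dy = np.abs(q[:, 1:, :] - q[:, :-1, :]).sum()   # vertical neighbours
    return dx + dy

# Two classes on a 2x2 image: left column is class 0, right column class 1.
p = np.zeros((2, 2, 2))
p[0, :, 0] = 1
p[1, :, 1] = 1
q_uniform = np.full_like(p, 0.5)

combined = 0.1 * cross_entropy(p, q_uniform) + 0.9 * soft_dice_loss(p, q_uniform)
print(soft_iou_loss(p, q_uniform), soft_dice_loss(p, q_uniform), total_variation(p))
```

With a perfect prediction (q = p) the cross entropy and both soft losses go to about zero, while a uniform q = 0.5 gives a soft IoU loss of 2/3 and a soft Dice loss of 1/2 on this example. The `combined` line mirrors the 0.1 * CE + 0.9 * (1 - Dice) recipe; whether CE is summed or averaged there is not stated on the slide.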

SLIDE 15

Architectures

1. Patchwise CNN
2. FCN
3. DeepLab
4. DeconvNet
5. U-Net
6. SegNet
7. Dilated Convolutions (Yu and Koltun)
8. 100-Layer Tiramisu (DenseNets)
9. Wide ResNet
10. PSPNet
11. Adversarial
12. PolygonRNN
13. Mask R-CNN
14. Semi-supervised with unsupervised loss

SLIDE 16

Patchwise CNN

Ning et al., http://yann.lecun.com/exdb/publis/pdf/ning-05.pdf
Ciresan et al., people.idsia.ch/%7Ejuergen/nips2012.pdf
A sliding window CNN classifies each pixel in turn

SLIDE 17

Fully Convolutional NN

cs231n.github.io/convolutional-networks/#converting-fc-layers-to-conv-layers

Start from a CNN classifier
Convert fully connected to conv (with filter size = input volume, no padding):
CNN -> 7*7*512 -> fc(4096) -> 4096 -> fc(1000) -> 1000
CNN -> 7*7*512 -> conv(7*7*4096) -> 1*1*4096 -> conv(1*1*1000) -> 1*1*1000
Can take arbitrarily larger input:
224*224 -> 7*7*512 -> 1*1*1000
384*384 -> 12*12*512 -> 6*6*1000
Equivalent to sliding a patchwise CNN, but with a single pass that is much more efficient due to convolution sharing
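
The conversion above can be checked numerically. This is a toy sketch, assuming small stand-in sizes (8 channels and 5 outputs instead of 512 and 4096/1000) so that it runs quickly; `conv_valid` is a hypothetical helper, written as a plain loop for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, N = 8, 7, 5   # toy stand-ins for 512 channels, 7x7 window, 1000 outputs

W_fc = rng.standard_normal((N, K * K * C))   # fc weights on a flattened K*K*C volume
W_conv = W_fc.reshape(N, K, K, C)            # same weights viewed as N filters of K*K*C

def conv_valid(x, w):
    """Stride-1 convolution without padding. x: (H, W, C), w: (N, K, K, C)."""
    n, k = w.shape[0], w.shape[1]
    h_out, w_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((h_out, w_out, n))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

small = rng.standard_normal((7, 7, C))     # feature map of a 224*224 input
large = rng.standard_normal((12, 12, C))   # feature map of a 384*384 input

fc_out = W_fc @ small.ravel()
conv_small = conv_valid(small, W_conv)     # shape (1, 1, N), equals fc_out
conv_large = conv_valid(large, W_conv)     # shape (6, 6, N), one fc output per window
print(conv_small.shape, conv_large.shape)
```

Each of the 6*6 outputs on the larger map equals the fc layer applied to the corresponding 7*7 window, which is the sliding-window equivalence the slide refers to.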

SLIDE 18

Deconvolution/Upconvolution Layers

FC convolution transposed
cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf

Fractionally strided convolution
github.com/vdumoulin/conv_arithmetic

Stride = 2 Stride = 1/2 input

(Resolution Increasing Convolutions)
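
A minimal 1D sketch of a strided convolution and its transpose, to make the "stride 2" versus "stride 1/2" picture concrete. The helper names are hypothetical; the convolution is written as cross-correlation, as is usual in deep learning code.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid strided convolution: each output reads a window of the input."""
    k = len(w)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(n_out)])

def conv1d_transpose(y, w, stride=1):
    """Transposed convolution: each input value writes a scaled copy of w."""
    k = len(w)
    out = np.zeros((len(y) - 1) * stride + k)
    for i, v in enumerate(y):
        out[i * stride:i * stride + k] += v * w
    return out

x = np.arange(8.0)
w = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, -1.0, 2.0])

down = conv1d(x, w, stride=2)            # length 8 -> length 3
up = conv1d_transpose(y, w, stride=2)    # length 3 -> length 8
print(len(down), len(up))
```

The two operations are transposes of each other as linear maps, so the inner product of `conv1d(x, w, 2)` with `y` equals the inner product of `x` with `conv1d_transpose(y, w, 2)`. The transpose restores the input shape, not the input values.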

SLIDE 19

Fully Convolutional Network (FCN; 2014-11)

Long et al., arxiv.org/abs/1411.4038
Shelhamer et al., arxiv.org/abs/1605.06211
Start from classification CNN pre-trained on ImageNet (AlexNet/VGG-16/GoogLeNet) and convert fully connected to conv (conv7)
Replace final layer to 1*1*21 and add bilinear upsampling to get full spatial output (FCN-32s)
Add x2 deconv (initialized as bilinear) on conv7 and sum with conv prediction added to pool4
Add bilinear upsampling to get full spatial output (FCN-16s). Fine tune from FCN-32s
Do similarly for above fuse and pool3 (FCN-8s)
Pascal VOC 2012 IoU=62.2%-67.2% (up from 51.6%)
100-175 ms (vs. 50 s)

134M params
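
The "deconv initialized as bilinear" step can be sketched as below. This shows one common way to build a bilinear kernel for a given upsampling factor and apply it as a stride-2 transposed convolution on a single channel; it is an illustration under those assumptions, not the authors' code, and the function names are hypothetical.

```python
import numpy as np

def bilinear_kernel(factor):
    """2D bilinear interpolation weights for upsampling by `factor`."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    filt1d = 1 - np.abs(np.arange(size) - center) / factor
    return np.outer(filt1d, filt1d)

def upsample(x, factor):
    """Transposed convolution with the bilinear kernel, cropped to factor * input size."""
    k = bilinear_kernel(factor)
    size = k.shape[0]
    h, w = x.shape
    out = np.zeros(((h - 1) * factor + size, (w - 1) * factor + size))
    for i in range(h):
        for j in range(w):
            out[i * factor:i * factor + size, j * factor:j * factor + size] += x[i, j] * k
    crop = (size - factor) // 2
    return out[crop:crop + h * factor, crop:crop + w * factor]

kernel = bilinear_kernel(2)     # 4x4, first row 0.0625 0.1875 0.1875 0.0625
up = upsample(np.ones((3, 3)), 2)
print(kernel[0], up.shape)
```

A constant input stays constant away from the border, which is what one expects from interpolation; during training the kernel is then free to move away from this starting point.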

SLIDE 20

DeepLab (2014-12)

Chen et al. (Google), arxiv.org/abs/1412.7062
VGG-16 pre-trained on ImageNet -> fully conv
Cancel last two max-pool
Change conv after above to x2/x4 dilated convolutions
Train with x8 subsampled targets (IoU<90.7%). Infer with bilinear upsampling.
Fully connected CRF (raw image dependent potential) post-processing in inference (+ 3%-5%)
Add multi-scale layers fine tuned separately (similar to FCN-8s but with concats and convs)
Increase dilation for first FC layer to x12 (large field of view) + change FC kernel, filters
20.5M params
Pascal VOC 2012 IoU = 71.6%
V2: arxiv.org/abs/1606.00915 with ResNet-101 + “atrous spatial pyramid pooling”
Pascal VOC 2012 IoU = 79.7% Cityscapes IoU = 70.4%
V3: arxiv.org/abs/1706.05587
Pascal VOC 2012 IoU = 86.9% Cityscapes IoU = 81.3% (SOTA 2017)

Before softmax After softmax

hole = atrous = dilated convolutions increase field of view without decreasing resolution, or adding parameters
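
A small 1D sketch of a dilated convolution, assuming a hypothetical helper `dilated_conv1d`. The same three weights cover a span of 3 samples at dilation 1 and 9 samples at dilation 4, with stride 1 throughout, so the output is not subsampled.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """Valid, stride-1 convolution whose taps are `dilation` samples apart."""
    k = len(w)
    span = (k - 1) * dilation + 1          # field of view of one output
    n_out = len(x) - span + 1
    return np.array([np.dot(x[i:i + span:dilation], w) for i in range(n_out)])

x = np.arange(10.0)
w = np.ones(3)                             # 3 parameters in both cases

dense = dilated_conv1d(x, w, dilation=1)   # each output sums x[i], x[i+1], x[i+2]
wide = dilated_conv1d(x, w, dilation=4)    # each output sums x[i], x[i+4], x[i+8]
print(dense[0], wide)
```

Here `dense[0]` is 0 + 1 + 2 = 3 and `wide` is [12, 15]. The shorter output in the dilated case comes only from the lack of padding in this sketch.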

SLIDE 21

DeconvNet (2015-05)

Noh et al., arxiv.org/abs/1505.04366
VGG-16 pre-trained on ImageNet
Unpooling layers use saved max pooling indices
Symmetric encoder-decoder: multiple deconvolutions + BatchNorm + ReLU (no dropout)
Relies on region proposals. Training with two-stage curriculum learning:
1. Instances cropped to GT bounding boxes * 1.2, all non-class pixels labeled as background
2. Object proposals from edge-box * 1.2
Inference:
Top 50 objectness score of 2000 edge-box object proposals, Max per pixel/class before softmax
Fully connected CRF post-processing (+ ~1%)
252M params
Pascal VOC 2012 IoU = 70.5%
Ensemble with FCN-8s = 72.5%
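
The unpooling with saved max-pooling indices can be sketched for a single-channel 2x2 pooling as follows. The function names are hypothetical and the sketch assumes even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """Returns the pooled map and, per 2x2 block, the flat index (0-3) of the max."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    return blocks.max(axis=2), blocks.argmax(axis=2)

def max_unpool_2x2(pooled, idx):
    """Puts each pooled value back at its saved position; all other entries are zero."""
    h, w = pooled.shape
    blocks = np.zeros((h, w, 4), dtype=pooled.dtype)
    np.put_along_axis(blocks, idx[..., None], pooled[..., None], axis=2)
    return blocks.reshape(h, w, 2, 2).transpose(0, 2, 1, 3).reshape(h * 2, w * 2)

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [9, 0, 1, 1],
              [0, 0, 1, 7]])
pooled, idx = max_pool_2x2(x)
restored = max_unpool_2x2(pooled, idx)
print(pooled)
print(restored)
```

The restored map is sparse: it keeps 4, 5, 9 and 7 at their original locations and zeros elsewhere, which is why the decoder follows each unpooling with learned (de)convolutions to densify it.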

SLIDE 22

U-Net (2015-05)

Ronneberger et al., arxiv.org/abs/1505.04597
No VGG! Not pre-trained!
Skip connections to keep res.!
Separate deconv to: learned 2x2 upconv + (3x3 regular conv + ReLU) * 2
Weighting to emphasize areas around morphological edges
Implementations I’ve seen use half the filters and padding

dropout

SLIDE 23

SegNet (2015-11)

Badrinarayanan et al., arxiv.org/abs/1505.07293 arxiv.org/abs/1511.00561
VGG-16 pre-trained on ImageNet (without fully connected layers)
Unpooling layers use saved max pooling indices like in DeconvNet
Deconvolutions + BatchNorm + ReLU (some dropout)
They compare various decoders, and dropouts (arxiv.org/abs/1511.02680)
Pascal VOC 2012 IoU = 59.9%

SLIDE 24

Dilated Convolutions (2015-11)

Yu and Koltun, arxiv.org/abs/1511.07122
Front-end network + Context aggregation network
Front-end is a truncated VGG-16 like DeepLab + dilated convs, pre-trained on Pascal VOC 2012
Context aggregation is a 7-layer uniform resolution dilated convs + ReLUs, with increasing dilations and initialized to unit filters
Train with x8 subsampled targets. Front-end is trained first. Then context is added and trained with fixed front-end

Possible post-processing with fully connected CRF / CRF-RNN
Front-end alone: Pascal VOC 2012 IoU = 71.3%
Front-end + Context + CRF-RNN: Pascal VOC 2012 IoU = 75.3%
Dilation10: Cityscapes IoU = 67.1%
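
To see why increasing dilations aggregate context quickly, the receptive field of a stack of 3x3 dilated convolutions can be computed layer by layer. The dilation schedule below (1, 1, 2, 4, 8, 16, 1) is an assumption taken from the paper's basic context module rather than from this slide, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (per side, in pixels) after each stride-1 dilated layer."""
    rf = 1
    sizes = []
    for d in dilations:
        rf += (kernel - 1) * d   # each layer extends the field by (k - 1) * dilation
        sizes.append(rf)
    return sizes

context_dilations = [1, 1, 2, 4, 8, 16, 1]   # assumed schedule, see lead-in
sizes = receptive_field(context_dilations)
print(sizes)   # [3, 5, 9, 17, 33, 65, 67]
```

Seven undilated 3x3 layers would only reach 15x15, so the dilated stack sees a far larger context at the same resolution and parameter count.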

SLIDE 25

The One Hundred Layers Tiramisu (2016-11)

Jegou et al., arxiv.org/abs/1611.09326
DenseNets (few params, easy training)
Encoder-Decoder with skip connections

56 – 103 layers
1.5M – 9.4M params
No pre-training
No / negative results on large benchmarks

SLIDE 26

Wide ResNet (2016-11)

Wu et al., arxiv.org/abs/1611.10080
Wider or Deeper Resnets? Wider!
See also Littwin and Wolf, arxiv.org/abs/1611.02525
Wide 7-block ResNet pre-trained for classification, adapted to dilated convolutions a la DeepLab

Pascal VOC 2012 IoU = 82.5%
Cityscapes IoU = 78.4%
ADE20K avg(pixel acc., IoU) = 56.74%

SLIDE 27

Pyramid Scene Parsing (PSPNet; 2016-12)

Zhao et al., arxiv.org/abs/1612.01105 (trained models: Caffe, Keras)
Pre-trained dilated 101-269 ResNet + deep supervision auxiliary loss + pyramid pooling module

Pascal VOC 2012 IoU = 85.4% (1st place 2016)
Cityscapes IoU = 80.2% (1st place 2016). Video
ADE20K avg(pixel acc., IoU) = 57.21% (1st place 2016) SOTA! (2016)
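
A rough NumPy sketch of the pyramid pooling idea: average-pool the feature map to a few coarse grids, upsample each back to full size, and concatenate with the original features. The bin sizes 1, 2, 3, 6 follow the paper; the sketch simplifies by using nearest-neighbour upsampling and omitting the per-level 1x1 convolutions, and it assumes the feature map is at least 6x6. All function names are hypothetical.

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """x: (H, W, C) -> (bins, bins, C), averaging over roughly equal cells."""
    h, w, c = x.shape
    rows = np.linspace(0, h, bins + 1).astype(int)
    cols = np.linspace(0, w, bins + 1).astype(int)
    out = np.empty((bins, bins, c))
    for i in range(bins):
        for j in range(bins):
            out[i, j] = x[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].mean(axis=(0, 1))
    return out

def upsample_nearest(x, h, w):
    """Nearest-neighbour resize of x: (h0, w0, C) to (h, w, C)."""
    ri = np.arange(h) * x.shape[0] // h
    ci = np.arange(w) * x.shape[1] // w
    return x[ri][:, ci]

def pyramid_pooling(x, bin_sizes=(1, 2, 3, 6)):
    h, w, _ = x.shape
    levels = [upsample_nearest(adaptive_avg_pool(x, b), h, w) for b in bin_sizes]
    return np.concatenate([x] + levels, axis=2)   # original + one copy per level

rng = np.random.default_rng(0)
features = rng.standard_normal((12, 12, 4))
out = pyramid_pooling(features)
print(out.shape)   # (12, 12, 20)
```

The first pooled level is the global average broadcast to every pixel, which gives each location access to image-wide context; finer levels add progressively more local summaries.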

SLIDE 28

[Figure panels: Mismatched Relationship, Confusion Categories, Inconspicuous Classes]

SLIDE 29

Generative Adversarial Nets

Goodfellow et al., arxiv.org/abs/1406.2661
Generator (Creator)
Discriminator (Curator): Fake or Real?

SLIDE 30

Image to Image Translation

With Conditional Adversarial Networks (PatchGAN)

Isola et al., phillipi.github.io/pix2pix Interactive: affinelayer.com/pixsrv

Guide: ml4a.github.io/guides/Pix2Pix fotogenerator.npocloud.nl

SLIDE 31

Adversarial (2016-09)

Idea is to regulate naturalness (strong and smooth classes, sharp boundaries, denoising, global structure)

David Golan et al. (2016-09, first one AFAIK)
Pix2pix, Isola et al., arxiv.org/abs/1611.07004
Generator is U-Net style (with skip connections)
4x4 Conv with stride 2 – BatchNorm - ReLU (+ some dropout). No max-pooling.
L1 loss for generator
“PatchGAN” Discriminator takes both image and segmentation, averages over 70x70 patches
Adversarial loss hurts! Cityscapes IoU = 29% 
L1 only Cityscapes IoU = 35% 
FAIR, Luc et al., arxiv.org/abs/1611.08408
Generator is Yu and Koltun’s Dilated8
Cross-entropy loss for generator
Discriminator issue: we feed it continuous probabilities (cannot do SGD with discrete labels), but GT are discrete
Tested product with image and scaling GT, as alternative input to discriminator, but results were the same
Pascal VOC 2012 IoU = 73.3% (compare to Yu and Koltun’s 71.3%). Adversarial ~ 2%
Several citations using this

SLIDE 32

PolygonRNN (2017-04)

Castrejon et al. CSC2523_Project_Report, arxiv.org/abs/1704.05548
Sparse representation using polygons
Cityscapes IoU = 61.4% per instance, assuming given bounding boxes
Can speed-up manual annotation
CVPR 2017 Best Paper Honorable Mention Award (video)

(ConvLSTM)

SLIDE 33

Mask R-CNN (2017-03)

He et al., arxiv.org/abs/1703.06870 (tutorial)
Instance segmentation SOTA

SLIDE 34

Semi-Supervised Semantic Segmentation with Unsupervised Total Variation Loss

Javanmard et al., arxiv.org/abs/1605.01368

[Figure panels: Supervised (10 pix/image), Proposed (10 pix/image), Full labels, GT]

SLIDE 35

Meta references

Janai et al., arxiv.org/abs/1704.05519 (chapter 6)
Garcia-Garcia et al., arxiv.org/abs/1704.06857
meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years

blog.qure.ai/notes/semantic-segmentation-deep-learning-review
handong1587.github.io/deep_learning/2015/10/09/segmentation
github.com/kjw0612/awesome-deep-vision#semantic-segmentation
github.com/mrgloom/Semantic-Segmentation-Evaluation
github.com/fchollet/keras/issues/6538

SLIDE 36

Thanks!

Slides: bit.ly/semantic-segmentation
Contact: eyal.gruss@gmail.com