Semantic Segmentation Dr. Eyal Gruss Director of AI, Flatspace - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Semantic Segmentation

  • Dr. Eyal Gruss

Director of AI, Flatspace

slide-2
SLIDE 2

Talpiyot PhD Physics Machine Learning

  • Researcher
  • Consultant
  • Entrepreneur

Digital Artist

Eyal Gruss

slide-3
SLIDE 3

For photorealistic VR experience

3D Model

Using deep neural networks

Architectural Interpretation Bitmap Floorplan

An AI-powered service that creates a VR model from a simple floorplan.

Flatspace

Demo video: http://flatspace.xyz

slide-4
SLIDE 4

Image Recognition (ImageNet ILSVRC)

1.2M train images, 100k test images, 1000 categories

Top 5 classification error by year:

  • 2010: 28.19%
  • 2011: 25.77%
  • 2012: 16.42% (move to deep neural networks: AlexNet)
  • 2013: 11.74%
  • 2014: 6.66% (GoogLeNet)
  • 2015: 3.57% (Microsoft Residual Net)
  • 2016: 2.99% (Trimps-Soushen, Ministry of Public Security, China)
  • 2017: 2.25% (Momenta/Oxford)
  • Human level: 5.10% (Karpathy)

slide-5
SLIDE 5

Object Detection and Recognition (ImageNet)

googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html (Szegedy et al., GoogLeNet)

Live:

  • VGG
  • YOLO
  • YOLO v2
  • LeCun

Concurrence, Localization, Occlusion, Out of context, Counting, Tracking

slide-6
SLIDE 6

Multi Instance Semantic Segmentation

Li et al., arxiv.org/abs/1611.07709

Won the COCO 2016 Detection Challenge (for segmentation)

slide-7
SLIDE 7

Adversarial Perturbations Against Semantic Segmentation

  • Fischer et al., arxiv.org/abs/1703.01101
  • Xie et al., arxiv.org/abs/1703.08603
  • Metzen et al., arxiv.org/abs/1704.05712
  • Cisse et al., arxiv.org/abs/1707.05373

slide-8
SLIDE 8

Other related tasks

  • Edge detection
  • Semantic edge detection
  • Surface normals
  • Matting / objectness (foreground/background)
  • Saliency / memorability
  • Pose estimation
  • Depth estimation
  • Optical flow interpolation and estimation
  • Motion prediction
  • E.g. Eigen and Fergus, UberNet, PixelNet combine several of the above

slide-9
SLIDE 9

This talk: Semantic Segmentation

aka: scene labeling / scene parsing / dense prediction / dense labeling / pixel-level classification

Figure panels: (d) input, (e) semantic segmentation, (f) naive instance segmentation, (g) instance segmentation

slide-10
SLIDE 10

Datasets and use cases

  • General
      • Pascal VOC 2012
      • MS COCO (evaluation only for instance segmentation)
      • ADE20K / SceneParse150K (all pixels annotated)
      • DAVIS 2017 (video; review)
  • Urban (e.g. for autonomous vehicles)
      • Cityscapes (all pixels annotated)
      • CMP Facades (strong priors)
      • KITTI road/lane
      • CamVid (all pixels annotated, video)
  • Aerial / Satellite
      • ISPRS Potsdam and Vaihingen
      • DSTL Kaggle (multi-modal)
  • Human parsing (LIP, MHP)
  • Medical imaging (can be 2.5D/multi-view)
  • More: riemenschneider.hayko.at/vision/dataset
slide-11
SLIDE 11

Pascal VOC 2012, Train+Validation: 11,530 images, 6,929 segmentations, 20 classes + background

github.com/nightrome/really-awesome-semantic-segmentation

slide-12
SLIDE 12

Evaluation metrics

  • Pixel accuracy (dominated by background class)
  • Mean accuracy over classes (individual class recall does not penalize false pos; must include background class)

  • Jaccard index = Intersection over Union (IoU) = (GT ∩ Pred) / (GT ∪ Pred) = TP / (TP + FN + FP)
  • IoU <= Recall = TP / GT, and IoU <= Precision = TP / Pred
  • Usually: mean over classes, on the whole dataset
  • Can include or exclude the background class
  • Can be mean over images instead of whole dataset
  • Can be frequency weighted (unbalanced, similar to pixel accuracy)
  • Can be weighted by inverse instance size (cityscapes, important in traffic use cases)
  • Can be averaged with e.g. pixel accuracy (ADE20K)
  • Dice index = F1 score = 2(GT ∩ Pred) / (GT + Pred) = 2TP / (2TP + FN + FP)
  • = Harmonic mean of Recall and Precision
  • = 2IoU / (1 + IoU), Monotonic with IoU
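
The definitions above can be checked on a toy example. This is a minimal sketch in plain Python, assuming labels are given as flat sequences of class ids; the function names and the toy data are illustrative only and do not come from any particular benchmark toolkit.

```python
def confusion_counts(gt, pred, cls):
    """Count TP, FP, FN for one class over flat label sequences."""
    tp = sum(1 for g, p in zip(gt, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gt, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gt, pred) if g == cls and p != cls)
    return tp, fp, fn


def iou(tp, fp, fn):
    """Jaccard index for one class: TP / (TP + FN + FP)."""
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")


def dice(tp, fp, fn):
    """Dice index (F1 score) for one class: 2TP / (2TP + FN + FP)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")


def pixel_accuracy(gt, pred):
    """Fraction of pixels labeled correctly, regardless of class."""
    return sum(1 for g, p in zip(gt, pred) if g == p) / len(gt)


def mean_iou(gt, pred, classes):
    """Mean of per-class IoU, computed over the whole (toy) dataset."""
    scores = [iou(*confusion_counts(gt, pred, c)) for c in classes]
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 pixels, 3 classes (0 plays the role of background).
gt = [0, 0, 0, 0, 1, 1, 1, 2]
pred = [0, 0, 0, 1, 1, 1, 2, 2]

print(pixel_accuracy(gt, pred))       # 6 of 8 pixels are correct
print(mean_iou(gt, pred, [0, 1, 2]))  # mean of 3/4, 2/4 and 1/2
```

For class 1 in this toy data, IoU is 0.5 and Dice is 2/3, which matches Dice = 2IoU / (1 + IoU).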
slide-13
SLIDE 13

Evaluation metrics

  • Trimap IoU around boundaries 4/8px (Krähenbühl and Koltun, Kohli et al.)
  • Boundary F1 (BF) - Nearest boundary pixel distance (Csurka et al.)
  • For some distance error tolerance = e.g. 0.75% of the image diagonal
  • Can be averaged with IoU (Davis)
  • Average precision (AP) = Area under the precision-recall curve (MS COCO)
  • Here precision, recall are instance-level given some IoU threshold e.g. 0.5
  • Can be additionally averaged over different thresholds (e.g. 0.5 - 0.95 in steps of 0.05)
  • Multiple detections (instance fragmentation) are counted as false positives beyond the best

  • Primary metric for instance segmentation (pixel-level metrics can be ambiguous)
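
The instance-level average precision described above can be sketched as follows. This is a simplified, non-interpolated version, not the official MS COCO evaluation code. The input format (a score, the id of the best-overlapping ground-truth instance, and the mask IoU with it) is a hypothetical one chosen for illustration.

```python
def average_precision(detections, num_gt, iou_threshold=0.5):
    """Area under the precision-recall curve at one IoU threshold.

    detections: list of (score, gt_id, overlap), where gt_id is the
    ground-truth instance the detection overlaps most (or None) and
    overlap is their IoU. This input format is an assumption.
    """
    matched = set()
    tp_flags = []
    # Visit detections from most to least confident.
    for score, gt_id, overlap in sorted(detections, key=lambda d: -d[0]):
        is_tp = (
            gt_id is not None
            and overlap >= iou_threshold
            and gt_id not in matched  # later duplicates count as false positives
        )
        if is_tp:
            matched.add(gt_id)
        tp_flags.append(is_tp)

    ap, tp, prev_recall = 0.0, 0, 0.0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            tp += 1
            precision = tp / rank
            recall = tp / num_gt
            ap += precision * (recall - prev_recall)
            prev_recall = recall
    return ap


# Hypothetical detections for an image with two ground-truth instances "a" and "b".
detections = [
    (0.9, "a", 0.8),   # best detection of "a"
    (0.8, "a", 0.7),   # fragment of "a": false positive beyond the best
    (0.7, "b", 0.6),
    (0.6, None, 0.0),  # overlaps nothing
]
ap50 = average_precision(detections, num_gt=2, iou_threshold=0.5)
ap75 = average_precision(detections, num_gt=2, iou_threshold=0.75)
print(ap50, ap75)
```

Averaging the result over several thresholds (for example 0.5 to 0.95) then only requires calling the function once per threshold and taking the mean.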
slide-14
SLIDE 14

Loss

  • Cross entropy = $-\sum_{\text{classes}} \sum_{\text{pixels}} p \log q$
  • p = targets; q = output probabilities
  • Can be weighted by inverse class size
  • Can be weighted to emphasize areas around edges (U-Net, Meyer)
  • IoU approximated with probabilities = $\sum_{\text{classes}} \left[ \sum_{\text{pixels}} p q \,/\, \sum_{\text{pixels}} (p + q - p q) \right]$
  • Approximation is needed since IoU is discrete
  • Makes sense since this is our evaluation metric
  • Multiclass formulation is balanced over class sizes
  • Rediscovered in literature from time to time [1-16]
  • Visualead reported mixed results
  • Loss =
  • IoU [1 2 3 4 5 6]
  • Dice [7 8 9 10]
  • Tversky generalization [11]
  • 0.1 * CE + 0.9 * (1 - Dice) [12]
  • CE - log(IoU) [13]
  • Other approximations [14 15 16 (TBD in TF)]
  • Total variation smoothing = $\sum_{\text{classes}} \sum_{x,y} \left( |q_{x+1,y} - q_{x,y}| + |q_{x,y+1} - q_{x,y}| \right)$
  • Adversarial (later)
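
The loss terms listed above can be written out in a few lines. This is a minimal sketch in plain Python, assuming one-hot targets p and probabilities q stored as per-class lists. The choice of loss = 1 - mean soft IoU is only one of the variants listed and is an assumption here, as are the small eps constants.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """-sum over classes and pixels of p * log(q); p, q are [class][pixel]."""
    return -sum(
        pc * math.log(qc + eps)
        for p_cls, q_cls in zip(p, q)
        for pc, qc in zip(p_cls, q_cls)
    )


def soft_iou(p, q, eps=1e-12):
    """Sum over classes of sum(p*q) / sum(p + q - p*q), using probabilities."""
    total = 0.0
    for p_cls, q_cls in zip(p, q):
        inter = sum(pc * qc for pc, qc in zip(p_cls, q_cls))
        union = sum(pc + qc - pc * qc for pc, qc in zip(p_cls, q_cls))
        total += inter / (union + eps)
    return total


def soft_iou_loss(p, q):
    """One possible loss: 1 - mean over classes of the soft IoU (assumption)."""
    return 1.0 - soft_iou(p, q) / len(p)


def total_variation(q_maps):
    """Sum of absolute horizontal and vertical differences; q_maps is [class][y][x]."""
    tv = 0.0
    for q in q_maps:
        height, width = len(q), len(q[0])
        for y in range(height):
            for x in range(width):
                if x + 1 < width:
                    tv += abs(q[y][x + 1] - q[y][x])
                if y + 1 < height:
                    tv += abs(q[y + 1][x] - q[y][x])
    return tv


# Two classes, four pixels; targets are one-hot per pixel.
p = [[1, 1, 0, 0], [0, 0, 1, 1]]
q_uniform = [[0.5] * 4, [0.5] * 4]

print(soft_iou_loss(p, p))          # close to 0 for a perfect prediction
print(soft_iou_loss(p, q_uniform))  # close to 2/3 for a uniform prediction
print(total_variation([[[0, 1, 0], [1, 0, 1]]]))  # checkerboard-like map is penalized
```

Because each class contributes its own ratio, small classes weigh as much as large ones, which is the balancing property noted in the list.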
slide-15
SLIDE 15

Architectures

1. Patchwise CNN
2. FCN
3. DeepLab
4. DeconvNet
5. U-Net
6. SegNet
7. Dilated Convolutions (Yu and Koltun)
8. 100-Layer Tiramisu (DenseNets)
9. Wide ResNet
10. PSPNet
11. Adversarial
12. PolygonRNN
13. Mask R-CNN
14. Semi-supervised with unsupervised loss

slide-16
SLIDE 16

Patchwise CNN

  • Ning et al., http://yann.lecun.com/exdb/publis/pdf/ning-05.pdf
  • Ciresan et al., people.idsia.ch/%7Ejuergen/nips2012.pdf
  • A sliding window CNN classifies each pixel in turn
slide-17
SLIDE 17

Fully Convolutional NN

  • cs231n.github.io/convolutional-networks/#converting-fc-layers-to-conv-layers
  • Start from a CNN classifier
  • Convert fully connected to conv (with filter size = input volume, no padding):
  • CNN -> 7*7*512 -> fc(4096) -> 4096 -> fc(1000) -> 1000
  • CNN -> 7*7*512 -> conv(7*7*4096) -> 1*1*4096 -> conv(1*1*1000) -> 1*1*1000
  • Can take arbitrarily larger input:
  • 224*224 -> 7*7*512 -> 1*1*1000
  • 384*384 -> 12*12*512 -> 6*6*1000
  • Equivalent to sliding a patchwise CNN, but with a single pass that is much more efficient due to shared convolution computations
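
The shape arithmetic above can be reproduced with the standard convolution output-size formula. This sketch assumes a VGG-style backbone with a total downsampling stride of 32 and a 7*7 first converted layer; both values are assumptions chosen to match the numbers on this slide.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def fcn_output_shape(image_size, total_stride=32, fc_kernel=7, num_classes=1000):
    """Output volume of a classifier whose FC layers were converted to convs.

    total_stride and fc_kernel are assumed values (VGG-like backbone).
    """
    feature_size = image_size // total_stride   # e.g. 224 -> 7, 384 -> 12
    out = conv_output_size(feature_size, fc_kernel)  # 7*7 conv, no padding
    return (out, out, num_classes)              # later 1*1 convs keep the size


print(fcn_output_shape(224))  # (1, 1, 1000): a single classification
print(fcn_output_shape(384))  # (6, 6, 1000): a 6*6 grid of classifications
```

Each of the 6*6 outputs corresponds to one 224*224 window of the larger image, which is why the result matches a sliding patchwise classifier.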

slide-18
SLIDE 18

Deconvolution/Upconvolution Layers (Resolution Increasing Convolutions)

  • FC convolution transposed
  • cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
  • Fractionally strided convolution
  • github.com/vdumoulin/conv_arithmetic

Figure labels: Stride = 2; Stride = 1/2; input
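
A one-dimensional sketch may help show why a transposed (fractionally strided) convolution increases resolution. It is plain Python with illustrative function names, not the API of any framework: each input value scatters a scaled copy of the kernel into the output, `stride` positions apart.

```python
def conv_1d(y, kernel, stride=1):
    """Ordinary strided convolution (no padding, no kernel flip)."""
    n_out = (len(y) - len(kernel)) // stride + 1
    return [
        sum(w * y[i * stride + k] for k, w in enumerate(kernel))
        for i in range(n_out)
    ]


def conv_transpose_1d(x, kernel, stride=2):
    """Transposed convolution: scatter kernel copies, stride positions apart."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        for k, wk in enumerate(kernel):
            out[i * stride + k] += xi * wk
    return out


x = [1.0, 2.0, 3.0]
print(conv_transpose_1d(x, [1.0, 1.0]))       # repeats each value twice
print(conv_transpose_1d(x, [0.5, 1.0, 0.5]))  # linear interpolation between values
```

With the kernel [1, 1] the layer reproduces nearest-neighbour upsampling, and with [0.5, 1, 0.5] it interpolates linearly, so a learned kernel can be seen as a generalization of both.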

slide-19
SLIDE 19

Fully Convolutional Network (FCN; 2014-11)

  • Long et al., arxiv.org/abs/1411.4038
  • Shelhamer et al., arxiv.org/abs/1605.06211
  • Start from a classification CNN pre-trained on ImageNet (AlexNet/VGG-16/GoogLeNet) and convert fully connected to conv (conv7)
  • Replace the final layer with 1*1*21 and add bilinear upsampling to get full spatial output (FCN-32s)
  • Add x2 deconv (initialized as bilinear) on conv7 and sum with a conv prediction added to pool4
  • Add bilinear upsampling to get full spatial output (FCN-16s). Fine-tune from FCN-32s
  • Do similarly for the above fuse and pool3 (FCN-8s)
  • Pascal VOC 2012 IoU=62.2%-67.2% (up from 51.6%)
  • 100-175 ms (vs. 50 s)
  • 134M params
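
The "initialized as bilinear" step above can be illustrated with the commonly used formula for a bilinear upsampling kernel. This is a sketch in plain Python. The kernel size of 4 for a x2 deconv is an assumption for illustration, not something stated on the slide.

```python
def bilinear_kernel_1d(size):
    """1D weights of a bilinear upsampling kernel of the given size."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(size)]


def bilinear_kernel_2d(size):
    """2D kernel as the outer product of the 1D weights."""
    w = bilinear_kernel_1d(size)
    return [[a * b for b in w] for a in w]


print(bilinear_kernel_1d(4))  # [0.25, 0.75, 0.75, 0.25]
for row in bilinear_kernel_2d(4):
    print(row)
```

A deconv layer that starts from these weights performs plain bilinear upsampling at first, and training can then adjust it.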
slide-20
SLIDE 20

DeepLab (2014-12)

  • Chen et al. (Google), arxiv.org/abs/1412.7062
  • VGG-16 pre-trained on ImageNet -> fully conv
  • Cancel last two max-pool
  • Change conv after above to x2/x4 dilated convolutions
  • Train with x8 subsampled targets (IoU<90.7%). Infer with bilinear upsampling.
  • Fully connected CRF (raw image dependent potential) post-processing in inference (+ 3%-5%)
  • Add multi-scale layers fine tuned separately (similar to FCN-8s but with concats and convs)
  • Increase dilation for first FC layer to x12 (large field of view) + change FC kernel, filters
  • 20.5M params
  • Pascal VOC 2012 IoU = 71.6%
  • V2: arxiv.org/abs/1606.00915 with ResNet-101 + “atrous spatial pyramid pooling”
  • Pascal VOC 2012 IoU = 79.7% Cityscapes IoU = 70.4%
  • V3: arxiv.org/abs/1706.05587
  • Pascal VOC 2012 IoU = 86.9% Cityscapes IoU = 81.3% (SOTA 2017)

Figure panels: before softmax, after softmax

hole = atrous = dilated convolutions increase field of view without decreasing resolution or adding parameters
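
A one-dimensional sketch of a dilated (atrous) convolution shows how the field of view grows while the number of weights stays fixed. It is plain Python with illustrative names and no padding, so it is not a drop-in replacement for any framework layer.

```python
def effective_kernel_size(kernel_size, dilation):
    """Input span covered by a kernel whose taps are `dilation` apart."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv_1d(x, kernel, dilation=1):
    """Stride-1 convolution with holes between kernel taps (no padding)."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(w * x[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(x) - span + 1)
    ]


x = list(range(10))
kernel = [1, 0, -1]                            # always three weights
print(dilated_conv_1d(x, kernel, dilation=1))  # each output looks at 3 inputs
print(dilated_conv_1d(x, kernel, dilation=2))  # each output looks at 5 inputs
print(effective_kernel_size(3, 12))            # span of a 3-tap kernel at x12 dilation
```

The output keeps stride 1, so resolution is only reduced at the borders by the missing padding, not by subsampling.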
slide-21
SLIDE 21

DeconvNet (2015-05)

  • Noh et al., arxiv.org/abs/1505.04366
  • VGG-16 pre-trained on ImageNet
  • Unpooling layers use saved max pooling indices
  • Symmetric encoder-decoder: multiple deconvolutions + BatchNorm + ReLU (no dropout)
  • Relies on region proposals. Training with two-stage curriculum learning:
  • 1. Instances cropped to GT bounding boxes * 1.2, all non-class pixels labeled as background
  • 2. Object proposals from edge-box * 1.2
  • Inference:
  • Top 50 objectness score of 2000 edge-box object proposals, Max per pixel/class before softmax
  • Fully connected CRF post-processing (+ ~1%)
  • 252M params
  • Pascal VOC 2012 IoU = 70.5%
  • Ensemble with FCN-8s = 72.5%
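
Unpooling with saved max pooling indices, as used by DeconvNet above, can be sketched in one dimension. This is plain Python with illustrative function names. Real implementations work on 2D feature maps, but the idea is the same.

```python
def max_pool_with_indices(x, size=2):
    """Non-overlapping max pooling that also records where each max came from."""
    pooled, indices = [], []
    for start in range(0, len(x) - size + 1, size):
        window = x[start:start + size]
        offset = max(range(size), key=lambda k: window[k])
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, length):
    """Place each pooled value back at its saved position; zeros elsewhere."""
    out = [0] * length
    for value, index in zip(pooled, indices):
        out[index] = value
    return out


x = [1, 3, 2, 0, 5, 4]
pooled, indices = max_pool_with_indices(x)
print(pooled, indices)                      # [3, 2, 5] [1, 2, 4]
print(max_unpool(pooled, indices, len(x)))  # [0, 3, 2, 0, 5, 0]
```

The unpooled map is sparse, which is why the decoder follows each unpooling layer with convolutions that fill in the zeros.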
slide-22
SLIDE 22

U-Net (2015-05)

  • Ronneberger et al., arxiv.org/abs/1505.04597
  • No VGG! Not pre-trained!
  • Skip connections to keep resolution!
  • Separate deconv to: learned 2x2 upconv + (3x3 regular conv + ReLU) * 2
  • Weighting to emphasize areas around morphological edges
  • Implementations I’ve seen use half the filters and padding

Figure label: dropout

slide-23
SLIDE 23

SegNet (2015-11)

  • Badrinarayanan et al., arxiv.org/abs/1505.07293 arxiv.org/abs/1511.00561
  • VGG-16 pre-trained on ImageNet (without fully connected layers)
  • Unpooling layers use saved max pooling indices like in DeconvNet
  • Deconvolutions + BatchNorm + ReLU (some dropout)
  • They compare various decoders, and dropouts (arxiv.org/abs/1511.02680)
  • Pascal VOC 2012 IoU = 59.9%
slide-24
SLIDE 24

Dilated Convolutions (2015-11)

  • Yu and Koltun, arxiv.org/abs/1511.07122
  • Front-end network + Context aggregation network
  • Front-end is a truncated VGG-16 like DeepLab + dilated convs, pre-trained on Pascal VOC 2012
  • Context aggregation is 7 layers of uniform-resolution dilated convs + ReLUs, with increasing dilations, initialized to unit (identity) filters
  • Train with x8 subsampled targets. Front-end is trained first. Then context is added and trained with fixed front-end

  • Possible post-processing with fully connected CRF / CRF-RNN
  • Front-end alone: Pascal VOC 2012 IoU = 71.3%
  • Front-end + Context + CRF-RNN: Pascal VOC 2012 IoU = 75.3%
  • Dilation10: Cityscapes IoU = 67.1%
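
The effect of increasing dilations in the context aggregation module can be illustrated by tracking the receptive field layer by layer. The dilation schedule (1, 1, 2, 4, 8, 16, 1) with 3x3 kernels is recalled from the Yu and Koltun paper and is not stated on this slide, so treat it as an assumption.

```python
def receptive_fields(dilations, kernel_size=3):
    """Receptive field (per side) after each stride-1 dilated conv layer."""
    rf = 1
    sizes = []
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k - 1) * dilation
        sizes.append(rf)
    return sizes


# Assumed schedule for the 7 dilated layers of the context module.
context_dilations = (1, 1, 2, 4, 8, 16, 1)
print(receptive_fields(context_dilations))

# The same depth without dilation covers far less context.
print(receptive_fields((1,) * 7))
```

Under this assumed schedule the receptive field grows roughly exponentially with depth while every layer keeps the same resolution and the same number of weights per filter.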
slide-25
SLIDE 25

The One Hundred Layers Tiramisu (2016-11)

  • Jegou et al., arxiv.org/abs/1611.09326
  • DenseNets (few params, easy training)
  • Encoder-Decoder with skip connections
  • 56 – 103 layers
  • 1.5M – 9.4M params
  • No pre-training
  • No / negative results on large benchmarks

slide-26
SLIDE 26

Wide ResNet (2016-11)

  • Wu et al., arxiv.org/abs/1611.10080
  • Wider or Deeper Resnets? Wider!
  • See also Littwin and Wolf, arxiv.org/abs/1611.02525
  • Wide 7-block ResNet pre-trained for classification, adapted to dilated convolutions à la DeepLab

  • Pascal VOC 2012 IoU = 82.5%
  • Cityscapes IoU = 78.4%
  • ADE20K avg(pixel acc., IoU) = 56.74%
slide-27
SLIDE 27

Pyramid Scene Parsing (PSPNet; 2016-12)

  • Zhao et al., arxiv.org/abs/1612.01105 (trained models: Caffe, Keras)
  • Pre-trained dilated ResNet (101-269 layers) + deep supervision auxiliary loss + pyramid pooling module
  • Pascal VOC 2012 IoU = 85.4% (1st place 2016)
  • Cityscapes IoU = 80.2% (1st place 2016). Video
  • ADE20K avg(pixel acc., IoU) = 57.21% (1st place 2016)

SOTA! (2016)

slide-28
SLIDE 28

Figure panels: Mismatched Relationship, Confusion Categories, Inconspicuous Classes

slide-29
SLIDE 29

Generative Adversarial Nets

Goodfellow et al., arxiv.org/abs/1406.2661

Figure labels: Generator (Creator), Discriminator (Curator): Fake or Real?

slide-30
SLIDE 30

Image to Image Translation

With Conditional Adversarial Networks (PatchGAN)

Isola et al., phillipi.github.io/pix2pix Interactive: affinelayer.com/pixsrv

Guide: ml4a.github.io/guides/Pix2Pix fotogenerator.npocloud.nl

slide-31
SLIDE 31

Adversarial (2016-09)

  • Idea is to regulate naturalness (strong and smooth classes, sharp boundaries, denoising, global structure)

  • David Golan et al. (2016-09, first one AFAIK)
  • Pix2pix, Isola et al., arxiv.org/abs/1611.07004
  • Generator is U-Net style (with skip connections)
  • 4x4 Conv with stride 2 – BatchNorm – ReLU (+ some dropout). No max-pooling.
  • L1 loss for generator
  • “PatchGAN” Discriminator takes both image and segmentation, averages over 70x70 patches
  • Adversarial loss hurts! Cityscapes IoU = 29% 
  • L1 only Cityscapes IoU = 35% 
  • FAIR, Luc et al., arxiv.org/abs/1611.08408
  • Generator is Yu and Koltun’s Dilated8
  • Cross-entropy loss for generator
  • Discriminator issue: we feed it continuous probabilities (cannot do SGD with discrete labels), but GT are discrete
  • Tested product with image and scaling GT, as alternative input to discriminator, but results were the same
  • Pascal VOC 2012 IoU = 73.3% (compare to Yu and Koltun’s 71.3%). Adversarial ~ 2%
  • Several citations using this
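
The 70x70 patch size mentioned for the PatchGAN discriminator can be derived from its layer stack. The layer configuration below (five 4x4 convolutions with strides 2, 2, 2, 1, 1) is recalled from the pix2pix reference implementation and is an assumption here, since the slide only gives the patch size.

```python
def patch_receptive_field(layers):
    """Input patch size seen by one output unit.

    layers: list of (kernel, stride) pairs in input-to-output order.
    """
    rf = 1
    # Walk backwards from a single output unit to the input.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed discriminator stack: (kernel, stride) per conv layer.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(patch_receptive_field(patchgan_layers))
```

Each discriminator output therefore judges one local patch as real or fake, and the per-patch decisions are averaged over the image.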
slide-32
SLIDE 32

PolygonRNN (2017-04)

  • Castrejon et al. CSC2523_Project_Report, arxiv.org/abs/1704.05548
  • Sparse representation using polygons
  • Cityscapes IoU = 61.4% per instance, assuming given bounding boxes
  • Can speed-up manual annotation
  • CVPR 2017 Best Paper Honorable Mention Award (video)

(ConvLSTM)

slide-33
SLIDE 33

Mask R-CNN (2017-03)

  • He et al., arxiv.org/abs/1703.06870 (tutorial)
  • Instance segmentation SOTA
slide-34
SLIDE 34

Semi-Supervised Semantic Segmentation with Unsupervised Total Variation Loss

  • Javanmard et al., arxiv.org/abs/1605.01368

Figure panels: Supervised (10 pix/image), Proposed (10 pix/image), Full labels, GT

slide-35
SLIDE 35

Meta references

  • Janai et al., arxiv.org/abs/1704.05519 (chapter 6)
  • Garcia-Garcia et al., arxiv.org/abs/1704.06857
  • meetshah1995.github.io/semantic-segmentation/deep-learning/pytorch/visdom/2017/06/01/semantic-segmentation-over-the-years

  • blog.qure.ai/notes/semantic-segmentation-deep-learning-review
  • handong1587.github.io/deep_learning/2015/10/09/segmentation
  • github.com/kjw0612/awesome-deep-vision#semantic-segmentation
  • github.com/mrgloom/Semantic-Segmentation-Evaluation
  • github.com/fchollet/keras/issues/6538
slide-36
SLIDE 36

Thanks!

  • Slides: bit.ly/semantic-segmentation
  • Contact: eyal.gruss@gmail.com