Lesson learnt from WebP. What’s next? Pascal Massimino


slide-1
SLIDE 1

Lesson learnt from WebP. What’s next?

Pascal Massimino skal@google.com

slide-2
SLIDE 2

Plan

  • lessons learnt from VP8 -> WebP codec
  • research direction and experiments for “WebP v2”
  • results (+demo?)
slide-3
SLIDE 3

Motivation

WebP, HEIF, AVIF ...

slide-4
SLIDE 4

Motivation

WebP, HEIF, AVIF … most recent image codecs originate from video codecs.

slide-5
SLIDE 5

Motivation

WebP, HEIF, AVIF … most recent image codecs originate from video codecs. Is it always a good choice?

slide-6
SLIDE 6

Lessons learnt from VP8 -> WebP

slide-7
SLIDE 7

Lessons learnt from VP8 -> WebP

Two main use-cases for image compression:

  • “Capture” [device -> storage / CDN]
slide-8
SLIDE 8

Lessons learnt from VP8 -> WebP

Two main use-cases for image compression:

  • “Capture” [device -> storage / CDN]
  • “Web consumption” [CDN -> mobile device]
slide-9
SLIDE 9

Lessons learnt from VP8 -> WebP

Two main use-cases for image compression:

  • “Capture” [device -> storage / CDN]
  • “Web consumption” [CDN -> mobile device]

“WebP”

slide-10
SLIDE 10

Web image format

important peculiarities

slide-11
SLIDE 11

Web image format important peculiarities

  • incremental decoding
  • memory consumption
  • small format overhead
  • interleaved chunk data for early display
  • efficient lossy/lossless transparency
  • efficient lossless coding
  • preview
  • light ‘animation’ format (!= video)
  • efficient in software, more than hardware
slide-12
SLIDE 12

Web image format important peculiarities

  • incremental decoding
  • memory consumption
  • small format overhead
  • interleaved chunk data for early display
  • efficient lossy/lossless transparency
  • efficient lossless coding
  • preview
  • light ‘animation’ format (!= video)
  • efficient in software, more than hardware

WEBP v2 !!

slide-13
SLIDE 13

WebP v2: experimentations

Goal: v2 = like v1 …

“Web-consumption”, not “Capture”.

slide-14
SLIDE 14

WebP v2: experimentations

Goal: v2 = like v1 … but ‘more’.

“Web-consumption”, not “Capture”.

slide-15
SLIDE 15

WebP v2: experimentations

Goal: v2 = like v1 … but ‘more’. And speed.

“Web-consumption”, not “Capture”.

slide-16
SLIDE 16

WebP v2: experimentations

Goal: v2 = like v1 … but ‘more’. And speed. And HDR.

“Web-consumption”, not “Capture”.

slide-17
SLIDE 17

WebP v2: how do we improve upon v1?

What can we do differently than AV1?

slide-18
SLIDE 18

WebP v2: how do we improve upon v1?

  • floating partitioning
  • small-context residual coding
  • non-classic residuals
  • custom predictors
  • CfL
  • lossy/lossless alpha
  • more filters
  • more predictors
  • interruptibility
  • custom CSP transform
  • ANS + adaptive multi-symbol dictionaries
  • tiles
slide-19
SLIDE 19

WebP v2: how do we improve upon v1?

  • floating partitioning [wip]
  • small-context residual coding [go]
  • non-classic residuals [failed so far]
  • custom predictors [failed so far]
  • CfL [go]
  • lossy/lossless alpha [go]
  • more filters [wip]
  • more predictors [failed so far]
  • interruptibility [go]
  • custom CSP transform [go]
  • ANS + adaptive SIMD multi-symbol dictionaries [go]
  • tiles [go]
slide-20
SLIDE 20

WebP v2: how do we improve upon v1?

  • floating partitioning [wip]
  • small-context residual coding [go]
  • non-classic residuals [fail]
  • custom predictors [fail so far]
  • CfL [go]
  • lossy/lossless alpha [go]
  • more filters [wip]
  • more predictors
  • interruptibility [go]
  • custom CSP transform [go]
  • ANS + adaptive multi-symbol dictionaries [go]
  • tiles
slide-21
SLIDE 21

classic AV1 block partitioning

(low quality)

slide-22
SLIDE 22

floating block-partitioning

slide-23
SLIDE 23

floating block-partitioning

  • Parsing order = lexicographic order (X-Y sorted)
  • Buffer = 32 px-high rolling cache (max block = 32x32)
  • Memory = O(32 * tile_width)

[Figure: blocks numbered 1 to 9 in parsing order inside a tile; dimensions labelled “tile width” and “32px”]

slide-24
SLIDE 24

floating block-partitioning

Parsing order != decoding order

Strategy: try to maximize the left-sample availability

[Figure: partition into 16 blocks, numbered 1 to 16]

slide-25
SLIDE 25

floating block-partitioning

Parsing order != decoding order

Strategy: try to maximize the left-sample availability

[Figure: blocks 1 and 2 placed; block 2 marked “!!”]

slide-26
SLIDE 26

floating block-partitioning

Parsing order != decoding order

Strategy: try to maximize the left-sample availability

[Figure: blocks 1 and 2 placed; positions (3), (4), (5) and (6) shown in parentheses]

slide-27
SLIDE 27

floating block-partitioning

Parsing order != decoding order

Strategy: try to maximize the left-sample availability

[Figure: blocks 1 to 5 placed; block 2 marked “!!”; caption “FLUSH!!”]

slide-28
SLIDE 28

floating block-partitioning

Parsing order != decoding order

Strategy: try to maximize the left-sample availability

[Figure: blocks 1 to 5 placed; block 2 marked “!!”; positions (6) and (7) shown in parentheses]

slide-29
SLIDE 29

floating block-partitioning

Parsing order != decoding order

Strategy: try to maximize the left-sample availability

[Figure: blocks 1 to 8 placed]

slide-30
SLIDE 30

floating block-partitioning

Problem: the search space is HUGE

slide-31
SLIDE 31

How to do RD-Opt with this vast search space??

slide-32
SLIDE 32

Floating partitioning algo

Algo for finding a partitioning of a 32x32 section:

  • use variance to label 4x4 blocks with four buckets.

Variance of input 4x4 blocks:

    14.0   12.5   12.0   11.8   11.3    8.1   11.1   10.1
    14.6   12.0   13.3   12.6   11.9    9.9   13.3    8.7
    12.2   14.6   12.6   15.0   10.3    9.2   11.5   11.2
    74.7   80.8  103.0  118.5   80.1   16.6   13.2   20.5
    37.4   33.4   39.2   35.6   34.6   59.8  114.7   93.4
    34.5   29.9   33.1   30.2   33.4   30.0   32.4   25.2
    32.1   29.9   37.1   34.5   34.7   33.7   29.9   21.7
    32.9   31.5   29.6   36.1   35.9   28.7   33.3   29.4

slide-33
SLIDE 33

Floating partitioning algo

Algo for finding a partitioning:

  • use variance to label 4x4 blocks with four buckets.

    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    2 2 3 3 2 0 0 0
    1 1 1 1 1 2 3 2
    1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 0
    1 1 1 1 1 1 1 1

  • lay down boxes with same labels,
  • starting from the largest down to the smallest (finishing fill with 4x4 boxes).
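
The two steps above (label by variance, then lay down same-label boxes from largest to smallest) can be sketched as follows. This is only an illustration under stated assumptions, not the WP2 encoder: the bucket thresholds are hypothetical values chosen because they reproduce the labels shown on the slide, boxes are restricted to squares, and the scan order used to place them is arbitrary.

```python
# Variances of the 8x8 grid of 4x4 blocks, copied from the slide.
VARIANCES = [
    [14.0, 12.5, 12.0, 11.8, 11.3, 8.1, 11.1, 10.1],
    [14.6, 12.0, 13.3, 12.6, 11.9, 9.9, 13.3, 8.7],
    [12.2, 14.6, 12.6, 15.0, 10.3, 9.2, 11.5, 11.2],
    [74.7, 80.8, 103.0, 118.5, 80.1, 16.6, 13.2, 20.5],
    [37.4, 33.4, 39.2, 35.6, 34.6, 59.8, 114.7, 93.4],
    [34.5, 29.9, 33.1, 30.2, 33.4, 30.0, 32.4, 25.2],
    [32.1, 29.9, 37.1, 34.5, 34.7, 33.7, 29.9, 21.7],
    [32.9, 31.5, 29.6, 36.1, 35.9, 28.7, 33.3, 29.4],
]
# Hypothetical bucket boundaries (assumption): they merely reproduce the slide's labels.
THRESHOLDS = (24.0, 50.0, 100.0)
# Box sides in units of 4x4 blocks (32, 16, 8 and 4 pixels), largest first.
BOX_SIDES = (8, 4, 2, 1)


def label(variance):
    """Bucket index 0..3: the number of thresholds the variance reaches."""
    return sum(variance >= t for t in THRESHOLDS)


def partition(variances):
    """Greedily cover the grid with square boxes of uniform label."""
    n = len(variances)
    labels = [[label(v) for v in row] for row in variances]
    used = [[False] * n for _ in range(n)]
    boxes = []  # (x, y, side, label), in units of 4x4 blocks
    for side in BOX_SIDES:
        # "Floating": a box may start at any position, not only at multiples of its side.
        for y in range(n - side + 1):
            for x in range(n - side + 1):
                cells = [(y + dy, x + dx) for dy in range(side) for dx in range(side)]
                if any(used[cy][cx] for cy, cx in cells):
                    continue
                if len({labels[cy][cx] for cy, cx in cells}) != 1:
                    continue
                for cy, cx in cells:
                    used[cy][cx] = True
                boxes.append((x, y, side, labels[y][x]))
    return labels, boxes


labels, boxes = partition(VARIANCES)
```

The final pass with side 1 fills every remaining gap with a single 4x4 block, so the boxes always cover the whole section.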
slide-34
SLIDE 34

Floating partitioning algo

Algo for finding a partitioning:

    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    2 2 3 3 2 0 0 0
    1 1 1 1 1 2 3 2
    1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 0
    1 1 1 1 1 1 1 1

slide-35
SLIDE 35

Floating partitioning algo

Algo for finding a partitioning:

    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    2 2 3 3 2 0 0 0
    1 1 1 1 1 2 3 2
    1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 0
    1 1 1 1 1 1 1 1

slide-36
SLIDE 36

Floating partitioning algo

  • Variance isn’t necessarily a good metric
  • too many ‘small’ blocks for filling gaps
  • so many other algos to try!
slide-37
SLIDE 37

Floating partitioning algo

-> Still a lot of potential

trading geometry vs residuals

slide-38
SLIDE 38

Residual coding

slide-39
SLIDE 39

Residual coding

Bounds: use an Adaptive Bit to say if the residuals are bounded in X/Y. If bounded, store the bounds as a range.

Residual: parse as zigzag but skip anything that is outside the box:

[Figure: block of residual coefficients inside its bounding box; values shown: 3 1 3 -1 -2 2 -1 -6 1 4 -1 -1 -1]
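
A minimal sketch of the "zigzag restricted to the bounding box" idea, assuming a square block stored as rows of integers. It only shows which coefficients get visited and in which order; the adaptive bits and the entropy coding of the bounds and values are left out, and the function names are hypothetical.

```python
def zigzag(n):
    """Positions (x, y) of an n x n block in zigzag order, one anti-diagonal at a time."""
    order = []
    for s in range(2 * n - 1):
        diag = [(x, s - x) for x in range(n) if 0 <= s - x < n]
        # Alternate the direction of travel on every anti-diagonal.
        order.extend(diag if s % 2 == 0 else reversed(diag))
    return order


def bounds(block):
    """Width and height of the smallest top-left box holding all non-zero residuals."""
    nonzero = [(x, y) for y, row in enumerate(block) for x, v in enumerate(row) if v]
    if not nonzero:
        return 0, 0
    return max(x for x, _ in nonzero) + 1, max(y for _, y in nonzero) + 1


def scan(block):
    """Residuals in zigzag order, skipping every position outside the bounding box."""
    width, height = bounds(block)
    return [block[y][x] for x, y in zigzag(len(block)) if x < width and y < height]


# Made-up 4x4 block whose non-zero residuals fit in a 2x2 box.
example = [
    [3, 1, 0, 0],
    [-1, 2, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
visited = scan(example)
```

Here only 4 of the 16 positions are visited, which is the saving the bounds are meant to buy.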

slide-40
SLIDE 40

Residual coding

[Figure: block of residual coefficients inside its bounding box; values shown: 3 1 3 -1 -2 2 -1 -6 1 4 -1 -1 -1 -1]

EOB: Adaptive Bit, but only if we have already touched both sides of the bounding box.

Only 1s after: when finding a 1, an Adaptive Bit (ABit) indicates whether all elements after it are 1s.

slide-41
SLIDE 41

Custom CSP transform

slide-42
SLIDE 42

Custom CSP transform

Use PCA to tight-fit the color transform matrix.
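
As a rough illustration of fitting a color axis to the image content, the sketch below computes the RGB covariance of a set of pixels and finds its dominant eigenvector by power iteration. This is an assumption-laden toy: the slides do not say how WP2 derives or stores its matrix, a full transform would use all three eigenvectors, and the function names are hypothetical.

```python
import math


def covariance(pixels):
    """Mean and 3x3 covariance matrix of a list of (r, g, b) pixels."""
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    cov = [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pixels) / n for j in range(3)]
        for i in range(3)
    ]
    return mean, cov


def principal_axis(cov, iterations=50):
    """Dominant eigenvector of a symmetric 3x3 matrix, found by power iteration."""
    v = [1.0, 0.5, 0.25]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # flat image: no preferred axis
            break
        v = [x / norm for x in w]
    return v


# Grey-level pixels vary only along the (1, 1, 1) direction.
greys = [(g, g, g) for g in (10, 50, 90, 200)]
_, cov = covariance(greys)
axis = principal_axis(cov)
```

For the grey pixels, the recovered axis is the normalized (1, 1, 1) direction, i.e. a luma-like first channel.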

slide-43
SLIDE 43

Lossy-lossless alpha mix

slide-44
SLIDE 44

Lossy-lossless alpha mix

slide-45
SLIDE 45

Lossy-lossless alpha mix

slide-46
SLIDE 46

Triangle-based preview

218 bytes. In the header.

slide-47
SLIDE 47

Triangle-based preview

ICIP 2018 Paper.

slide-48
SLIDE 48

WebP v2:

results so far

slide-49
SLIDE 49

WebP v2: results so far. The Good.

slide-50
SLIDE 50

WebP v2: results so far. The Bad.

slide-51
SLIDE 51

WebP v2: results so far. The Ugly.

slide-52
SLIDE 52

also good

slide-53
SLIDE 53

Syntactic decomposition

AV1

slide-54
SLIDE 54

Syntactic decomposition

WP2

block size coding seems more efficient! at the detriment of block header

trading geometry vs residuals!

slide-55
SLIDE 55

Enc Speed comparison

> ./examples/rd_curve kodim19.png -nomt -av1 -jpeg -webp -ssim

Columns: Q, size (bytes), bpp, psnr (dB), SSIM*, enc-time (sec), dec-time (sec)

WP2
     Q    size   bpp   psnr  SSIM*    enc   dec
   0.0    5074  0.10  27.07   6.50   1.79  0.10
  12.1    5776  0.12  27.50   6.69   1.86  0.10
  24.3    6834  0.14  28.24   6.99   1.81  0.09
  36.4    8308  0.17  29.04   7.32   1.83  0.09
  48.6   11780  0.24  30.17   7.96   1.70  0.11
  60.7   17264  0.35  31.79   9.04   1.79  0.11
  72.9   28386  0.58  34.12  10.80   1.92  0.10
  85.0   65536  1.33  39.15  14.45   2.28  0.11

WebP
     Q    size   bpp   psnr  SSIM*    enc   dec
   0.0    5028  0.10  26.49   6.44   0.04  0.00
  12.1   13026  0.27  30.42   8.10   0.04  0.00
  24.3   18850  0.38  31.72   9.09   0.03  0.00
  36.4   24882  0.51  32.88  10.06   0.04  0.00
  48.6   31518  0.64  34.04  11.04   0.04  0.00
  60.7   37818  0.77  34.99  11.79   0.04  0.00
  72.9   44738  0.91  35.93  12.52   0.05  0.00
  85.0   73180  1.49  38.92  14.84   0.05  0.01

AV1
     Q    size   bpp   psnr  SSIM*    enc   dec
   0.0    8305  0.17  30.15   7.98   5.28  0.02
  12.1   29446  0.60  35.15  11.92  12.23  0.02
  24.3   47852  0.97  37.74  14.02  18.20  0.03
  36.4   54919  1.12  38.48  14.61  20.71  0.03
  48.6   54919  1.12  38.48  14.61  20.71  0.04
  60.7   54919  1.12  38.48  14.61  20.86  0.03
  72.9   54919  1.12  38.48  14.61  20.95  0.03
  85.0   54919  1.12  38.48  14.61  21.22  0.03

JPEG
     Q    size   bpp   psnr  SSIM*    enc   dec
   0.0    4315  0.09  22.65   5.12   0.01  0.00
  12.1   11653  0.24  28.51   7.17   0.01  0.00
  24.3   19015  0.39  30.71   8.55   0.01  0.00
  36.4   25183  0.51  31.94   9.38   0.01  0.00
  48.6   30969  0.63  32.97  10.12   0.02  0.00
  60.7   36423  0.74  33.78  10.72   0.01  0.00
  72.9   46192  0.94  35.07  11.67   0.02  0.00
  85.0   65399  1.33  37.25  13.18   0.02  0.00

Encoding time (jpeg = ref):

  • WebP: 3x
  • WP2: 120x
  • AV1: 1200x

slide-56
SLIDE 56

WebP v2: demo

[video]

slide-57
SLIDE 57

Conclusion

Plan for 2020:

  • finalize the decoding tools for experiments
  • release the code base as starting point
slide-58
SLIDE 58

Thanks!

Questions?

slide-59
SLIDE 59

Extra material

slide-60
SLIDE 60

incremental decoding

using fiber / coroutines to pass control around between codec and network.

slide-61
SLIDE 61

[Diagram: timeline (time / CPU usage) with lanes for the user (calling site), Codec::Read(data) (main context) and Codec::Decode() (local context). The bitstream is split into an available chunk and not-yet-available data. Codec::Read(data) calls CreateLocalContext(); ANSDec::ReadNextWord() succeeds three times, then blocks; the local context calls Yield() to give execution control back; the user calls WaitForNewPacket() until a new data chunk arrives. An output buffer is shown.]

return Status::Suspended;

slide-62
SLIDE 62

[Diagram: same timeline after a new data chunk has arrived. The bitstream shows discarded data, the available chunk and a part that is still not there. Codec::Read(data) calls Resume(); the ANSDec::ReadNextWord() that was blocking is now successful, another one succeeds, then one blocks; the local context calls Yield(); the user calls WaitForNewPacket(). An output buffer is shown.]

return Status::Suspended;

slide-63
SLIDE 63

[Diagram: same timeline for the last data chunk. The bitstream shows discarded data and the available chunk. Codec::Read(data) calls Resume(), decoding finishes, then Close(); the user receives OnDecodedImage(). An output buffer is shown.]

return Status::Decoded;

slide-64
SLIDE 64

Incremental decoding

Don’t assume you have the complete data for the whole frame

One must be able to quickly suspend / resume the decoding with as little work as possible

  -> check points
  -> coroutines in the bit-reader’s TryReadNext()

Corollary: good decoding error trapping and reporting is critical
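
The suspend / resume pattern can be illustrated with a Python generator standing in for the coroutine. This is a toy model, not the WP2 API: the class and method names loosely mirror the labels in the diagrams (Codec::Read, Status::Suspended, Status::Decoded), and the "decoding" is just copying a fixed number of bytes.

```python
from enum import Enum


class Status(Enum):
    SUSPENDED = 1
    DECODED = 2


def decode(buffer, num_words):
    """Coroutine body: consume num_words bytes, yielding whenever data is missing."""
    out = []
    pos = 0
    while len(out) < num_words:
        while pos >= len(buffer):
            yield  # the read would block: hand control back to the caller
        out.append(buffer[pos])
        pos += 1
    return out


class Codec:
    """Hypothetical incremental decoder driven by successive read() calls."""

    def __init__(self, num_words):
        self.buffer = bytearray()
        self.output = None
        self._task = decode(self.buffer, num_words)  # decoding state lives in the coroutine

    def read(self, data):
        if self.output is not None:
            return Status.DECODED
        self.buffer.extend(data)  # new packet from the network
        try:
            next(self._task)  # resume exactly where decoding stopped
        except StopIteration as done:
            self.output = done.value
            return Status.DECODED
        return Status.SUSPENDED


codec = Codec(num_words=4)
first = codec.read(b"\x01\x02")       # not enough data yet
second = codec.read(b"")              # still not there
third = codec.read(b"\x03\x04\x05")   # enough data: decoding completes
```

Because the generator keeps its local variables between calls, resuming costs nothing beyond the call itself, which is the property the slide asks for.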

slide-65
SLIDE 65

Memory consumption

  • Video decoding = several buffers (Ref, Alt-ref, etc.)
  • WebP = O(width) memory consumption
  • Blit to screen ASAP
  • animation = 1 buffer only

slide-66
SLIDE 66

Hardware = difficult for images

Hardware decoding is:

  • per-frame oriented, non-interruptible
  • tricky to re-configure
  • non-parallelizable
  • unstable, sandboxed
  • has transfer overhead
slide-67
SLIDE 67

Hardware = difficult for images

WebP experiment with Android vp8 hardware:

  • Only 50% faster, but a lot of extra system complexity

-> Let’s target software decoding!