Compressing Coldbox Data - Ivan K. Furic, Remington Gerras, University of Florida (PowerPoint presentation)

SLIDE 1

Compressing Coldbox Data

Ivan K. Furic, Remington Gerras University of Florida

slide-2
SLIDE 2

ProtoDUNE-SP TDR:

  • Lossless compression factor = 4
  • Implies reduction from 12 bits per ADC readout to 3 bits per ADC readout
  • In the rest of this talk, we discuss not compression factors but average bits per ADC readout
  • Hence, keep in mind:
  • “3 bits” = TDR spec (compression factor 4)
  • “4 bits” = compression factor 3
  • “6 bits” = compression factor 2
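The mapping between the two conventions is just 12 raw bits divided by the compression factor; a one-line check:

```python
# Each raw ADC readout is 12 bits, so average bits per readout and
# compression factor are related by: bits = 12 / factor.
for bits, factor in [(3, 4), (4, 3), (6, 2)]:
    assert 12 / bits == factor
print("3 bits = factor 4, 4 bits = factor 3, 6 bits = factor 2")
```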
SLIDE 3

How well does a generic algorithm work?

  • ROOT’s native compression for 10 events, 1536 channels
  • 10k ADC readouts per channel per event, 2 bytes per ADC readout
  • Compressed: avg 5.73 bits per ADC readout [effective compression factor 2.1, half of the TDR spec]

SLIDE 4

Using “gzip -9” explicitly

  • Store data for a single channel in a file, compress
  • Performance depends on how the bits are packed in the file
  • Convention in figures below: 12 bits = 3 nibbles: H,M,L
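The packing-dependence can be illustrated on synthetic data with Python’s zlib (level 9 is the in-memory analogue of “gzip -9”); the Gaussian pedestal parameters below are assumptions for illustration, not the actual coldbox values:

```python
import zlib
import random

random.seed(1)
# Synthetic stand-in for one channel: 12-bit ADC samples with a
# Gaussian pedestal (mean 2048, RMS 7 assumed), clipped to [0, 4095].
samples = [max(0, min(4095, round(random.gauss(2048, 7)))) for _ in range(100_000)]

def pack_nibble_planes(vals):
    """Split each 12-bit value into H, M, L nibbles and store the three
    nibble planes as separate byte streams (two nibbles per byte)."""
    planes = []
    for shift in (8, 4, 0):  # H, M, L nibbles
        nibbles = [(v >> shift) & 0xF for v in vals]
        planes.append(bytes((nibbles[i] << 4) | nibbles[i + 1]
                            for i in range(0, len(nibbles), 2)))
    return b"".join(planes)

def pack_little_endian(vals):
    """Naive packing: each 12-bit value in two little-endian bytes."""
    return b"".join(v.to_bytes(2, "little") for v in vals)

for name, blob in [("nibble planes", pack_nibble_planes(samples)),
                   ("2-byte words ", pack_little_endian(samples))]:
    comp = zlib.compress(blob, 9)  # level 9 ~ "gzip -9"
    print(f"{name}: {8 * len(comp) / len(samples):.2f} bits per ADC readout")
```

The achieved bits per readout differ between the two packings even though the underlying values are identical, which is the effect noted above.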
SLIDE 5

What RMS will compress into 3 bits?

  • Consider the “ideal” case for compression: a uniform distribution of values
  • A uniform distribution across D consecutive discrete values has an RMS of σ = D/√12; conversely, D = σ√12 is the width of the flat distribution needed for a given σ
  • To encode D discrete values, one requires log2(D) bits:
  • n_bits = log2(D) = log2(σ√12) = log2(σ) + ½ log2(12) = log2(σ) + 1.8
  • In order to encode into 3 bits of data, the RMS of the distribution can’t be more than 2.3 ADC counts
  • Observed pedestal RMSs are 6-8 ADC counts
  • Encoding raw values will not provide the desired compression
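The numbers on this slide follow directly from the formula; a quick check:

```python
import math

def bits_for_uniform_rms(rms):
    """Bits needed for a flat distribution whose RMS is `rms`:
    width D = rms * sqrt(12), so n_bits = log2(D) = log2(rms) + 1.8."""
    return math.log2(rms * math.sqrt(12))

# Invert the formula: the largest RMS that still fits in 3 bits.
max_rms_3bits = 2 ** 3 / math.sqrt(12)
print(f"max RMS for 3 bits: {max_rms_3bits:.2f} ADC counts")  # ~2.31

# Observed pedestal RMSs of 6-8 counts need noticeably more than 3 bits:
for rms in (6, 8):
    print(f"RMS {rms}: {bits_for_uniform_rms(rms):.2f} bits")
```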
SLIDE 6

Information Theory limits on compression

  • For a stochastic noiseless source emitting a set of symbols with frequencies p_i, the minimum average number of bits per symbol is the (Shannon) entropy: H = -Σ_i p_i log2(p_i)
  • Shannon, Claude E. (July–October 1948). “A Mathematical Theory of Communication”. Bell System Technical Journal. 27 (3): 379–423.
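As a minimal sketch, the entropy of an empirical symbol stream can be computed directly from the definition:

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Average bits per symbol for an i.i.d. source with the empirical
    frequencies of `symbols`: H = -sum_i p_i * log2(p_i)."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A fair coin carries exactly 1 bit per symbol...
print(shannon_entropy([0, 1] * 500))
# ...and 8 equally likely symbols carry log2(8) = 3 bits.
print(shannon_entropy(list(range(8)) * 100))
```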
SLIDE 7

Gaussian distributed discrete random values

  • Huffman compression achieves Shannon-entropy-level performance
  • Need RMS of 2 bins to compress into 3 bits
  • RMS of 4 bins should compress into 4 bits
  • RMSs of 6-8 bins should compress into 4.6-5.0 bits
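These bullet numbers can be reproduced by computing the Shannon entropy of a Gaussian discretized into unit-width ADC bins (a sketch, assuming the noise is well modeled as Gaussian):

```python
import math

def gaussian_discrete_entropy(sigma):
    """Shannon entropy (bits/symbol) of a Gaussian with RMS `sigma`,
    discretized into unit-width ADC bins: H = -sum_i p_i log2(p_i)."""
    # Sum over enough integer bins to capture essentially all probability.
    half_range = int(20 * sigma)
    norm = 1.0 / (sigma * math.sqrt(2 * math.pi))
    probs = [norm * math.exp(-0.5 * (k / sigma) ** 2)
             for k in range(-half_range, half_range + 1)]
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs)

for sigma in (2, 4, 6, 8):
    print(f"RMS {sigma} bins: {gaussian_discrete_entropy(sigma):.2f} bits")
```

This reproduces the slide: RMS 2 gives about 3 bits, RMS 4 about 4 bits, and RMS 6-8 about 4.6-5.0 bits.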

SLIDE 8

Variable Distributions, Run #1287

  • Consider three variables as targets to encode using a compression algorithm:
  • Raw ADC counts: X_n
  • Difference wrt previous count: X_n - X_{n-1}
  • Difference wrt linear prediction (based on the previous two counts): X_n - 2X_{n-1} + X_{n-2}
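A sketch of how the three target variables are constructed, using a synthetic waveform (pedestal plus slow drift plus white noise, all assumed parameters) in place of the actual run data; on a waveform dominated by slow structure, both derived variables shrink the RMS relative to the raw counts, though the ordering on real coldbox data depends on the noise spectrum:

```python
import math
import random

random.seed(7)
# Toy stand-in for one channel: pedestal 2048, a slow sinusoidal drift,
# and a small white-noise term (all assumed values, for illustration).
x = [round(2048 + 30 * math.sin(2 * math.pi * n / 500) + random.gauss(0, 1))
     for n in range(10_000)]

# The three candidate variables from the slide:
raw = x[2:]                                                # X_n
diff = [x[n] - x[n - 1] for n in range(2, len(x))]         # X_n - X_{n-1}
lpred = [x[n] - 2 * x[n - 1] + x[n - 2]                    # residual of the
         for n in range(2, len(x))]                        # prediction 2X_{n-1} - X_{n-2}

def rms(vals):
    mean = sum(vals) / len(vals)
    return (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5

for name, v in [("raw ADC", raw), ("difference", diff), ("lin. prediction", lpred)]:
    print(f"{name}: RMS = {rms(v):.2f}")
```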

SLIDE 9

Variable Distribution RMS’s:

[Figure: per-channel RMS distributions of the three variables: raw ADC counts, difference, linear prediction]

SLIDE 10

Truncated Huffman compression

  • Raw ADC counts: tree encodes the values seen in the event
  • For the target variables, expect most values to lie in the range [-16, 16]
  • Huffman-encode only this window
  • RAW + target: have an additional (13-14 bit) Huffman code for “value outside range”, followed by the full 12-bit value
  • 25-bit penalty for data not under control
  • compression performance will be worse than the Shannon entropy
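The truncated scheme can be sketched as follows; the stream, window, and escape-code details here are illustrative assumptions, not the actual firmware encoding:

```python
import heapq
import random
from collections import Counter
from itertools import count

def huffman_code_lengths(freqs):
    """Code length in bits per symbol for a Huffman tree built from `freqs`."""
    tiebreak = count()  # unique tiebreaker so dicts are never compared
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak),
                              {s: d + 1 for s, d in {**a, **b}.items()}))
    return heap[0][2]

random.seed(3)
# Toy residual stream: mostly small values (pedestal-like noise) plus a
# few large excursions standing in for signal pulses.
resid = [round(random.gauss(0, 4)) for _ in range(50_000)]
resid += [random.randint(-2000, 2000) for _ in range(500)]

ESC = "esc"  # escape symbol for values outside the Huffman window
freqs = Counter(v for v in resid if -16 <= v <= 16)
freqs[ESC] = sum(1 for v in resid if not -16 <= v <= 16)
lengths = huffman_code_lengths(freqs)

total_bits = 0
for v in resid:
    if -16 <= v <= 16:
        total_bits += lengths[v]
    else:
        total_bits += lengths[ESC] + 12  # escape code, then the raw 12-bit value
print(f"avg {total_bits / len(resid):.2f} bits per readout")
```

Only the [-16, 16] window gets Huffman codes; anything else pays the escape-plus-raw-value penalty, which is why performance sits slightly above the Shannon entropy when out-of-range values occur.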
SLIDE 11
Performance on Run #1287

  • Green = Shannon entropy
  • Blue = Channel+Event specific Huffman Trees
  • Red = one (random) Huffman Tree used for all data

[Figure: distributions of avg bits per ADC word, observed per channel, per event; panels: Encode Raw Values | Encode Differences | Encode wrt Linear Prediction]

  • Raw data requires lots of custom Huffman Trees
  • Encoding the diff wrt linear prediction works best (avg less than 4 bits per ADC word)

SLIDE 12

Performance Loss For Generic Trees

  • For the two target variables, we lose a fraction of a bit in performance
  • The linear predictor’s loss is better contained, i.e. its performance is more predictable

[Figure panels: Encode Differences | Encode wrt Linear Prediction]

SLIDE 13

Raw ADC Value Correlation Factors

  • Reproduced correlations observed by Tom in run 973
  • Data in run 1287 appears to be much less correlated
SLIDE 14

What’s different between the two runs?

  • Run 1287 has no correlation factors greater than ~10%
  • Run 973 has a significant tail in the RMS distribution
  • Possibly due to slow noise in the electronics?

[Figure: raw ADC channel-channel correlation factors, Run #1287 vs Run #973]

SLIDE 15

Example: Anti-correlation from slow noise

  • Waveform for first event, channels 1199 vs 1216
  • Causes a significant increase in RMS; the two channels are almost 100% anti-correlated
SLIDE 16

Comparison of variable RMS’s per channel:

  • In run 973 the overall behavior of the target variables is “better” than in run 1287
  • Expect run 973 to compress better than run 1287
SLIDE 17

Compression performance on run 1287 vs 973

  • Encoding Difference wrt previous ADC count
SLIDE 18

Compression Performance, run 1287 vs 973, cont’d

  • Encoding difference wrt Linear Prediction
SLIDE 19

Estimated Event Size

  • ProtoDUNE-SP TDR spec is to compress 230.4 MB of TDC data into 57.6 MB
  • Run compression test on 10 events, for both runs, record #bits used
  • Run 1287 conveniently reads out 1536 channels, 1/10th of full protoDUNE-SP
  • Run 973 has 2304 channels reading out, scale numbers by 1536/2304

Run Number    | Difference, Custom Trees | Difference, Single Tree | Linear Prediction, Custom Trees | Linear Prediction, Single Tree | Size wrt TDR Spec
1287          | 72.5 MB                  | 73.4 MB                 | 71.5 MB                         | 72.2 MB                        | +25%
0973 (scaled) | 70.3 MB                  | 71.1 MB                 | 70.3 MB                         | 70.4 MB                        | +22%

  • 25% larger event size than required by TDR spec
  • ADC readout encoded on avg in 3.75 bits (TDR spec is 3)
  • Compression factor 3.20 (TDR spec is 4)
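The three summary numbers are consistent with each other; a quick arithmetic check (72.0 MB is taken as a representative measured size from the table):

```python
# Check of the event-size arithmetic quoted above.
raw_mb, tdr_mb = 230.4, 57.6   # TDR: compress 230.4 MB of TDC data to 57.6 MB
measured_mb = 72.0             # representative measured compressed size, run 1287

assert abs(raw_mb / 4 - tdr_mb) < 1e-9   # TDR spec = compression factor 4
factor = raw_mb / measured_mb            # achieved compression factor
bits = 12 / factor                       # average bits per 12-bit ADC readout
excess = measured_mb / tdr_mb - 1        # size relative to TDR spec
print(f"factor {factor:.2f}, {bits:.2f} bits/readout, +{excess:.0%} wrt spec")
```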
SLIDE 20

Conclusions, so far

  • Evaluated compression performance on coldbox data
  • Found two good candidate variables for encoding
  • Evaluated encoding with “truncated” Huffman compression
  • Found approach to be generic and robust
  • ~1% penalty for sub-optimal encoding tree, even across events
  • Expect similar performance for a hard-coded common tree for all channels, all events (simplifies firmware implementation)

  • No performance loss in presence of “slow” noise
  • Estimate compressed event size to be 25% larger than TDR spec
  • No significant channel noise cross-correlation observed (in run #1287)
  • Likely not much to gain from combining information across channels
  • Found promising correlations with ADC counts earlier in the stream (further reduce avg RMS by 10%, i.e. 5% better compression)

SLIDE 21

Plans

  • Check cross-channel correlation between encoding variables
  • Re-check gzip performance on larger sample of events
  • Attempt to utilize information from earlier in the stream to further shrink the target variable RMS

  • Choose single, hardcoded compression tree
  • Optimize decompression algorithm for speed, report performance
  • Study per-event compression performance on a larger sample (e.g. the entire run 1287)

  • Try “gzip -9” on the compressed output
  • Any other tests?
  • Report back with final findings, document