Compressing Coldbox Data - Ivan K. Furic, Remington Gerras, University of Florida (PowerPoint presentation)

SLIDE 1

Compressing Coldbox Data

Ivan K. Furic, Remington Gerras University of Florida

slide-2
SLIDE 2

ProtoDUNE-SP TDR:

  • Lossless compression factor = 4
  • Implies reduction from 12 bits per ADC readout to 3 bits per ADC readout
  • In the rest of this talk, we discuss not compression factors but average bits per ADC readout
  • Hence, keep in mind:
  • “3 bits” = TDR spec (compression factor 4)
  • “4 bits” = compression factor 3
  • “6 bits” = compression factor 2
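The mapping between the two conventions is just 12 raw bits divided by the compression factor; a one-line check:

```python
# Each raw ADC readout is 12 bits, so average bits per readout and
# compression factor are related by: bits = 12 / factor.
for bits, factor in [(3, 4), (4, 3), (6, 2)]:
    assert 12 / bits == factor
print("3 bits = factor 4, 4 bits = factor 3, 6 bits = factor 2")
```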
SLIDE 3

How well does a generic algorithm work?

  • ROOT’s native compression for 10 events, 1536 channels
  • 10k ADC readouts per channel per event, 2 bytes per ADC readout
  • Compressed: avg 5.73 bits per ADC readout [effective compression factor 2.1, half of the TDR spec]

SLIDE 4

Using “gzip -9” explicitly

  • Store data for a single channel in a file, compress
  • Performance depends on how the bits are packed in the file
  • Convention in figures below: 12 bits = 3 nibbles: H,M,L
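The packing-dependence can be illustrated on synthetic data with Python’s zlib (level 9 is the in-memory analogue of “gzip -9”); the Gaussian pedestal parameters below are assumptions for illustration, not the actual coldbox values:

```python
import zlib
import random

random.seed(1)
# Synthetic stand-in for one channel: 12-bit ADC samples with a
# Gaussian pedestal (mean 2048, RMS 7 assumed), clipped to [0, 4095].
samples = [max(0, min(4095, round(random.gauss(2048, 7)))) for _ in range(100_000)]

def pack_nibble_planes(vals):
    """Split each 12-bit value into H, M, L nibbles and store the three
    nibble planes as separate byte streams (two nibbles per byte)."""
    planes = []
    for shift in (8, 4, 0):  # H, M, L nibbles
        nibbles = [(v >> shift) & 0xF for v in vals]
        planes.append(bytes((nibbles[i] << 4) | nibbles[i + 1]
                            for i in range(0, len(nibbles), 2)))
    return b"".join(planes)

def pack_little_endian(vals):
    """Naive packing: each 12-bit value in two little-endian bytes."""
    return b"".join(v.to_bytes(2, "little") for v in vals)

for name, blob in [("nibble planes", pack_nibble_planes(samples)),
                   ("2-byte words ", pack_little_endian(samples))]:
    comp = zlib.compress(blob, 9)  # level 9 ~ "gzip -9"
    print(f"{name}: {8 * len(comp) / len(samples):.2f} bits per ADC readout")
```

The achieved bits per readout differ between the two packings even though the underlying values are identical, which is the effect noted above.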
SLIDE 5

What RMS will compress into 3 bits?

  • Consider the “ideal” case for compression: a uniform distribution of values
  • A uniform distribution across D consecutive discrete values has an RMS of σ = D/√12; conversely, D = σ√12 is the width of the flat distribution needed for a given σ
  • To encode D discrete values, one requires log2(D) bits:
  • n_bits = log2(D) = log2(σ√12) = log2(σ) + ½ log2(12) = log2(σ) + 1.8
  • In order to encode into 3 bits of data, the RMS of the distribution can’t be more than 2.3 ADC counts
  • Observed pedestal RMSs are 6-8 ADC counts
  • Encoding raw values will not provide the desired compression
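The numbers on this slide follow directly from the formula; a quick check:

```python
import math

def bits_for_uniform_rms(rms):
    """Bits needed for a flat distribution whose RMS is `rms`:
    width D = rms * sqrt(12), so n_bits = log2(D) = log2(rms) + 1.8."""
    return math.log2(rms * math.sqrt(12))

# Invert the formula: the largest RMS that still fits in 3 bits.
max_rms_3bits = 2 ** 3 / math.sqrt(12)
print(f"max RMS for 3 bits: {max_rms_3bits:.2f} ADC counts")  # ~2.31

# Observed pedestal RMSs of 6-8 counts need noticeably more than 3 bits:
for rms in (6, 8):
    print(f"RMS {rms}: {bits_for_uniform_rms(rms):.2f} bits")
```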
SLIDE 6

Information Theory limits on compression

  • For a stochastic noiseless source emitting a set of symbols with frequencies p_i, the minimum average number of bits per symbol is the (Shannon) entropy: H = -Σ_i p_i log2(p_i)
  • Shannon, Claude E. (July–October 1948). “A Mathematical Theory of Communication”. Bell System Technical Journal. 27 (3): 379–423.
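As a minimal sketch, the entropy of an empirical symbol stream can be computed directly from the definition:

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Average bits per symbol for an i.i.d. source with the empirical
    frequencies of `symbols`: H = -sum_i p_i * log2(p_i)."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A fair coin carries exactly 1 bit per symbol...
print(shannon_entropy([0, 1] * 500))
# ...and 8 equally likely symbols carry log2(8) = 3 bits.
print(shannon_entropy(list(range(8)) * 100))
```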
SLIDE 7

Gaussian distributed discrete random values

  • Huffman compression achieves Shannon-entropy-level performance
  • Need RMS of 2 bins to compress into 3 bits
  • RMS of 4 bins should compress into 4 bits
  • RMSs of 6-8 bins should compress into 4.6-5.0 bits
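These bullet numbers can be reproduced by computing the Shannon entropy of a Gaussian discretized into unit-width ADC bins (a sketch, assuming the noise is well modeled as Gaussian):

```python
import math

def gaussian_discrete_entropy(sigma):
    """Shannon entropy (bits/symbol) of a Gaussian with RMS `sigma`,
    discretized into unit-width ADC bins: H = -sum_i p_i log2(p_i)."""
    # Sum over enough integer bins to capture essentially all probability.
    half_range = int(20 * sigma)
    norm = 1.0 / (sigma * math.sqrt(2 * math.pi))
    probs = [norm * math.exp(-0.5 * (k / sigma) ** 2)
             for k in range(-half_range, half_range + 1)]
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs)

for sigma in (2, 4, 6, 8):
    print(f"RMS {sigma} bins: {gaussian_discrete_entropy(sigma):.2f} bits")
```

This reproduces the slide: RMS 2 gives about 3 bits, RMS 4 about 4 bits, and RMS 6-8 about 4.6-5.0 bits.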

SLIDE 8

Variable Distributions, Run #1287

  • Consider three variables as targets to encode using a compression algorithm:
  • Raw ADC counts: X_n
  • Difference wrt previous count: X_n - X_{n-1}
  • Difference wrt linear prediction (based on the previous two counts): X_n - 2X_{n-1} + X_{n-2}
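A sketch of how the three target variables are constructed, using a synthetic waveform (pedestal plus slow drift plus white noise, all assumed parameters) in place of the actual run data; on a waveform dominated by slow structure, both derived variables shrink the RMS relative to the raw counts, though the ordering on real coldbox data depends on the noise spectrum:

```python
import math
import random

random.seed(7)
# Toy stand-in for one channel: pedestal 2048, a slow sinusoidal drift,
# and a small white-noise term (all assumed values, for illustration).
x = [round(2048 + 30 * math.sin(2 * math.pi * n / 500) + random.gauss(0, 1))
     for n in range(10_000)]

# The three candidate variables from the slide:
raw = x[2:]                                                # X_n
diff = [x[n] - x[n - 1] for n in range(2, len(x))]         # X_n - X_{n-1}
lpred = [x[n] - 2 * x[n - 1] + x[n - 2]                    # residual of the
         for n in range(2, len(x))]                        # prediction 2X_{n-1} - X_{n-2}

def rms(vals):
    mean = sum(vals) / len(vals)
    return (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5

for name, v in [("raw ADC", raw), ("difference", diff), ("lin. prediction", lpred)]:
    print(f"{name}: RMS = {rms(v):.2f}")
```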

SLIDE 9

Variable Distribution RMS’s:

[Figure: per-channel RMS distributions of the three variables: raw ADC counts, difference, linear prediction]

SLIDE 10

Truncated Huffman compression

  • Raw ADC counts: tree encodes the values seen in the event
  • For the target variables, expect most values to lie in the range [-16, 16]
  • Huffman-encode only this window
  • RAW + target: have an additional (13-14 bit) Huffman code for “value outside range”, followed by the full 12-bit value
  • 25-bit penalty for data not under control
  • compression performance will be worse than the Shannon entropy
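The truncated scheme can be sketched as follows; the stream, window, and escape-code details here are illustrative assumptions, not the actual firmware encoding:

```python
import heapq
import random
from collections import Counter
from itertools import count

def huffman_code_lengths(freqs):
    """Code length in bits per symbol for a Huffman tree built from `freqs`."""
    tiebreak = count()  # unique tiebreaker so dicts are never compared
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak),
                              {s: d + 1 for s, d in {**a, **b}.items()}))
    return heap[0][2]

random.seed(3)
# Toy residual stream: mostly small values (pedestal-like noise) plus a
# few large excursions standing in for signal pulses.
resid = [round(random.gauss(0, 4)) for _ in range(50_000)]
resid += [random.randint(-2000, 2000) for _ in range(500)]

ESC = "esc"  # escape symbol for values outside the Huffman window
freqs = Counter(v for v in resid if -16 <= v <= 16)
freqs[ESC] = sum(1 for v in resid if not -16 <= v <= 16)
lengths = huffman_code_lengths(freqs)

total_bits = 0
for v in resid:
    if -16 <= v <= 16:
        total_bits += lengths[v]
    else:
        total_bits += lengths[ESC] + 12  # escape code, then the raw 12-bit value
print(f"avg {total_bits / len(resid):.2f} bits per readout")
```

Only the [-16, 16] window gets Huffman codes; anything else pays the escape-plus-raw-value penalty, which is why performance sits slightly above the Shannon entropy when out-of-range values occur.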
SLIDE 11
Performance on Run #1287

  • Green = Shannon entropy
  • Blue = Channel+Event specific Huffman Trees
  • Red = one (random) Huffman Tree used for all data

[Figure: distributions of avg bits per ADC word, observed per channel, per event; panels: Encode Raw Values | Encode Differences | Encode wrt Linear Prediction]

  • Raw data requires lots of custom Huffman Trees
  • Encoding the diff wrt linear prediction works best (avg less than 4 bits per ADC word)

SLIDE 12

Performance Loss For Generic Trees

  • For the two target variables, we lose a fraction of a bit in performance
  • The linear predictor’s loss is better contained, i.e. its performance is more predictable

[Figure panels: Encode Differences | Encode wrt Linear Prediction]

SLIDE 13

Raw ADC Value Correlation Factors

  • Reproduced correlations observed by Tom in run 973
  • Data in run 1287 appears to be much less correlated
SLIDE 14

What’s different between the two runs?

  • Run 1287 has no correlation factors greater than ~10%
  • Run 973 has a significant tail in the RMS distribution
  • Possibly due to slow noise in the electronics?

[Figure: raw ADC channel-channel correlation factors, Run #1287 vs Run #973]

SLIDE 15

Example: Anti-correlation from slow noise

  • Waveform for first event, channels 1199 vs 1216
  • Causes a significant increase in RMS; the two channels are almost 100% anti-correlated
SLIDE 16

Comparison of variable RMS’s per channel:

  • In run 973 the overall behavior of the target variables is “better” than in run 1287
  • Expect run 973 to compress better than run 1287
SLIDE 17

Compression performance on run 1287 vs 973

  • Encoding Difference wrt previous ADC count
SLIDE 18

Compression Performance, run 1287 vs 973, cont’d

  • Encoding difference wrt Linear Prediction
SLIDE 19

Estimated Event Size

  • ProtoDUNE-SP TDR spec is to compress 230.4 MB of TDC data into 57.6 MB
  • Run compression test on 10 events, for both runs, record #bits used
  • Run 1287 conveniently reads out 1536 channels, 1/10th of full protoDUNE-SP
  • Run 973 has 2304 channels reading out, scale numbers by 1536/2304

Run Number    | Difference, Custom Trees | Difference, Single Tree | Linear Prediction, Custom Trees | Linear Prediction, Single Tree | Size wrt TDR Spec
1287          | 72.5 MB                  | 73.4 MB                 | 71.5 MB                         | 72.2 MB                        | +25%
0973 (scaled) | 70.3 MB                  | 71.1 MB                 | 70.3 MB                         | 70.4 MB                        | +22%

  • 25% larger event size than required by TDR spec
  • ADC readout encoded on avg in 3.75 bits (TDR spec is 3)
  • Compression factor 3.20 (TDR spec is 4)
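The three summary numbers are consistent with each other; a quick arithmetic check (72.0 MB is taken as a representative measured size from the table):

```python
# Check of the event-size arithmetic quoted above.
raw_mb, tdr_mb = 230.4, 57.6   # TDR: compress 230.4 MB of TDC data to 57.6 MB
measured_mb = 72.0             # representative measured compressed size, run 1287

assert abs(raw_mb / 4 - tdr_mb) < 1e-9   # TDR spec = compression factor 4
factor = raw_mb / measured_mb            # achieved compression factor
bits = 12 / factor                       # average bits per 12-bit ADC readout
excess = measured_mb / tdr_mb - 1        # size relative to TDR spec
print(f"factor {factor:.2f}, {bits:.2f} bits/readout, +{excess:.0%} wrt spec")
```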
SLIDE 20

Conclusions, so far

  • Evaluated compression performance on coldbox data
  • Found two good candidate variables for encoding
  • Evaluated encoding with “truncated” Huffman compression
  • Found approach to be generic and robust
  • ~1% penalty for sub-optimal encoding tree, even across events
  • Expect similar performance for a hard-coded common tree for all channels, all events (simplifies firmware implementation)

  • No performance loss in presence of “slow” noise
  • Estimate compressed event size to be 25% larger than TDR spec
  • No significant channel noise cross-correlation observed (in run #1287)
  • Likely not much to gain from combining information across channels
  • Found promising correlations with ADC counts earlier in the stream (further reduce avg RMS by 10%, i.e. 5% better compression)

SLIDE 21

Plans

  • Check cross-channel correlation between encoding variables
  • Re-check gzip performance on larger sample of events
  • Attempt to utilize information from earlier in the stream to further shrink the target variable RMS

  • Choose single, hardcoded compression tree
  • Optimize decompression algorithm for speed, report performance
  • Study per-event compression performance on a larger sample (e.g. the entire run 1287)

  • Try “gzip -9” on the compressed output
  • Any other tests?
  • Report back with final findings, document