SLIDE 1

On Accelerating Pair-HMM Computations in Programmable Hardware

Subho S. Banerjee, Mohamed el-Hadedy, Ching Y. Tan, Zbigniew T. Kalbarczyk, Steve Lumetta, Ravishankar K. Iyer

SLIDE 2

Contributions

  • Design and implementation of an accelerator to compute the Forward Algorithm (FA) on Pair Hidden Markov Models (PHMMs)
  • Demonstrate the value of the accelerator in supporting computational genomics workflows, where PHMMs are used to identify mutations in genomes
  • Optimize the accelerator architecture for both the algorithm and common input data characteristics
  • Reduce compute time: 14.85× higher throughput
  • Reduce operational cost (energy consumption): 147.49× higher throughput per unit energy

[Figure: throughput comparison of this paper against prior CPU, GPU, and other FPGA implementations [6], [10], [11], [12], [13]; citations are consistent with those in the paper.]
SLIDE 3

Forward Algorithm on Pair-HMM Models

  • PHMM models are Bayesian multinets that allow a probabilistic interpretation of the alignment problem
  • An alignment models the homology between two sequences via a series of mutations, insertions, and deletions of nucleotides
  • The FA computes a measure of statistical similarity by considering all alignments between the two sequences and summing over their probabilities to obtain the overall alignment probability
  • Can be described by the following recurrence equations

[Figure: PHMM as a graphical model; legend: symbol in Sequence 1, symbol in Sequence 2, hidden state, transitions between hidden states, plate, class node.]

The equations describe anti-diagonal data dependencies
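As a reference point, a minimal sketch of the standard Pair-HMM forward recursion over match (M), insertion (I), and deletion (D) states, following the common textbook parameterization; the paper's exact transition-probability names may differ:

```latex
f_M(i,j) = p(x_i, y_j)\,\bigl[a_{MM}\,f_M(i-1,j-1) + a_{IM}\,f_I(i-1,j-1) + a_{DM}\,f_D(i-1,j-1)\bigr]
f_I(i,j) = q(x_i)\,\bigl[a_{MI}\,f_M(i-1,j) + a_{II}\,f_I(i-1,j)\bigr]
f_D(i,j) = q(y_j)\,\bigl[a_{MD}\,f_M(i,j-1) + a_{DD}\,f_D(i,j-1)\bigr]
```

Every cell (i, j) depends only on cells (i-1, j-1), (i-1, j), and (i, j-1), so all cells on an anti-diagonal i + j = const are independent and can be computed in parallel.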

SLIDE 4

PHMM Forward Algorithm in Bioinformatics

  • PHMMs form the basis of the variant detection tool GATK HaplotypeCaller
  • Used to pick the n-best haplotypes by maximizing the likelihood of a read originating from each haplotype
  • The FA is used for this likelihood computation
  • Constitutes >70% of the runtime of the GATK HaplotypeCaller

  • Executes >3×10⁷ times for a standard clinical human dataset

Diagram from GATK Documentation: https://software.broadinstitute.org/gatk/documentation/article.php?id=4148

SLIDE 5

Shortcomings of Related Work

  • Past work explores the use of FPGAs/ASICs
  • Based on systolic array designs
  • Exploit anti-diagonal parallelism in the recurrence pattern
  • Common shortcoming: they are optimized only for the algorithm, not for input data characteristics
  • Input size variability can lead to idle cycles in systolic-array-based designs
  • The CDF shows a nearly uniform distribution of input sizes for small (<250) and large (>350) input strings in the computation on the NA12878 sample

SLIDE 6

Our Design

  • Design Goal: Optimize design to execute different input sizes in parallel
  • Expend chip budget on maximizing inter-task parallelism
  • Handle intra-task parallelism through aggressive pipelining

[Figure: accelerator block diagram. The host-accelerator interface uses IBM CAPI: the IBM-supplied POWER Service Layer (PSL), CAPI controller, internal input/output caches, serializers, and bus scheduler run at 250 MHz and feed an array of PEs. Each PE runs at 400 MHz and contains a quality-to-"a"-parameter lookup table, the PHMM data path, a scratchpad buffer, a memory scheduler, and an address generator; data flowing through it includes IEEE-754 encoded "a" parameters, ASCII-encoded quality parameters, input/output "f" metrics, and read/write addresses.]

  • Out-of-order issue to the PEs, along with the write-back logic, is encapsulated in the bus scheduling strategy
  • A specialized data path and schedule ensure there are no idle cycles during computation
  • The memory scheduler minimizes the scratchpad buffer space used to store intermediate results
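The out-of-order issue idea can be sketched in software as a greedy earliest-available-PE dispatcher. This is a toy model of the scheduling policy, not the hardware bus scheduler; task cycle counts and PE counts are made-up inputs:

```python
import heapq

def makespan(task_cycles, num_pes):
    """Greedy out-of-order dispatch: each task is issued to whichever
    PE becomes free earliest, so a mix of input sizes does not force
    idle cycles the way a fixed systolic schedule can."""
    free_at = [0] * num_pes          # cycle at which each PE is next free
    heapq.heapify(free_at)
    for cycles in task_cycles:
        t = heapq.heappop(free_at)   # earliest-available PE
        heapq.heappush(free_at, t + cycles)
    return max(free_at)
```

With a mix of short and long tasks, e.g. `makespan([10, 1, 1, 1, 1], 2)`, the short tasks pack onto one PE while the long task occupies the other.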

SLIDE 7

Processing Element (PE) Design

  • Goal: Schedule operations to minimize idle cycles
  • Schedule presented above has no idle cycles
  • Schedule temporally multiplexes the adders and multipliers
  • Entire pipeline is 8-deep (8 Operations in flight at a time)

[Figure: circuit representation of the computation datapath, and the Gantt chart of the corresponding operation schedule; two multipliers and one adder are temporally multiplexed over a 32-cycle window across operations A through L.]

SLIDE 8

Minimize Storage Requirements

  • Temporary scratchpad space is required to store intermediate outputs produced by the FA algorithm
  • We minimize this space by following the anti-diagonal recursion pattern of the FA algorithm
  • As a result, we need only O(L) space instead of the O(L²) space required to store the entire matrix

[Figure: the L×L recursion lattice from Equation 1 alongside the scratchpad memory state, with completed, remaining, stored, and current blocks. Memory is filled along the anti-diagonal of the recursion lattice, and computing a new cell "x" overwrites values that are no longer needed.]
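The O(L)-space idea can be illustrated with plain edit distance, which has the same (i-1, j-1), (i-1, j), (i, j-1) dependency pattern as the FA recursion. This is a software sketch of the sweep order, not the accelerator's scratchpad logic:

```python
def antidiag_edit_distance(a, b):
    """Sweep the DP lattice along anti-diagonals d = i + j, keeping only
    the two most recent diagonals instead of the full O(L^2) matrix."""
    m, n = len(a), len(b)
    prev2, prev1, cur = {}, {}, {}
    for d in range(m + n + 1):
        cur = {}
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                cur[i] = j                                           # first row
            elif j == 0:
                cur[i] = i                                           # first column
            else:
                cur[i] = min(prev2[i - 1] + (a[i - 1] != b[j - 1]),  # (i-1, j-1)
                             prev1[i - 1] + 1,                       # (i-1, j)
                             prev1[i] + 1)                           # (i, j-1)
        prev2, prev1 = prev1, cur                                    # drop diagonal d-2
    return cur[m]
```

Only diagonals d-1 and d-2 are live at any point, which is exactly why the hardware scratchpad can overwrite older values as the sweep advances.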

SLIDE 9

Dealing with Accelerator Invocation Overheads

  • Accelerator invocation overhead significantly reduces performance because of the OS overhead of initializing the accelerator
  • Solution: Amortize the cost of accelerator invocation by batching multiple invocations
  • The OS sends a batch of tasks to the accelerator; hardware distributes them across PEs
  • We demonstrate several approaches to selecting task batches:
  • Simple task batching
  • Common prefix memoization
  • FA on partially ordered strings

[Figure: mean latency (μs) per task versus task batch size (tasks), log-log axes.]

Task batching: Significant drop in mean latency of a PHMM task when OS overhead is amortized over large batches
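The amortization argument is simple arithmetic: a fixed invocation overhead is divided across the batch. The overhead and per-task figures below are illustrative placeholders, not the paper's measurements:

```python
def mean_task_latency(batch_size, invoke_overhead_us=1000.0, task_us=1.0):
    """Mean per-task latency when one accelerator invocation, carrying a
    fixed OS/setup overhead, dispatches `batch_size` tasks at once."""
    return (invoke_overhead_us + batch_size * task_us) / batch_size
```

As the batch grows, the mean latency approaches the pure per-task cost, matching the drop seen in the latency-versus-batch-size plot.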

SLIDE 10

Common Prefix Memoization

  • Similar inputs to the PHMM have common prefixes
  • The naïve algorithm recomputes the PHMM FA for all pairs of strings
  • Our solution:
  • Construct a prefix trie to find the longest common prefix in an input task batch
  • Compute the PHMM FA for the prefix only once
  • Saves compute time and host-accelerator bandwidth

[Figure: compressed trie over the input reads; (1) precompute the FA matrix for the common prefix string, (2) reuse the precomputed values for each read, (3) compute the last row against the haplotype.]

  • Example
  • (AAACGCA, AAACCGG); (AAACGCC, AAACCGG); (AAACGCG, AAACCGG)
  • The reads (Input 1) share a common prefix for a single haplotype (Input 2)
  • Construct a trie over Input 1
  • Precompute the matrix for the prefix on the accelerator
  • Compute the last row and column on the host CPU
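A CPU-only sketch of the memoization, with edit distance standing in for the FA kernel (the real scheme precomputes the prefix matrix on the accelerator and finishes on the host; `dp_rows` and its `base` argument are invented for this illustration):

```python
from os.path import commonprefix

def dp_rows(read, hap, prev=None, base=0):
    """Row-by-row DP; `prev`/`base` let us resume from a row that was
    already computed for a shared read prefix of length `base`."""
    if prev is None:
        prev = list(range(len(hap) + 1))             # row for the empty read
    for i, ch in enumerate(read, start=base + 1):
        cur = [i]
        for j, h in enumerate(hap, start=1):
            cur.append(min(prev[j - 1] + (ch != h),  # diagonal
                           prev[j] + 1,              # from above
                           cur[-1] + 1))             # from the left
        prev = cur
    return prev

def batch_with_prefix_reuse(reads, hap):
    """Compute the final DP row for each read, evaluating the rows for
    the batch's longest common prefix only once."""
    prefix = commonprefix(reads)
    shared = dp_rows(prefix, hap)                    # computed once per batch
    return [dp_rows(r[len(prefix):], hap, prev=shared, base=len(prefix))[-1]
            for r in reads]
```

For the slide's example batch, the rows for the shared prefix AAACGC are computed once and each read only extends them by its final symbol.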
SLIDE 11

FA on Partially Ordered Strings

  • Inputs to the PHMM accelerator in GATK are computed from DeBruijn graphs
  • Core idea:
  • Do not dispatch multiple paths from the DeBruijn graph as separate tasks
  • Dispatch the entire graph at the same time
  • We present an extension of the POA algorithm [1] for computing the FA between a single read and an entire DeBruijn graph

[Figure: traditional PHMM dependency lattice over a linear sequence versus the POA-based PHMM dependency lattice over a branching graph of symbols.]

[1] C. Lee, C. Grasso, and M. F. Sharlow, “Multiple sequence alignment using partial order graphs,” Bioinformatics, vol. 18, no. 3, pp. 452–464, Mar 2002.
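A sketch of the partial-order generalization, again with edit-distance costs standing in for FA probabilities: each graph node's column depends on the columns of all its predecessors, replacing the single (j-1) dependency of the linear recursion. The node/edge encoding here is invented for illustration:

```python
def dag_edit_distance(read, symbols, preds):
    """Align `read` against a DAG of per-node symbols given in topological
    order; preds[v] lists predecessor node indices (empty for sources).
    Returns the best global cost over all source-to-sink paths."""
    m = len(read)
    start = list(range(m + 1))                   # virtual empty-path column
    cols = [None] * len(symbols)
    for v, sym in enumerate(symbols):
        pcols = [cols[p] for p in preds[v]] or [start]
        c = [min(pc[0] for pc in pcols) + 1]     # empty read prefix: path length
        for i in range(1, m + 1):
            diag = min(pc[i - 1] for pc in pcols) + (read[i - 1] != sym)
            horiz = min(pc[i] for pc in pcols) + 1   # skip the node symbol
            vert = c[i - 1] + 1                      # skip the read symbol
            c.append(min(diag, horiz, vert))
        cols[v] = c
    sinks = [v for v in range(len(symbols)) if not any(v in p for p in preds)]
    return min(cols[v][m] for v in sinks)
```

On a linear chain this reduces to ordinary edit distance; on a branching graph, e.g. A to {C, G} to T, both reads ACT and AGT match some path exactly, so the whole graph can be dispatched as one task.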

SLIDE 12

Results: Performance Benchmarking

Performance of the accelerator on a PHMM micro-benchmark (compared against [12], the best GPU, and [13], the best FPGA, on a POWER8 system):

  • 14.85× higher throughput than an 8-core CPU baseline (that uses SIMD and multi-threading)
  • 147.49× improvement in throughput per unit of energy expended

Performance of the end-to-end GATK HaplotypeCaller application:

  • 3.287× speedup over the CPU-only baseline
  • 3.48× is the maximum attainable speedup according to Amdahl's Law
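The Amdahl's-law ceiling follows directly from the >70% kernel-runtime share quoted earlier; the exact fraction used below (~71.3%) is back-derived from the 3.48× limit, not taken from the paper:

```python
def amdahl_speedup(accel_fraction, kernel_speedup):
    """End-to-end speedup when `accel_fraction` of total runtime is
    accelerated by `kernel_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / kernel_speedup)
```

As `kernel_speedup` grows without bound, the expression tends to 1 / (1 - accel_fraction), so accelerating ~71.3% of the runtime caps the end-to-end speedup near 3.48× no matter how fast the kernel becomes.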

SLIDE 13

Results: On-Chip Resource Utilization

  • The use of logic slices is the limiting factor
  • Potential for larger gains in micro-benchmark performance for larger FPGAs
  • Memory bandwidth becomes a bottleneck [Simulation results in paper]
  • Negligible gains to be had in terms of end-to-end application performance
  • Already close to Amdahl’s law limit

[Figure: physical layout on a Xilinx XC7VX690T, showing the CAPI interface and 44 PEs. Breakdown: Clock 31%, Signals 31%, Logic 10%, BRAM 13%, DSP 8%, MMCM 4%, PCIe 4%.]
SLIDE 14

Conclusions

  • We demonstrate an FPGA-based accelerator for the PHMM FA algorithm that achieves:
  • 14.85× higher throughput than the CPU baseline
  • 147.49× higher throughput per unit energy expended
  • Immediate application in variant discovery and genotyping workloads
  • Takeaway: the design methodology of using input data characteristics, in addition to algorithmic characteristics, to specialize accelerator design can be applied more generally

SLIDE 15

Questions?

  • Code available at https://github.com/CSLDepend/PairHMM
  • Email authors at ssbaner2@illinois.edu
