SLIDE 1

On Accelerating Pair-HMM Computations in Programmable Hardware

Subho S. Banerjee, Mohamed el-Hadedy, Ching Y. Tan, Zbigniew T. Kalbarczyk, Steve Lumetta, Ravishankar K. Iyer

SLIDE 2

Contributions

  • Design and implementation of an accelerator to compute the Forward Algorithm (FA) on Pair Hidden Markov Models (PHMMs)
  • Demonstrate the value of the accelerator in supporting computational genomics workflows, where PHMMs are used to identify mutations in genomes
  • Optimize the accelerator architecture for both the algorithm and common input data characteristics
  • Reduce compute time: 14.85× higher throughput
  • Reduce operational cost (energy consumption): 147.49× higher throughput per unit energy

[Figure: throughput comparison of this paper against prior CPU, GPU, and other FPGA implementations [6], [10], [11], [12], [13]; citations are consistent with those in the paper.]
SLIDE 3

Forward Algorithm on Pair-HMM Models

  • PHMM models are Bayesian multinets that allow a probabilistic interpretation of the alignment problem
  • An alignment models the homology between two sequences via a series of mutations, insertions, and deletions of nucleotides
  • The FA computes a measure of statistical similarity by considering all alignments between the two sequences and summing over their probabilities to obtain the overall alignment probability
  • Can be described by the following recurrence equations

[Figure: PHMM as a graphical model; legend: symbol in Sequence 1, symbol in Sequence 2, hidden state, transitions between hidden states, plate, class node.]

The equations describe anti-diagonal data dependencies
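As a reference point, a minimal sketch of the standard Pair-HMM forward recursion over match (M), insertion (I), and deletion (D) states, following the common textbook parameterization; the paper's exact transition-probability names may differ:

```latex
f_M(i,j) = p(x_i, y_j)\,\bigl[a_{MM}\,f_M(i-1,j-1) + a_{IM}\,f_I(i-1,j-1) + a_{DM}\,f_D(i-1,j-1)\bigr]
f_I(i,j) = q(x_i)\,\bigl[a_{MI}\,f_M(i-1,j) + a_{II}\,f_I(i-1,j)\bigr]
f_D(i,j) = q(y_j)\,\bigl[a_{MD}\,f_M(i,j-1) + a_{DD}\,f_D(i,j-1)\bigr]
```

Every cell (i, j) depends only on cells (i-1, j-1), (i-1, j), and (i, j-1), so all cells on an anti-diagonal i + j = const are independent and can be computed in parallel.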

SLIDE 4

PHMM Forward Algorithm in Bioinformatics

  • PHMMs form the basis of the variant detection tool GATK HaplotypeCaller
  • Used to pick the n-best haplotypes by maximizing the likelihood of a read originating from each haplotype
  • The FA is used for this likelihood computation
  • Constitutes >70% of the runtime of the GATK HaplotypeCaller

  • Executes >3×10⁷ times for a standard clinical human dataset

Diagram from GATK Documentation: https://software.broadinstitute.org/gatk/documentation/article.php?id=4148

SLIDE 5

Shortcomings of Related Work

  • Past work explores the use of FPGAs/ASICs
  • Based on systolic array designs
  • Exploit anti-diagonal parallelism in the recurrence pattern
  • Common shortcoming: they are optimized only for the algorithm, not for input data characteristics
  • Input size variability can lead to idle cycles in systolic-array-based designs
  • The CDF shows a nearly uniform distribution of input sizes for small (<250) and large (>350) input strings in the computation on the NA12878 sample

SLIDE 6

Our Design

  • Design Goal: Optimize design to execute different input sizes in parallel
  • Expend chip budget on maximizing inter-task parallelism
  • Handle intra-task parallelism through aggressive pipelining

[Figure: accelerator block diagram. The host-accelerator interface uses IBM CAPI: the IBM-supplied POWER Service Layer (PSL), CAPI controller, internal input/output caches, serializers, and bus scheduler run at 250 MHz and feed an array of PEs. Each PE runs at 400 MHz and contains a quality-to-"a"-parameter lookup table, the PHMM data path, a scratchpad buffer, a memory scheduler, and an address generator; data flowing through it includes IEEE-754 encoded "a" parameters, ASCII-encoded quality parameters, input/output "f" metrics, and read/write addresses.]

  • Out-of-order issue to the PEs, along with the write-back logic, is encapsulated in the bus scheduling strategy
  • A specialized data path and schedule ensure there are no idle cycles during computation
  • The memory scheduler minimizes the scratchpad buffer space used to store intermediate results
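The out-of-order issue idea can be sketched in software as a greedy earliest-available-PE dispatcher. This is a toy model of the scheduling policy, not the hardware bus scheduler; task cycle counts and PE counts are made-up inputs:

```python
import heapq

def makespan(task_cycles, num_pes):
    """Greedy out-of-order dispatch: each task is issued to whichever
    PE becomes free earliest, so a mix of input sizes does not force
    idle cycles the way a fixed systolic schedule can."""
    free_at = [0] * num_pes          # cycle at which each PE is next free
    heapq.heapify(free_at)
    for cycles in task_cycles:
        t = heapq.heappop(free_at)   # earliest-available PE
        heapq.heappush(free_at, t + cycles)
    return max(free_at)
```

With a mix of short and long tasks, e.g. `makespan([10, 1, 1, 1, 1], 2)`, the short tasks pack onto one PE while the long task occupies the other.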

SLIDE 7

Processing Element (PE) Design

  • Goal: Schedule operations to minimize idle cycles
  • Schedule presented above has no idle cycles
  • Schedule temporally multiplexes the adders and multipliers
  • Entire pipeline is 8-deep (8 Operations in flight at a time)

[Figure: circuit representation of the computation datapath, and the Gantt chart of the corresponding operation schedule; two multipliers and one adder are temporally multiplexed over a 32-cycle window across operations A through L.]

SLIDE 8

Minimize Storage Requirements

  • Temporary scratchpad space is required to store intermediate outputs produced by the FA algorithm
  • We minimize this space by following the anti-diagonal recursion pattern of the FA algorithm
  • As a result, we need only O(L) space instead of the O(L²) space required to store the entire matrix

[Figure: the L×L recursion lattice from Equation 1 alongside the scratchpad memory state, with completed, remaining, stored, and current blocks. Memory is filled along the anti-diagonal of the recursion lattice, and computing a new cell "x" overwrites values that are no longer needed.]
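The O(L)-space idea can be illustrated with plain edit distance, which has the same (i-1, j-1), (i-1, j), (i, j-1) dependency pattern as the FA recursion. This is a software sketch of the sweep order, not the accelerator's scratchpad logic:

```python
def antidiag_edit_distance(a, b):
    """Sweep the DP lattice along anti-diagonals d = i + j, keeping only
    the two most recent diagonals instead of the full O(L^2) matrix."""
    m, n = len(a), len(b)
    prev2, prev1, cur = {}, {}, {}
    for d in range(m + n + 1):
        cur = {}
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                cur[i] = j                                           # first row
            elif j == 0:
                cur[i] = i                                           # first column
            else:
                cur[i] = min(prev2[i - 1] + (a[i - 1] != b[j - 1]),  # (i-1, j-1)
                             prev1[i - 1] + 1,                       # (i-1, j)
                             prev1[i] + 1)                           # (i, j-1)
        prev2, prev1 = prev1, cur                                    # drop diagonal d-2
    return cur[m]
```

Only diagonals d-1 and d-2 are live at any point, which is exactly why the hardware scratchpad can overwrite older values as the sweep advances.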

SLIDE 9

Dealing with Accelerator Invocation Overheads

  • Accelerator invocation overhead significantly reduces performance because of the OS overhead of initializing the accelerator
  • Solution: Amortize the cost of accelerator invocation by batching multiple invocations
  • The OS sends a batch of tasks to the accelerator; hardware distributes them across PEs
  • We demonstrate several approaches to selecting task batches:
  • Simple task batching
  • Common prefix memoization
  • FA on partially ordered strings

[Figure: mean latency (μs) per task versus task batch size (tasks), log-log axes.]

Task batching: Significant drop in mean latency of a PHMM task when OS overhead is amortized over large batches
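The amortization argument is simple arithmetic: a fixed invocation overhead is divided across the batch. The overhead and per-task figures below are illustrative placeholders, not the paper's measurements:

```python
def mean_task_latency(batch_size, invoke_overhead_us=1000.0, task_us=1.0):
    """Mean per-task latency when one accelerator invocation, carrying a
    fixed OS/setup overhead, dispatches `batch_size` tasks at once."""
    return (invoke_overhead_us + batch_size * task_us) / batch_size
```

As the batch grows, the mean latency approaches the pure per-task cost, matching the drop seen in the latency-versus-batch-size plot.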

SLIDE 10

Common Prefix Memoization

  • Similar inputs to the PHMM have common prefixes
  • The naïve algorithm recomputes the PHMM FA for all pairs of strings
  • Our solution:
  • Construct a prefix trie to find the longest common prefix in an input task batch
  • Compute the PHMM FA for the prefix only once
  • Saves compute time and host-accelerator bandwidth

[Figure: compressed trie over the input reads; (1) precompute the FA matrix for the common prefix string, (2) reuse the precomputed values for each read, (3) compute the last row against the haplotype.]

  • Example
  • (AAACGCA, AAACCGG); (AAACGCC, AAACCGG); (AAACGCG, AAACCGG)
  • The reads (Input 1) share a common prefix for a single haplotype (Input 2)
  • Construct a trie over Input 1
  • Precompute the matrix for the prefix on the accelerator
  • Compute the last row and column on the host CPU
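A CPU-only sketch of the memoization, with edit distance standing in for the FA kernel (the real scheme precomputes the prefix matrix on the accelerator and finishes on the host; `dp_rows` and its `base` argument are invented for this illustration):

```python
from os.path import commonprefix

def dp_rows(read, hap, prev=None, base=0):
    """Row-by-row DP; `prev`/`base` let us resume from a row that was
    already computed for a shared read prefix of length `base`."""
    if prev is None:
        prev = list(range(len(hap) + 1))             # row for the empty read
    for i, ch in enumerate(read, start=base + 1):
        cur = [i]
        for j, h in enumerate(hap, start=1):
            cur.append(min(prev[j - 1] + (ch != h),  # diagonal
                           prev[j] + 1,              # from above
                           cur[-1] + 1))             # from the left
        prev = cur
    return prev

def batch_with_prefix_reuse(reads, hap):
    """Compute the final DP row for each read, evaluating the rows for
    the batch's longest common prefix only once."""
    prefix = commonprefix(reads)
    shared = dp_rows(prefix, hap)                    # computed once per batch
    return [dp_rows(r[len(prefix):], hap, prev=shared, base=len(prefix))[-1]
            for r in reads]
```

For the slide's example batch, the rows for the shared prefix AAACGC are computed once and each read only extends them by its final symbol.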
SLIDE 11

FA on Partially Ordered Strings

  • Inputs to the PHMM accelerator in GATK are computed from DeBruijn graphs
  • Core idea:
  • Do not dispatch multiple paths from the DeBruijn graph as separate tasks
  • Dispatch the entire graph at the same time
  • We present an extension of the POA algorithm [1] for computing the FA between a single read and an entire DeBruijn graph

[Figure: traditional PHMM dependency lattice over a linear sequence versus the POA-based PHMM dependency lattice over a branching graph of symbols.]

[1] C. Lee, C. Grasso, and M. F. Sharlow, “Multiple sequence alignment using partial order graphs,” Bioinformatics, vol. 18, no. 3, pp. 452–464, Mar 2002.
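A sketch of the partial-order generalization, again with edit-distance costs standing in for FA probabilities: each graph node's column depends on the columns of all its predecessors, replacing the single (j-1) dependency of the linear recursion. The node/edge encoding here is invented for illustration:

```python
def dag_edit_distance(read, symbols, preds):
    """Align `read` against a DAG of per-node symbols given in topological
    order; preds[v] lists predecessor node indices (empty for sources).
    Returns the best global cost over all source-to-sink paths."""
    m = len(read)
    start = list(range(m + 1))                   # virtual empty-path column
    cols = [None] * len(symbols)
    for v, sym in enumerate(symbols):
        pcols = [cols[p] for p in preds[v]] or [start]
        c = [min(pc[0] for pc in pcols) + 1]     # empty read prefix: path length
        for i in range(1, m + 1):
            diag = min(pc[i - 1] for pc in pcols) + (read[i - 1] != sym)
            horiz = min(pc[i] for pc in pcols) + 1   # skip the node symbol
            vert = c[i - 1] + 1                      # skip the read symbol
            c.append(min(diag, horiz, vert))
        cols[v] = c
    sinks = [v for v in range(len(symbols)) if not any(v in p for p in preds)]
    return min(cols[v][m] for v in sinks)
```

On a linear chain this reduces to ordinary edit distance; on a branching graph, e.g. A to {C, G} to T, both reads ACT and AGT match some path exactly, so the whole graph can be dispatched as one task.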

SLIDE 12

Results: Performance Benchmarking

Performance of the accelerator on a PHMM micro-benchmark (compared against [12], the best GPU, and [13], the best FPGA, on a POWER8 system):

  • 14.85× higher throughput than an 8-core CPU baseline (that uses SIMD and multi-threading)
  • 147.49× improvement in throughput per unit of energy expended

Performance of the end-to-end GATK HaplotypeCaller application:

  • 3.287× speedup over the CPU-only baseline
  • 3.48× is the maximum attainable speedup according to Amdahl's Law
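The Amdahl's-law ceiling follows directly from the >70% kernel-runtime share quoted earlier; the exact fraction used below (~71.3%) is back-derived from the 3.48× limit, not taken from the paper:

```python
def amdahl_speedup(accel_fraction, kernel_speedup):
    """End-to-end speedup when `accel_fraction` of total runtime is
    accelerated by `kernel_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / kernel_speedup)
```

As `kernel_speedup` grows without bound, the expression tends to 1 / (1 - accel_fraction), so accelerating ~71.3% of the runtime caps the end-to-end speedup near 3.48× no matter how fast the kernel becomes.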

SLIDE 13

Results: On-Chip Resource Utilization

  • The use of logic slices is the limiting factor
  • Potential for larger gains in micro-benchmark performance for larger FPGAs
  • Memory bandwidth becomes a bottleneck [Simulation results in paper]
  • Negligible gains to be had in terms of end-to-end application performance
  • Already close to Amdahl’s law limit

[Figure: physical layout on a Xilinx XC7VX690T, showing the CAPI interface and 44 PEs. Breakdown: Clock 31%, Signals 31%, Logic 10%, BRAM 13%, DSP 8%, MMCM 4%, PCIe 4%.]
SLIDE 14

Conclusions

  • We demonstrate an FPGA-based accelerator for the PHMM FA algorithm that achieves:
  • 14.85× higher throughput than the CPU baseline
  • 147.49× higher throughput per unit energy expended
  • Immediate application in variant discovery and genotyping workloads
  • Takeaway: the design methodology of using input data characteristics, in addition to algorithmic characteristics, to specialize accelerator design can be applied more generally

SLIDE 15

Questions?

  • Code available at https://github.com/CSLDepend/PairHMM
  • Email authors at ssbaner2@illinois.edu
