NOC-BASED SUPPORT OF HETEROGENEOUS CACHE-COHERENCE MODELS FOR ACCELERATORS
ACM/IEEE NOCS 2018 Torino, Italy Davide Giri Paolo Mantovani Luca P. Carloni Columbia University New York, USA
ACM/IEEE NOCS 2018, TORINO, ITALY 2 October 4th, 2018
NVIDIA Parker, 2016. Xilinx Everest, 2018. Mobileye EyeQ5, 2020. Qualcomm Snapdragon 835, 2017.
Challenges
(PLM)
v IBM CAPI, [Y. Shao et al., MICRO ‘16], [M. J. Lyons et al., TACO ‘12] × ARM ACE-lite
[Y. Shao et al., MICRO ‘16]
accelerator execution
Protocol.
Coherence Models.
[L. P. Carloni, DAC ‘16]
components
accelerator applications on Linux SMP
simple interface
private cache (added for this work)
Support for run-time selection of coherence model through one I/O write to the configuration registers
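As a minimal sketch of what "one I/O write" could look like from software, the toy model below treats the tile's configuration registers as a dictionary. The register offset, the numeric encoding of the three models, and the class names are all hypothetical; the slides do not give the actual register layout.

```python
from enum import IntEnum


class CoherenceModel(IntEnum):
    # Hypothetical encoding; the real register layout is not given in the slides.
    NON_COHERENT = 0
    LLC_COHERENT = 1
    FULLY_COHERENT = 2


COHERENCE_REG = 0x20  # hypothetical register offset


class AcceleratorTile:
    """Toy model of an accelerator tile's memory-mapped configuration registers."""

    def __init__(self):
        # Assumed default: start in the non-coherent model.
        self.regs = {COHERENCE_REG: int(CoherenceModel.NON_COHERENT)}

    def io_write(self, offset, value):
        # A single register write is enough to switch the coherence model.
        self.regs[offset] = int(value)

    def coherence_model(self):
        return CoherenceModel(self.regs[COHERENCE_REG])


tile = AcceleratorTile()
tile.io_write(COHERENCE_REG, CoherenceModel.LLC_COHERENT)
```

In a real system the write would go through the accelerator's device driver before the accelerator is started; that part is outside this sketch.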
State \ Request | LLC-coherent Read | LLC-coherent Write
Invalid | Read memory; send data to requestor; go to Valid state | Read memory if misaligned; write to LLC; go to Valid state
Valid | Send data to requestor | Write to LLC
Shared
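The Invalid and Valid rows of the table above can be sketched as a small lookup function. This is an illustration only, not the authors' implementation: the function name and string labels are hypothetical, and the sketch assumes that a line in the Valid state stays Valid, since the table lists no transition for that row. The Shared row is cut off in the source, so it is deliberately left unhandled.

```python
def handle_llc_coherent(state, request, misaligned=False):
    """Return (actions, next_state) for an LLC-coherent request at the LLC."""
    if state == "Invalid":
        if request == "read":
            return ["read memory", "send data to requestor"], "Valid"
        if request == "write":
            # Memory is read first only when the write is misaligned.
            actions = ["read memory"] if misaligned else []
            return actions + ["write to LLC"], "Valid"
    elif state == "Valid":
        # Assumption: no state change is listed for Valid, so the line stays Valid.
        if request == "read":
            return ["send data to requestor"], "Valid"
        if request == "write":
            return ["write to LLC"], "Valid"
    # Other states (e.g. Shared) are not recoverable from the source table.
    raise NotImplementedError(f"no rule for state={state!r}, request={request!r}")
```

For example, `handle_llc_coherent("Invalid", "write", misaligned=True)` yields the actions "read memory" then "write to LLC" and moves the line to Valid.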
We designed 4 custom accelerators:
These accelerators represent a good mix of memory access pattern characteristics:
The CAD flow from GUI to bitstream is fully automated.
We deployed our SoC on an FPGA and executed applications on Linux SMP.
[Figure: speedup and DRAM accesses. NC = non-coherent, LLC = LLC-coherent. Axis labels range from 0.5x to 20x. Annotated cases: "LLC winning".]
[Figure: speedup and DRAM accesses. NC = non-coherent, LLC = LLC-coherent.]
Dataset size: 256KB to 512KB
The fully-coherent model can win for workloads whose data structures fit the accelerator’s private cache. No flush needed.
[Figure: speedup. NC = non-coherent, LLC = LLC-coherent, FC = fully-coherent. Annotated cases: "FC winning".]
number of active accelerators in the system.
DRAM.
[Figure: best model as a function of the workload's memory footprint. Footprint up to roughly the private cache size: fully-coherent model. Up to roughly the LLC size: LLC-coherent model. Larger footprints: non-coherent model.]
Hence, the best choice can vary at runtime.
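The rule of thumb on this slide (compare the workload's memory footprint against the private cache size and the LLC size) can be sketched as a small selection function. The thresholds are approximate in the talk ("~"), the sketch ignores the number of active accelerators, and the function name and the example cache sizes are hypothetical.

```python
def best_coherence_model(footprint, private_cache_size, llc_size):
    """Pick a coherence model from the workload footprint (all sizes in bytes)."""
    if footprint <= private_cache_size:
        # Data fits the accelerator's private cache: no flush needed.
        return "fully-coherent"
    if footprint <= llc_size:
        return "LLC-coherent"
    return "non-coherent"


KB = 1024
PRIVATE_CACHE = 32 * KB  # hypothetical private cache size
LLC = 1024 * KB          # hypothetical LLC size

choices = [
    best_coherence_model(size, PRIVATE_CACHE, LLC)
    for size in (16 * KB, 256 * KB, 4096 * KB)
]
```

Because the footprint and the system load can change between invocations, a selector like this would be re-evaluated before each accelerator run.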
application to offload a task.
without interrupting the processor cores
we ensure that there exists only a single copy of the data.
both in DRAM and in the LLC.
implemented through HLS.
the number of sharers and
which caches to flush. For this work: