S5302 - OPTIMIZATION OF GPU-BASED SIGNAL PROCESSING OF RADIO TELESCOPES - VINAY DESHPANDE




SLIDE 1

VINAY DESHPANDE, DEVELOPER TECHNOLOGY, NVIDIA

S5302 - OPTIMIZATION OF GPU-BASED SIGNAL PROCESSING OF RADIO TELESCOPES

HARSHAVARDHAN REDDY, ENGINEER, NCRA

SLIDE 2

INTRODUCTION

NCRA – National Centre for Radio Astrophysics

Pune, India. http://ncra.tifr.res.in/ncra

GMRT – Giant Metrewave Radio Telescope

Situated at Khodad near Pune, India. http://gmrt.ncra.tifr.res.in/
Consists of 30 dish antennas, 45 m diameter each, spread over 25 km
Used by radio astronomers worldwide

SLIDE 3

uGMRT EFFORT

The GMRT backend has been upgraded recently

The “uGMRT”

Key change: Bandwidth 32 -> 200/400 MHz

Prototype system with 16 antennas – 8 compute nodes up and running
GPUs upgraded from Fermi to Kepler

Optimizing software backend

For better science, less power and reduced cost
Ongoing work involving NVIDIA and NCRA teams

Contribution towards SKA

SLIDE 4

GMRT BACKEND

Each antenna has two polarizations

If the antenna is operating at 200 MHz bandwidth

Sampling frequency needs to be 400 MHz
Each polarization produces 400 million samples/sec
800 million samples per antenna per sec
Total 800 M * 32 = 25.6 G samples/sec (30 antennas plus 2 additional signal sources for debug and test)

Signal processing backend needs to process all these samples in real-time
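The sample-rate figures on this slide can be checked with a few lines of arithmetic. This is only a sketch of the slide's numbers; the constant names are made up for illustration, and the byte rate assumes the 8-bit samples shown in the diagram on this slide.

```python
# Back-of-the-envelope input rate for the backend, using the slide's figures.
BANDWIDTH_HZ = 200e6        # operating bandwidth
POLARIZATIONS = 2           # per antenna
SIGNAL_SOURCES = 30 + 2     # 30 antennas plus 2 debug/test sources
BYTES_PER_SAMPLE = 1        # 8-bit samples

sampling_rate = 2 * BANDWIDTH_HZ                  # Nyquist rate: 400 MHz
per_antenna = sampling_rate * POLARIZATIONS       # 800 M samples/sec
total_samples = per_antenna * SIGNAL_SOURCES      # 25.6 G samples/sec
total_bytes = total_samples * BYTES_PER_SAMPLE    # 25.6 GB/sec of raw input

print(f"{total_samples / 1e9:.1f} G samples/sec")
```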

[Diagram: Antenna 1, two polarizations. Each polarization passes through A2D + more: bandwidth 200 MHz, sampling 400 MHz, 400 M samples/sec, 8-bit samples]

SLIDE 5

BACKEND: COMPUTE INFRASTRUCTURE

Samples from two antennas are fed to a single compute node

The number could change for other telescopes
Can be decided by I/O requirements

16 compute nodes

Connected over high-speed network

Each compute node has

One CPU
One or two GPUs

[Diagram: Antenna 1 and Antenna 2 feed Compute node 1; Antenna 3 and Antenna 4 feed Compute node 2; …]

SLIDE 6

GPU CORRELATOR

Operations involved

Data format conversion (Unpacking)
Discrete Fourier Transform (DFT)
Phase Rotation
Multiply-And-Accumulate (MAC)

SLIDE 7
1. UNPACKING

Converting each sample takes an 8-bit read (integer) and a 32-bit write (floating point)

Dominated by I/O

Unpacking is immediately followed by DFT

The 32-bit data per sample needs to be read again
This read-after-write round trip can be saved

cuFFT callbacks were introduced in CUDA 6.5

cuFFT callbacks can be used to combine unpacking with the FFT operation
Result: overhead of unpacking is reduced by 25%
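The idea of fusing unpacking into the transform can be sketched on the CPU. This is not cuFFT code: `dft` is a naive stand-in whose `load` argument plays the role of a load callback, and treating the 8-bit samples as signed is an assumption. The byte counts only cover the input side of the transform.

```python
import cmath

def unpack_one(byte):
    """Convert one 8-bit sample to float (signed interpretation is an assumption)."""
    return float(byte - 256 if byte > 127 else byte)

def dft(load, n):
    """Naive O(n^2) DFT that fetches input element i through load(i)."""
    return [sum(load(i) * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            for k in range(n)]

raw = bytes([0, 1, 127, 128, 255, 3, 250, 7])

# Two-pass version: materialize 32-bit floats, then transform them.
unpacked = [unpack_one(b) for b in raw]
two_pass = dft(lambda i: unpacked[i], len(raw))

# Fused version: the transform converts each raw byte as it loads it.
fused = dft(lambda i: unpack_one(raw[i]), len(raw))

# Input-side memory traffic per sample, in bytes.
TWO_PASS_BYTES = 1 + 4 + 4   # unpack reads 1, writes 4; transform reads 4
FUSED_BYTES = 1              # transform reads the raw byte directly
```

Both paths produce the same spectrum; the fused path simply never writes or re-reads the intermediate float buffer.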

SLIDE 8
2. DISCRETE FOURIER TRANSFORM

DFT is implemented using cuFFT library APIs

cuFFT mode selection

R2C
C2C – Requires additional 2x2 Butterfly kernel

Several possible combinations of input and output callbacks

Unpacking, Phase Rotation, 2x2 butterfly

       No callbacks   Unpacking callback     Phase Rotation   2x2 Butterfly callback
R2C    Tested         Tested, second best    Tested           NA
C2C    Tested         Tested, best           NA               Tested
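The slides do not define the 2x2 butterfly step. One standard technique it may correspond to (this is an assumption, not something the slides confirm) is packing two real signals, such as the two polarizations, into the real and imaginary parts of one complex transform and then separating the two spectra with a sum/difference step. A minimal CPU sketch with a naive DFT standing in for the C2C transform:

```python
import cmath

def dft(x):
    """Naive O(n^2) complex DFT, standing in for a C2C FFT."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            for k in range(n)]

def split_two_real(z_spec):
    """Recover the spectra of real signals a and b from the DFT of z = a + 1j*b."""
    n = len(z_spec)
    a_spec, b_spec = [], []
    for k in range(n):
        zk = z_spec[k]
        zc = z_spec[(n - k) % n].conjugate()   # mirrored bin, conjugated
        a_spec.append((zk + zc) / 2)           # sum branch
        b_spec.append((zk - zc) / 2j)          # difference branch
    return a_spec, b_spec

a = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]   # hypothetical polarization 1
b = [0.0, 1.0, 4.0, -3.0, 2.0, 2.5, 1.0, -1.0]   # hypothetical polarization 2

packed = [complex(x, y) for x, y in zip(a, b)]
a_spec, b_spec = split_two_real(dft(packed))
```

One complex transform of length N replaces two real transforms of length N, at the cost of the extra separation pass.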

SLIDE 9
3. PHASE ROTATION

Essentially multiplication by a constant

Constant depends on antenna, frequency channel and time slice

The kernel computes each constant on-the-fly

Lots of math operations

Redundancies in computation identified and removed

Performance improved by 10%

Switching from CUDA 6.0 to 6.5 boosted performance by 50%
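The slides do not say which redundancies were removed. As an illustration of the kind of saving that is possible, the sketch below assumes the constant is a delay phase of the form exp(-2πi·f·τ) on evenly spaced channels; the formula, function names and parameter values are all assumptions. The naive version evaluates a complex exponential per channel, while the second version evaluates two per antenna and time slice and steps through the channels by multiplication.

```python
import cmath

def phases_naive(f0, df, n_chan, tau):
    """One complex exponential per channel."""
    return [cmath.exp(-2j * cmath.pi * (f0 + k * df) * tau) for k in range(n_chan)]

def phases_hoisted(f0, df, n_chan, tau):
    """Two exponentials in total; each channel costs one complex multiply."""
    current = cmath.exp(-2j * cmath.pi * f0 * tau)   # phase of channel 0
    step = cmath.exp(-2j * cmath.pi * df * tau)      # phase increment per channel
    out = []
    for _ in range(n_chan):
        out.append(current)
        current *= step
    return out

# Hypothetical values: 300 MHz start, 64 channels across 200 MHz, 2.5 ns delay.
naive = phases_naive(300e6, 200e6 / 64, 64, 2.5e-9)
hoisted = phases_hoisted(300e6, 200e6 / 64, 64, 2.5e-9)
max_error = max(abs(x - y) for x, y in zip(naive, hoisted))
```

The recurrence accumulates rounding error with channel count, so a real kernel would need to bound the run length or re-seed periodically.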

SLIDE 10
4. MAC

The most costly operation

Cost grows in proportion to (number of antennas)^2
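The quadratic growth comes from forming a product for every antenna pair. A minimal CPU reference for the MAC step, assuming the usual correlator form (multiply one antenna's spectrum by the complex conjugate of another's and accumulate over time, per frequency channel); the data layout, the inclusion of autocorrelations and the function names are assumptions for illustration:

```python
def correlate(spectra):
    """spectra[a][t][f] -> {(i, j): [accumulated product per channel]} for i <= j."""
    n_ant = len(spectra)
    n_time = len(spectra[0])
    n_chan = len(spectra[0][0])
    vis = {}
    for i in range(n_ant):
        for j in range(i, n_ant):              # each pair once, autos included
            acc = [0j] * n_chan
            for t in range(n_time):
                for f in range(n_chan):
                    acc[f] += spectra[i][t][f] * spectra[j][t][f].conjugate()
            vis[(i, j)] = acc
    return vis

def n_pairs(n_ant):
    """Number of pairs with i <= j; grows as n_ant^2 / 2."""
    return n_ant * (n_ant + 1) // 2

# 3 antennas, 2 time samples, 1 channel (made-up values).
spectra = [[[1 + 1j], [2 + 0j]],
           [[0 + 1j], [1 + 0j]],
           [[1 + 0j], [0 + 0j]]]
vis = correlate(spectra)
```

With 32 inputs this gives 528 pairs per channel, which is why the MAC dominates as the antenna count rises.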

Choices for MAC routines

GMRT – original routine
xGPU – Mike Clark’s highly optimized MAC library

xGPU performs better in almost all cases

More so for higher numbers of antennas

Side effect – Input/output reordering is required

(antenna, time, frequency) -> (time, frequency, antenna)
Shared memory based implementation achieves bandwidth of 128 GB/s on K20
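The reordering is a pure index permutation. A CPU sketch over flat arrays, assuming row-major layouts for both orders (the function name and the layout details are assumptions; the GPU version described on the slide stages tiles through shared memory, which this sketch does not model):

```python
def reorder_atf_to_tfa(src, n_ant, n_time, n_chan):
    """Permute a flat (antenna, time, frequency) array to (time, frequency, antenna)."""
    dst = [None] * len(src)
    for a in range(n_ant):
        for t in range(n_time):
            for f in range(n_chan):
                src_index = (a * n_time + t) * n_chan + f
                dst_index = (t * n_chan + f) * n_ant + a
                dst[dst_index] = src[src_index]
    return dst

# Label every element with its own coordinates so the result is easy to check.
n_ant, n_time, n_chan = 2, 2, 3
src = [(a, t, f) for a in range(n_ant) for t in range(n_time) for f in range(n_chan)]
dst = reorder_atf_to_tfa(src, n_ant, n_time, n_chan)
```

In the source order the frequency index is contiguous, and in the destination order the antenna index is, so a straightforward copy is strided on one side.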

SLIDE 11

PERFORMANCE OF MAC

[Chart: xGPU vs GMRT. Time in ms at sizes 1K to 32K. Series: GMRT MAC, xGPU MAC]

xGPU performs ~35% better than GMRT

SLIDE 12

MAC KERNELS ON K40

[Chart: Performance of GMRT MAC, K20 vs K40. Time in ms at sizes 1K to 16K. Series: K20, K40]

[Chart: Performance of xGPU MAC, K20 vs K40. Time in ms at sizes 1K to 32K. Series: K20, K40]

25-27% improvements
~18% improvements

SLIDE 13

OVERALL RESULTS

SLIDE 14

OVERALL IMPROVEMENTS

[Chart: Overall improvement for 16K channels on single K20. Time in ms for Unpacking, cuFFT, Phase Rotation, MAC and Total. Series: Baseline, Optimized, Real-Time]

25% faster

SLIDE 15

OVERALL IMPROVEMENTS

[Chart: Optimized Correlator Performance. Time in ms at sizes 1K to 32K. Series: Baseline, Optimized]

20-25% better performance

SLIDE 16

RFI REJECTION

SLIDE 17

RFI REJECTION

RFI – Radio Frequency Interference
RFI needs to be removed in real time
GMRT backend has time-domain RFI filtering implemented

Desirable to have RFI filtering in both domains

[Diagram: Correlator, RFI filter (time-domain), RFI filter (frequency-domain)]

SLIDE 18

RFI REJECTION CODE

GMRT implements Median Absolute Deviation (MAD) based filtering

MAD is a robust estimator

Stream of input data is divided into fixed-width windows

For each window

First MAD is computed
Then threshold filter is applied

All the windows can be processed concurrently

GMRT has two implementations of the algorithm

Optimized for small window – (< 1K)
Optimized for large window – (> 4K)
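The per-window procedure above can be sketched as a CPU reference. The factor 1.4826 is the standard scale that makes MAD comparable to a Gaussian standard deviation; the 3-sigma threshold, the choice to replace flagged samples with the window median, and the function names are assumptions, since the slides do not give them.

```python
from statistics import median

def mad_filter(window, n_sigma=3.0):
    """Replace samples that deviate from the window median by more than n_sigma robust sigmas."""
    med = median(window)
    mad = median(abs(x - med) for x in window)
    threshold = n_sigma * 1.4826 * mad
    return [med if abs(x - med) > threshold else x for x in window]

def filter_stream(samples, window_size):
    """Split the stream into fixed-width windows and filter each one independently."""
    out = []
    for start in range(0, len(samples), window_size):
        out.extend(mad_filter(samples[start:start + window_size]))
    return out

# One spike of 100 in an otherwise quiet window.
cleaned = mad_filter([1, 2, 3, 2, 1, 2, 100, 2])
```

Because each window depends only on its own samples, the loop in `filter_stream` is the part that maps naturally onto concurrent GPU work. Note that a window with MAD of zero flags every sample that differs from the median.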

SLIDE 19

IMPROVEMENTS IN RFI FILTERING

Implicit histogram computation

Second histogram is computed from first instead of re-fetching samples

Integers instead of floating point numbers

𝑁𝐵𝐸 = (𝑁𝐵𝐸1 + 𝑁𝐵𝐸2) / 2

Helps in removing calls to ceil, floor etc.

Reduced branching

8 if-else blocks reduced to 4

Reduction in launch latency overhead

Launching a smaller number of bigger kernels
Side effect of combining kernels – temporary storage avoided

Single version for all window sizes
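Two of the ideas above, deriving the second histogram from the first and staying in integer arithmetic, can be sketched for 8-bit samples as follows. This is an illustration rather than the GMRT kernel: it assumes sample values in 0..255 and uses the lower median for even-sized windows so that every quantity stays an integer; all names are hypothetical.

```python
LEVELS = 256  # distinct values of an 8-bit sample (values assumed to be 0..255)

def kth_from_histogram(hist, k):
    """Return the k-th smallest value (0-based) by walking the histogram."""
    seen = 0
    for value, count in enumerate(hist):
        seen += count
        if seen > k:
            return value
    raise ValueError("k is out of range for this histogram")

def mad_from_histogram(window):
    """Integer median and MAD of a window, touching the samples only once."""
    hist = [0] * LEVELS
    for x in window:                      # the only pass over the samples
        hist[x] += 1
    k = (len(window) - 1) // 2            # lower median position
    med = kth_from_histogram(hist, k)
    dev_hist = [0] * LEVELS               # histogram of |x - med|
    for value, count in enumerate(hist):  # built from hist, not from the samples
        dev_hist[abs(value - med)] += count
    mad = kth_from_histogram(dev_hist, k)
    return med, mad

med, mad = mad_from_histogram([10, 12, 11, 10, 200, 11, 10, 13])
```

The second histogram costs a fixed 256 steps regardless of window size, and no floating-point rounding calls are needed anywhere.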

SLIDE 20

RFI FILTERING RESULTS

[Chart: RFI Rejection performance at small window. Time in ms vs window size, 0.5K to 4K. Series: Baseline small window, Optimized]

3-20x faster

SLIDE 21

RFI FILTERING RESULTS

[Chart: RFI Rejection performance at large window. Time in ms at 4K to 32K. Series: Baseline large window, Optimized]

2-10x faster

SLIDE 22

REFERENCES

S3225 - Powering Real-time Radio Astronomy Signal Processing with GPUs
GTC 2013, Harshavardhan Reddy, Pradeep Gupta

S4538 - Real-Time RFI Rejection Techniques for the GMRT Using GPUs
GTC 2014, Rohini Joshi

NCRA-NVIDIA collaboration work report phase 1 and phase 2

SLIDE 23

ACKNOWLEDGEMENT

Team NCRA

  • Dr. Yashwant Gupta
  • Harshavardhan Reddy
  • Rohini Joshi
  • Niruj

SLIDE 24

THANK YOU