Tsinghua University Introduction Deep learning - PowerPoint PPT Presentation

Deep500 BOF 2018. Jidong Zhai, Tsinghua University. Introduction: Deep learning has been widely used in many areas. There are many deep learning frameworks, compute libraries, and acceleration devices.


slide-1
SLIDE 1

Jidong Zhai

Tsinghua University

Deep500 BOF 2018

slide-2
SLIDE 2

Introduction

  • Deep learning has been widely used in many areas
slide-3
SLIDE 3

Introduction

  • There are many deep learning frameworks, compute libraries, and acceleration devices
  • Frameworks: CNTK, ···
  • Compute Libraries: BLAS, ···
  • Compute Devices: TPU, ···

slide-4
SLIDE 4

Introduction

  • However, how to evaluate?
  • Benchmark: ? ? ?
  • Frameworks: CNTK, ···
  • Compute Libraries: BLAS, ···
  • Compute Devices: TPU, ···

slide-5
SLIDE 5

Introduction

  • However, how to evaluate?
  • Benchmark: ? ? ?
  • Frameworks: CNTK, ···
  • Compute Libraries: BLAS, ···
  • Compute Devices: TPU, ···
  • Which is better? Running time, resource use, scalability, efficiency, …
  • Set optimization target
  • Promote development
slide-6
SLIDE 6

Related Deep Learning Benchmarks

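convnet-benchmarks (1)

  • Target: Framework, Compute Library
  • Model granularity: Neural Network
  • Model diversity: Only CNN
  • Dataset: ImageNet
  • Metrics: Time per iteration

DeepBench (2)

  • Target: Compute Library, Compute Device
  • Model granularity: Basic Operation
  • Model diversity: Training, Inference
  • Dataset: Dummy Data
  • Metrics: Time

DAWNBench (3)

  • Target: Compute Library, Framework
  • Model granularity: Neural Network
  • Model diversity: 2 CNN + 1 RNN
  • Dataset: CIFAR10, ImageNet, SQuAD
  • Metrics: Training time and cost to a certain accuracy

TensorFlow Benchmark (4)

  • Target: Framework
  • Model granularity: Neural Network
  • Model diversity: 4 CNN
  • Dataset: ImageNet
  • Metrics: Total training time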

  • 1. convnet-benchmarks: https://github.com/soumith/convnet-benchmarks
  • 2. Baidu DeepBench: https://github.com/baidu-research/DeepBench
  • 3. Cody A. Coleman et al. DAWNBench: An End-to-End Deep Learning Benchmark and Competition. NIPS 2017
  • 4. TensorFlow Benchmark: https://www.tensorflow.org/performance/benchmarks

Low diversity, limited dataset, single metric

slide-7
SLIDE 7
slide-8
SLIDE 8

Related Deep Learning Benchmarks

MLPerf (1)

  • Evaluation target: Framework, Compute Device
  • Characteristics (granularity): Neural Network
  • Characteristics (diversity):
  • 1. Image (Classification, Detection)
  • 2. NLP (Translation, Sentiment Analysis)
  • 3. Speech (Recognition)
  • 4. Reinforcement Learning & Recommendation
  • Dataset: ImageNet, COCO, WMT, Librispeech, MovieLens, …
  • Evaluation metrics: Training time, power use, and cost to a certain accuracy

Various applications, various datasets

  • 1. https://mlperf.org/

slide-9
SLIDE 9

How to evaluate HPC systems for machine learning?

slide-10
SLIDE 10

Our Work on Workload Analysis for Deep Learning

  • Preliminary workload analysis
  • Applications: Image Classification, Machine Translation, Language Model, Question Answering
  • Models: VGG, ResNet, Seq2seq, RNN LM, AoA Reader
  • Dataset
  • Real Data: WikiText-2, CBTest, Cifar, Tatoeba
  • Dummy Data: Real time, Controllable, Easy to obtain, Generative
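
The properties listed for dummy data can be illustrated with a small generator. This is a minimal sketch under stated assumptions, not the authors' tooling: `make_dummy_batch`, its parameters, and the value ranges are hypothetical. A fixed seed makes the data controllable and reproducible, and batches are generated on the fly instead of being read from disk.

```python
import random

def make_dummy_batch(batch_size, height, width, channels, num_classes, seed=0):
    """Generate one batch of synthetic image-like data and labels.

    Hypothetical helper for illustration: shapes and value ranges are
    assumptions, not taken from the slides.
    """
    rng = random.Random(seed)  # seeded so the same batch can be regenerated
    pixels_per_image = height * width * channels
    images = [
        [rng.random() for _ in range(pixels_per_image)]  # values in [0, 1)
        for _ in range(batch_size)
    ]
    labels = [rng.randrange(num_classes) for _ in range(batch_size)]
    return images, labels

images, labels = make_dummy_batch(
    batch_size=4, height=8, width=8, channels=3, num_classes=10
)
print(len(images), len(images[0]), len(labels))
```

Changing only `height` and `width` gives a direct way to vary the input size while keeping everything else fixed.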
slide-11
SLIDE 11

Our Work

  • Time
  • Time of every operation type within one iteration
  • Time of phases within one iteration

[Figure: time (ms) within one iteration for VGG, ResNet, RNN LM, AoA Reader, and Seq2seq, broken down into Data, Forward, Backward, Loss, and Update]
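
A per-phase breakdown like the one above can be collected by wrapping each phase of an iteration in a timer. The sketch below is an assumption about how such a measurement could be structured, not the harness used for the slides: the toy one-parameter model and the `timed` helper are hypothetical, and only the five phase names come from the slide.

```python
import random
import time
from collections import defaultdict

def timed(phase, totals, fn, *args):
    """Run fn(*args) and add its wall-clock time to totals[phase]."""
    start = time.perf_counter()
    result = fn(*args)
    totals[phase] += time.perf_counter() - start
    return result

# Toy 1-D linear model y = w * x with squared error (illustrative only).
def load_batch(rng, n):
    xs = [rng.random() for _ in range(n)]
    return xs, [3.0 * x for x in xs]  # targets generated from a slope of 3

def forward(w, xs):
    return [w * x for x in xs]

def loss_fn(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def backward(preds, xs, ys):
    # Gradient of the mean squared error with respect to w.
    return sum(2 * (p - y) * x for p, x, y in zip(preds, xs, ys)) / len(ys)

def update(w, grad, lr):
    return w - lr * grad

rng = random.Random(0)
totals = defaultdict(float)
w = 0.0
for _ in range(200):
    xs, ys = timed("Data", totals, load_batch, rng, 256)
    preds = timed("Forward", totals, forward, w, xs)
    loss = timed("Loss", totals, loss_fn, preds, ys)
    grad = timed("Backward", totals, backward, preds, xs, ys)
    w = timed("Update", totals, update, w, grad, 0.5)

for phase, seconds in totals.items():
    print(f"{phase:8s} {seconds * 1000:.2f} ms")
```

On a GPU, kernels are launched asynchronously, so a real harness would need to synchronize the device before reading the clock at each phase boundary.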

slide-12
SLIDE 12

Workload Analysis

  • Memory Usage
  • Memory Usage Break Down
  • Memory Usage – Input Size

[Figure: memory use (MB) for VGG, ResNet, RNN LM, AoA Reader, and Seq2seq, broken down into Weight and Intermediate Result + Temp]

[Figure: training and inference memory use (MB) and training/inference ratio versus picture area (pixel²)]

[Figure: training and inference memory use (MB) and training/inference ratio versus sequence length]
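
The split between weights and intermediate results, and its dependence on input size, can be approximated with simple arithmetic. The sketch below is a rough estimate under stated assumptions, not the measurement method behind the figures: the layer stack in `LAYERS` is hypothetical, convolutions are 3x3 with stride 1 and same padding, values are float32, and gradient and workspace buffers are ignored.

```python
BYTES_PER_FLOAT32 = 4

# Hypothetical stack of 3x3 convolutions: (in_channels, out_channels).
LAYERS = [(3, 64), (64, 64), (64, 128)]

def weight_bytes(layers, kernel=3):
    """Parameter memory (weights + biases); independent of picture size."""
    params = sum((cin * kernel * kernel + 1) * cout for cin, cout in layers)
    return params * BYTES_PER_FLOAT32

def activation_bytes(layers, height, width, batch=1):
    """Output size of each layer, assuming stride 1 and same padding."""
    return [batch * cout * height * width * BYTES_PER_FLOAT32 for _, cout in layers]

def training_bytes(layers, height, width, batch=1):
    # Training keeps every layer output for the backward pass.
    return weight_bytes(layers) + sum(activation_bytes(layers, height, width, batch))

def inference_bytes(layers, height, width, batch=1):
    # Simplification: only the largest single layer output is counted as live.
    return weight_bytes(layers) + max(activation_bytes(layers, height, width, batch))

for side in (32, 64, 128):
    train = training_bytes(LAYERS, side, side)
    infer = inference_bytes(LAYERS, side, side)
    print(f"area={side * side:6d} train={train / 2**20:7.2f} MB "
          f"infer={infer / 2**20:7.2f} MB ratio={train / infer:.2f}")
```

Under these assumptions the weight term stays constant while the activation term grows linearly with picture area, so the training/inference ratio changes with input size.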

slide-13
SLIDE 13

Workload Characterization

  • Hardware Counters
  • For GPU

[Figure: normalized GPU hardware counters (GPU Occupancy, Warp Execution Efficiency, Warp Non-Pred Execution Efficiency, Bandwidth Utilization, TFLOPS); data labels: 1, 0.46, 1.00, 1.00, 4.02, 5.65]
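
The counter names above match metrics commonly reported by NVIDIA's profiling tools. The sketch below assumes the usual ratio definitions of those metrics; the slides do not state how the counters were collected, and every input value here is a hypothetical sample, not a measurement from the talk.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def occupancy(avg_active_warps_per_cycle, max_warps_per_sm):
    """Average active warps per cycle over the maximum an SM supports."""
    return avg_active_warps_per_cycle / max_warps_per_sm

def warp_execution_efficiency(avg_active_threads_per_warp):
    """Average active threads per warp over the warp size."""
    return avg_active_threads_per_warp / WARP_SIZE

def warp_nonpred_execution_efficiency(avg_nonpredicated_threads_per_warp):
    """Same ratio, counting only threads executing non-predicated instructions."""
    return avg_nonpredicated_threads_per_warp / WARP_SIZE

def bandwidth_utilization(bytes_moved, seconds, peak_bytes_per_second):
    """Achieved memory bandwidth over the device's peak bandwidth."""
    return bytes_moved / seconds / peak_bytes_per_second

# Hypothetical sample inputs, chosen only to exercise the formulas.
print(f"occupancy            {occupancy(29.44, 64):.2f}")
print(f"warp efficiency      {warp_execution_efficiency(32):.2f}")
print(f"non-pred efficiency  {warp_nonpred_execution_efficiency(30.4):.2f}")
print(f"bandwidth util.      {bandwidth_utilization(450e9, 1.0, 900e9):.2f}")
```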

slide-14
SLIDE 14

Questions about an HPC Oriented Deep Learning Benchmark

  • Questions we need to think about:
  • Model Selection
  • Various application areas?
  • A synthetic model with main features?
  • Dataset
  • Fixed dataset (ImageNet)?
  • Generated data?
  • Metrics
  • Time for training?
  • GFLOPS?
  • AI operations per second?
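
To make the GFLOPS option concrete, the sketch below counts the floating-point operations of a dense matrix multiply and divides by the measured time. It is an illustration under stated assumptions, not a proposed benchmark kernel: the pure-Python `matmul` and the `measure_gflops` helper are hypothetical, and an (m x k) by (k x n) product is counted as 2·m·n·k operations.

```python
import time

def matmul(a, b):
    """Naive dense matrix multiply on lists of lists."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def matmul_flops(m, n, k):
    """Count one multiply and one add per inner-product term."""
    return 2 * m * n * k

def measure_gflops(m, n, k, repeats=3):
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - start)  # keep the fastest run
    return matmul_flops(m, n, k) / best / 1e9

print(f"{measure_gflops(64, 64, 64):.4f} GFLOPS")
```

A rate like this measures arithmetic throughput only. Time for training also depends on convergence, which is one reason the two metrics in the list above can rank systems differently.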
slide-15
SLIDE 15