TEA: Enabling State-Intensive Network Functions on Programmable Switches — PowerPoint PPT Presentation


SLIDE 1

TEA: Enabling State-Intensive Network Functions on Programmable Switches

Daehyeok Kim§‡ Zaoxing Liu§, Yibo Zhu^, Changhoon Kim†, Jeongkeun Lee†, Vyas Sekar§, Srinivasan Seshan§

§Carnegie Mellon University ‡Microsoft Research †Intel, Barefoot Switch Division ^ByteDance Inc

SLIDE 2

Network functions in the network

Network functions (NFs) are an essential component of modern networks

  • E.g., load balancer, firewall, network address translator (NAT), …

NF performance and scalability are key challenges

SLIDE 3

Approaches to deploying network functions

[Diagram: a commodity server with a virtualization layer running NAT/LB/FW, a programmable switch ASIC running NAT/LB/FW, and standalone hardware appliances]

                   Standalone hardware   Server-based (NFV)   Switch-based NF
  Programmability  Fixed-function        Programmable         Programmable
  Performance      O(10 Gbps)            O(10 Gbps)           O(1 Tbps)
  Memory           O(10 GB) DRAM         O(10 GB) DRAM        O(10 MB) SRAM
  Price            >$40K                 $3K                  $10K

SLIDE 4

Problem: serving demanding workloads

None of the options can efficiently serve demanding workloads!

  • Millions of concurrent flows (O(100 MB) of state) + high traffic rate (>1 Tbps)

[Diagram: the three deployment options from Slide 3, annotated]

  • Standalone hardware and server-based NFV: cost- and energy-inefficient ☹
  • Switch-based NF: promising, but cannot maintain flow state ☹

SLIDE 5

Root cause: limited on-chip SRAM

Limited on-chip SRAM space (O(10 MB)) makes it infeasible to maintain large flow state on-chip

  • E.g., LB state for 10M flows requires ≈100 MB

Adding more SRAM would be too expensive ☹

Can we leverage larger and cheaper DRAM near switch ASICs?

SLIDE 6

DRAM available on a switch board

[Diagram: switch board with the switch data plane (ASIC: pipeline stages, on-chip SRAM) and the switch control plane (CPU and DRAM, attached over PCIe); a second variant adds an on-board DRAM extension next to the ASIC]

Option #1: DRAM on the switch control plane
Option #2: On-board, off-chip DRAM

Both options fall short:
  • Limited scalability in terms of size and access bandwidth
  • High cost

SLIDE 7

Opportunity: DRAM on commodity servers

+ Scalable memory size and bandwidth
+ Low cost

[Diagram: switch board (data plane ASIC with pipeline stages and on-chip SRAM; control plane) connected to DRAM on a server]

SLIDE 8

Table Extension Architecture (TEA)

[Diagram: switch data plane connected to a key-value table in server DRAM]

  • Virtual table abstraction for state-intensive NFs using external DRAM on servers
  • Table lookups with low and predictable latency and scalable throughput
  • APIs that allow easy integration with NFs

SLIDE 9

Outline

  • Motivation
  • TEA design
  • Results

SLIDE 10

TEA design overview

[Diagram: the developer writes an NF implementation against the TEA P4 API; the P4 compiler produces a binary that runs on the switch data plane (ASIC); table state lives in server DRAM]

SLIDE 11

Strawman: accessing external DRAM via the control plane

Challenge 1: Enabling external DRAM access from switch ASIC

[Diagram: a packet arrives at the switch ASIC; the lookup detours through the switch control plane to reach server DRAM]

SLIDE 12

Strawman: accessing external DRAM via the control plane

Challenge 1: Enabling external DRAM access from switch ASIC

[Diagram: as before, the lookup detours through the control-plane CPU]

High and unpredictable latency ☹  Limited access bandwidth ☹

SLIDE 13

Challenge 1: Enabling external DRAM access from switch ASIC

[Diagram: switch board with on-chip SRAM; server DRAM; packet]

How to enable the switch ASIC to directly access external DRAM without CPUs and hardware modifications?

SLIDE 14

Challenge 2: Enabling single round-trip table lookup

[Diagram: a key-value table in server DRAM maps a 5-tuple key to an address; a packet at the switch ASIC triggers a lookup]

How to ensure the correct table entry can be retrieved in a single memory access?

SLIDE 15

Challenge 3: Deferred packet processing

[Diagram: a packet waits at the switch ASIC while its lookup is in flight to the key-value table in server DRAM]

How to defer packet processing during lookups without stalling the pipeline or buffering at the switch?

SLIDE 16

Challenge 4: Scaling TEA with multiple servers

With a single server, scalability and availability can be limited

[Diagram: switch ASIC connected to a single DRAM server]

  • A single server can fail
  • Access bandwidth is limited to a single link

SLIDE 17

Challenge 4: Scaling TEA with multiple servers

[Diagram: switch ASIC connected to multiple DRAM servers]

How to deal with changes in server availability? How to balance access load across servers?

SLIDE 18

Challenge 1: How to access external DRAM in the data plane?

Is it possible to enable ASICs to access external DRAM without hardware modifications and CPU involvement?

[Diagram: switch board with on-chip SRAM; server DRAM; packet]

Switch ASICs do not have direct external DRAM access capability!

SLIDE 19

Enabling RDMA in the switch data plane

Key idea: Crafting RDMA packets using ASIC’s programmability

  • 1. A packet comes into the pipeline

[Diagram: the packet enters the switch ASIC's pipeline]

SLIDE 20

Enabling RDMA in the switch data plane

Key idea: Crafting RDMA packets using ASIC’s programmability

  • 1. A packet comes into the pipeline
  • 2. The ASIC adds RDMA headers to craft an RDMA request

[Diagram: the packet leaves the ASIC toward the server NIC with an RDMA header (RDMA-H) attached]

SLIDE 21

Enabling RDMA in the switch data plane

Key idea: Crafting RDMA packets using ASIC’s programmability

  • 1. A packet comes into the pipeline
  • 2. The ASIC adds RDMA headers to craft an RDMA request
  • 3. The server NIC replies as it would for any standard RDMA request

[Diagram: the NIC's response, an RDMA header plus payload, heads back to the ASIC]

SLIDE 22

Enabling RDMA in the switch data plane

Key idea: Crafting RDMA packets using ASIC’s programmability

  • 1. A packet comes into the pipeline
  • 2. The ASIC adds RDMA headers to craft an RDMA request
  • 3. The server NIC replies as it would for any standard RDMA request
  • 4. The ASIC parses the response

[Diagram: the ASIC parses the returned payload]

SLIDE 23

Enabling RDMA in the switch data plane

Key idea: Crafting RDMA packets using ASIC’s programmability

  • 1. A packet comes into the pipeline
  • 2. The ASIC adds RDMA headers to craft an RDMA request
  • 3. The server NIC replies as it would for any standard RDMA request
  • 4. The ASIC parses the response

[Diagram: the ASIC parses the returned payload]

Simple switch-side flow control prevents buffer overflows at the NIC! → A simplified transport is enough!
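The crafted request must look like a valid RoCEv2 RDMA READ on the wire: a UDP datagram to port 4791 carrying a Base Transport Header (BTH) followed by an RDMA Extended Transport Header (RETH). As a rough illustration of just the header layout — not the paper's P4 code, and with every field value here (queue pair number, PSN, rkey, virtual address) a made-up placeholder — a Python sketch:

```python
import struct

RDMA_READ_REQUEST = 0x0C  # RC RDMA READ Request opcode (IBTA spec)

def build_read_request(dest_qp, psn, vaddr, rkey, length):
    """Pack a 12-byte BTH + 16-byte RETH for a RoCEv2 RDMA READ.
    (The trailing ICRC and the outer Ethernet/IP/UDP headers are
    omitted; a real switch deparser would emit those too.)"""
    # BTH: opcode, flags (SE/M/Pad/TVer), P_Key, reserved+DestQP, Ack+PSN
    bth = struct.pack("!BBHII",
                      RDMA_READ_REQUEST,
                      0x40,                 # migration-state bit, a common default
                      0xFFFF,               # default partition key
                      dest_qp & 0xFFFFFF,   # high byte reserved
                      psn & 0xFFFFFF)       # high byte: ack-request/reserved
    # RETH: 64-bit remote virtual address, 32-bit rkey, 32-bit DMA length
    reth = struct.pack("!QII", vaddr, rkey, length)
    return bth + reth

# Placeholder values purely for illustration.
hdr = build_read_request(dest_qp=0x11, psn=0,
                         vaddr=0x7F0000000000, rkey=0x1234, length=64)
```

The same byte layout is what the ASIC's deparser would prepend to a packet to turn it into a read of one TEA-table bucket.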

SLIDE 24

Challenge 2: Single round-trip table lookups

Can we enable external table lookups in a single round trip? I.e., we need an O(1) lookup mechanism

[Diagram: switch board with on-chip SRAM; server DRAM; packet]

An RDMA read takes ~2 μs: multiple accesses can result in high and unpredictable table lookup latency ☹

SLIDE 25

Cuckoo hashing as a potential approach

[Diagram: a cuckoo hash table in server DRAM; each key maps to buckets via two hash functions]

SLIDE 26

Cuckoo hashing as a potential approach

Can we enable table lookup with a single memory access?

[Diagram: cuckoo hash table in server DRAM; hash1(x) and hash2(x) point to two different, non-adjacent buckets]
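The problem with cuckoo hashing here: a key may reside in either of two independently hashed buckets, so a remote lookup can need two separate reads — two round trips — in the worst case. A minimal sketch of that lookup logic (plain Python; the hash functions and bucket layout are made up for illustration):

```python
def cuckoo_lookup(table, key, h1, h2):
    """Cuckoo lookup over a list of buckets, where each bucket is a
    list of (key, value) pairs. Counts how many buckets must be
    fetched; done remotely, each fetch would be one RDMA read."""
    reads = 0
    for h in (h1, h2):
        reads += 1
        for k, v in table[h(key) % len(table)]:
            if k == key:
                return v, reads
    return None, reads

# Toy table with 4 buckets and two simple, illustrative hash functions.
h1 = lambda k: k
h2 = lambda k: k * 7 + 3
table = [[] for _ in range(4)]
table[h2(5) % 4].append((5, "entry-for-5"))  # key stored in its *second* bucket

value, reads = cuckoo_lookup(table, 5, h1, h2)
```

Because the entry landed in its second candidate bucket, the lookup costs two reads — with ~2 μs per RDMA read, that is the unpredictability the slide is pointing at.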

SLIDE 27

TEA-table: lookup data structure

[Diagram: the TEA-table in server DRAM]

Key idea: Repurposing bounded linear probing (BLP)

Designed for improving the cache hit rate in software switches [Zhou'19*]

*Dong Zhou. Data Structure Engineering for High Performance Software Packet Processing. Ph.D. dissertation, Carnegie Mellon University, 2019.

SLIDE 28

TEA-table: lookup data structure

[Diagram: the TEA-table in server DRAM]

Key idea: Repurposing bounded linear probing (BLP)

Trading space efficiency for contiguous allocation

SLIDE 29

TEA-table: lookup data structure

[Diagram: the TEA-table in server DRAM; hash(x) selects a bucket, and the adjacent bucket is fetched in the same read]

Key idea: Repurposing bounded linear probing (BLP)

Can read two buckets in one RDMA read!
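With BLP, a key lives either in bucket h(x) or the next bucket over, and those two candidates are adjacent in memory, so one contiguous read covers both. A rough sketch of insert and lookup (plain Python; the bucket size and hash are illustrative, not the paper's exact parameters):

```python
def blp_insert(table, key, value, h, slots_per_bucket=4):
    """Insert into bucket i = h(key) or bucket i+1 (wrapping).
    Unlike cuckoo hashing there is no displacement: if both
    candidate buckets are full the insert simply fails — the
    space-efficiency trade for contiguous placement."""
    i = h(key) % len(table)
    for b in (i, (i + 1) % len(table)):
        if len(table[b]) < slots_per_bucket:
            table[b].append((key, value))
            return True
    return False

def blp_lookup(table, key, h):
    """Both candidate buckets are adjacent, so done remotely this is
    a single RDMA read of two bucket-sized regions starting at
    base_addr + i * bucket_size."""
    i = h(key) % len(table)
    fetched = table[i] + table[(i + 1) % len(table)]  # one contiguous read
    for k, v in fetched:
        if k == key:
            return v
    return None

table = [[] for _ in range(8)]
h = lambda k: k * 13 + 7  # illustrative hash
assert blp_insert(table, 42, "flow-state", h)
```

One hash, one address computation, one read: that is the O(1), single-round-trip lookup the previous slide asked for.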

SLIDE 30

Challenge 3: Deferred packet processing

Can we defer only select packets without stalling the pipeline?

[Diagram: switch board with on-chip SRAM; server DRAM; packet]

Packet processing needs to be deferred until a lookup completes

SLIDE 31

Offloading packet store to TEA-table for asynchronous packet processing

Idea #1: Employing a scratchpad in the TEA-table to buffer packets

[Diagram: TEA-table buckets in server DRAM with a scratchpad region]


SLIDE 33

Offloading packet store to TEA-table for asynchronous packet processing

Idea #1: Employing a scratchpad in the TEA-table to buffer packets

[Diagram: the packet is written into its bucket's scratchpad slot via RDMA-Write]

SLIDE 34

Offloading packet store to TEA-table for asynchronous packet processing

Idea #1: Employing a scratchpad in the TEA-table to buffer packets

[Diagram: an RDMA-Read is issued while the packet sits buffered in the scratchpad]

SLIDE 35

Offloading packet store to TEA-table for asynchronous packet processing

Idea #1: Employing a scratchpad in the TEA-table to buffer packets

[Diagram: the bucket and the buffered packet return to the ASIC]

Due to the ASIC's parser limitation, the original packet and the bucket cannot be parsed efficiently ☹

SLIDE 36

Offloading packet store to TEA-table for asynchronous packet processing

Idea #2: Employing a shadow table in the TEA-table

[Diagram: the TEA-table in server DRAM gains a shadow table alongside the scratchpad]

Slight increase in DRAM usage

SLIDE 37

Offloading packet store to TEA-table for asynchronous packet processing

Idea #2: Employing a shadow table in the TEA-table

[Diagram: the returned buckets and the buffered packet arrive at the ASIC]

The ASIC can parse the buckets and process the packet with the right table entry ☺
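The deferred-processing flow on these slides can be modeled roughly as follows (plain Python; the per-bucket scratchpad slot and the combined read response are modeled with local lists, and all names here are illustrative, not TEA's actual API):

```python
class TeaTableModel:
    """Toy model of TEA's offloaded packet store: on an SRAM miss,
    the switch (1) RDMA-writes the packet into the scratchpad slot
    of its key's bucket, then (2) issues one RDMA read whose
    response carries both the bucket's entries and the buffered
    packet — so processing resumes with no on-switch buffering."""

    def __init__(self, nbuckets, h):
        self.entries = [dict() for _ in range(nbuckets)]
        self.scratch = [None] * nbuckets
        self.h = h

    def bucket(self, key):
        return self.h(key) % len(self.entries)

    def offload(self, key, packet):      # models the RDMA write
        self.scratch[self.bucket(key)] = packet

    def fetch(self, key):                # models the single RDMA read
        i = self.bucket(key)
        packet, self.scratch[i] = self.scratch[i], None
        return self.entries[i].get(key), packet

t = TeaTableModel(16, h=lambda k: k * 31)
t.entries[t.bucket(7)][7] = "nat-entry"
t.offload(7, b"deferred-packet")
entry, pkt = t.fetch(7)
```

The shadow table's role in the real design is to lay these pieces out so the ASIC's restricted parser can handle the combined response; this sketch only captures the "write, then read both back" dance.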

SLIDE 38

Challenge 4: Scaling TEA with multiple servers

See paper for details!

[Diagram: switch ASIC connected to multiple DRAM servers]

  • A small on-chip cache for load balancing and higher throughput
  • Repurposing ASIC capabilities to react to changes in server availability
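Scaling out could look roughly like the following (plain Python; simple hash partitioning over the live server set, which is only a stand-in for the paper's actual mechanism):

```python
def pick_server(key, servers, alive, h):
    """Partition the key space across DRAM servers by hashing,
    restricted to the currently live set; when availability
    changes, keys re-map over the remaining servers."""
    live = [s for s in servers if alive[s]]
    return live[h(key) % len(live)]

servers = ["s0", "s1", "s2", "s3"]
alive = {s: True for s in servers}
owner_before = pick_server(100, servers, alive, h=lambda k: k)

alive[owner_before] = False  # that server fails
owner_after = pick_server(100, servers, alive, h=lambda k: k)
```

Plain modulo hashing re-maps many keys on a membership change; the point of the sketch is only the reaction to availability, with the small on-chip cache (per the slide) absorbing hot keys for load balancing.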

SLIDE 39

Putting it all together

TEA provides a virtual table abstraction to state-intensive NFs:

  • Leveraging the programmability of ASICs to access external DRAM in the data plane
  • Repurposing BLP for constant-time lookups
  • Offloading the packet store for asynchronous packet processing
  • A small on-chip cache for scalability

[Diagram: switch board with on-chip SRAM; server DRAM; packet]

SLIDE 40

Outline

  • Motivation
  • TEA design
  • Results

SLIDE 41

Implementation and evaluation setup

Implemented the TEA API as P4 modules (i.e., control blocks in P4)

Implemented canonical NFs including a NAT and a stateful firewall

  • Loaded 10 million table entries

Testbed setup: Tofino-based switch + 12 servers with RDMA NICs

SLIDE 42

Does TEA access channel provide low and predictable lookup latency?

[Plot: 99th-percentile lookup latency (μs) vs. packet size (64–1500 bytes) for TEA and for a raw RDMA read]

SLIDE 43

How does TEA affect NF processing latency?

[Plot: CDF of NF processing latency (μs) for a baseline (SRAM only) and TEA under uniform, Zipf (α=0.9, α=0.99), and real-trace workloads; packets served by the on-chip cache and those served by the TEA-table form two latency regions]

Flow-size skewness (α) affects the on-chip cache hit rate
SLIDE 44

Does TEA provide scalable throughput?

[Plot: million lookups/second vs. number of DRAM servers (1, 2, 4, 8), with no cache and with a cache under uniform and Zipf (α=0.95, α=0.99) workloads; a reference line marks the traffic generation rate]

SLIDE 45

Other results

  • Cost efficiency of TEA-enabled NFs compared to server-based counterparts
  • Handling server failures within a second with a slight throughput degradation
  • Less than 9% of switch ASIC resource usage

See paper for more details!

SLIDE 46

Conclusion

Limited on-chip memory restricts the potential of switch-based NFs

TEA, a Table Extension Architecture for programmable switches:

  • RDMA-based external DRAM access
  • BLP-based efficient lookup table structure
  • Asynchronous packet processing by offloading the packet store
  • A small on-chip cache for scalability

TEA provides low and predictable latency with scalable throughput:

  • Latency: 1.8–2.2 μs
  • Throughput: 138 million lookups/sec with 8 servers