SLIDE 1

Downloading a Billion Files in Python

A case study in multi-threading, multi-processing, and asyncio

James Saryerwinnie  @jsaryer

SLIDES 2-5

Our Task

There is a remote server that stores files. The files can be accessed through a REST API. Our task is to download all the files on the remote server to our client machine.

SLIDES 6-14

Our Task (the details)

What client machine will this run on? We have one machine we can use: 16 cores, 64GB memory.
What about the network between the client and server? Our client machine is on the same network as the service with the remote files.
How many files are on the remote server? Approximately one billion files, 100 bytes per file.
When do you need this done? Please have this done as soon as possible.

SLIDES 15-20

File Server REST API

GET /list
  -> {"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}

GET /list?next-marker={token}
  -> {"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}

GET /get/{filename}
  -> (File blob content)

A /list response with no "NextMarker" key is the last page.

SLIDES 21-22

Caveats

This is a simplified case study; the results shown here don't necessarily generalize. It's not an apples-to-apples comparison: each approach does things slightly differently. Sometimes concrete examples can be helpful, but always profile and test for yourself.

SLIDE 23

Synchronous Version

Simplest thing that could possibly work.

SLIDES 24-38

Synchronous

(Animation: pages are listed and files downloaded one at a time, sequentially.)

SLIDES 39-44

def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    get_url = f'{hostname}/get'
    response = requests.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        for filename in content['FileNames']:
            remote_url = f'{get_url}/{filename}'
            download_file(remote_url, os.path.join(outdir, filename))
        if 'NextMarker' not in content:
            break
        response = requests.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)

SLIDE 45

def download_file(remote_url, local_filename):
    response = requests.get(remote_url)
    response.raise_for_status()
    with open(local_filename, 'wb') as f:
        f.write(response.content)

SLIDES 46-50

Synchronous Results

One request: 0.003 seconds
One billion requests: 3,000,000 seconds = 833.3 hours = 34.7 days
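The extrapolations used throughout the talk are straight multiplication; a small helper (not from the talk) reproduces the numbers:

```python
def extrapolate(seconds_per_request, n=1_000_000_000):
    # Total time for n sequential requests, in seconds, hours, and days.
    total = seconds_per_request * n
    return total, total / 3600, total / 3600 / 24

sync = extrapolate(0.003)  # the synchronous case above
```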

SLIDES 51-58

Multithreading

List Files can't be parallelized, but Get File can. One thread calls List Files and puts the filenames on a queue.Queue. Worker threads (WorkerThread-1, WorkerThread-2, WorkerThread-3, ...) pull filenames off the queue and download the files in parallel. Workers put outcomes on a results queue; a result thread prints progress and tracks overall results, failures, etc.
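The result thread's code isn't shown in the deck. A hypothetical sketch of what `result_poller` could look like (the `_SUCCESS`/`_SHUTDOWN` sentinels and the progress interval are assumptions, not the talk's code):

```python
import queue

_SUCCESS = object()   # sentinel placed on the queue per completed download
_SHUTDOWN = object()  # sentinel telling the poller to exit

def result_poller(result_queue, progress_every=10000):
    # Drain the results queue, tracking totals and printing progress.
    completed = 0
    failures = 0
    while True:
        result = result_queue.get()
        if result is _SHUTDOWN:
            break
        if result is _SUCCESS:
            completed += 1
        else:
            failures += 1  # anything else is treated as a failure record
        if (completed + failures) % progress_every == 0:
            print(f'{completed} downloaded, {failures} failed')
    return completed, failures

# Tiny demonstration: five successes, one failure, then shutdown.
demo_queue = queue.Queue()
for _ in range(5):
    demo_queue.put(_SUCCESS)
demo_queue.put('error-detail')
demo_queue.put(_SHUTDOWN)
counts = result_poller(demo_queue)
```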
SLIDES 59-60

def download_files(host, port, outdir, num_threads):
    # ... same constants as before ...
    work_queue = queue.Queue(MAX_SIZE)
    result_queue = queue.Queue(MAX_SIZE)
    threads = []
    for i in range(num_threads):
        t = threading.Thread(
            target=worker_thread,
            args=(work_queue, result_queue))
        t.start()
        threads.append(t)
    result_thread = threading.Thread(
        target=result_poller,
        args=(result_queue,))
    result_thread.start()
    threads.append(result_thread)
    # ...

SLIDES 61-62

response = requests.get(list_url)
response.raise_for_status()
content = json.loads(response.content)
while True:
    for filename in content['FileNames']:
        remote_url = f'{get_url}/{filename}'
        outfile = os.path.join(outdir, filename)
        work_queue.put((remote_url, outfile))
    if 'NextMarker' not in content:
        break
    response = requests.get(
        f'{list_url}?next-marker={content["NextMarker"]}')
    response.raise_for_status()
    content = json.loads(response.content)

SLIDES 63-64

def worker_thread(work_queue, result_queue):
    while True:
        work = work_queue.get()
        if work is _SHUTDOWN:
            return
        remote_url, outfile = work
        download_file(remote_url, outfile)
        result_queue.put(_SUCCESS)
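The deck doesn't show shutdown; as noted later, the real code also handles ctrl-c and graceful termination. A minimal sketch of one common pattern (one `_SHUTDOWN` sentinel per worker, then join; all names here are illustrative):

```python
import queue
import threading

_SHUTDOWN = object()

def shutdown_workers(work_queue, threads):
    # Signal each worker to exit, then wait for all of them to finish.
    for _ in threads:
        work_queue.put(_SHUTDOWN)  # one sentinel per worker thread
    for t in threads:
        t.join()

# Demonstration with workers that just consume items.
def worker(work_queue, seen):
    while True:
        item = work_queue.get()
        if item is _SHUTDOWN:
            return
        seen.append(item)

work_queue = queue.Queue()
seen = []
threads = [threading.Thread(target=worker, args=(work_queue, seen))
           for _ in range(3)]
for t in threads:
    t.start()
for i in range(10):
    work_queue.put(i)
shutdown_workers(work_queue, threads)
```

Because the sentinels are enqueued after the work items, every item is drained before any worker exits.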

SLIDES 65-67

Multithreaded Results - 10 threads

One request: 0.0036 seconds
One billion requests: 3,600,000 seconds = 1000.0 hours = 41.6 days

SLIDES 68-70

Multithreaded Results - 100 threads

One request: 0.0042 seconds
One billion requests: 4,200,000 seconds = 1166.67 hours = 48.6 days

SLIDE 71

Why?

Not necessarily IO bound, due to the low latency and small file size. GIL contention, plus the overhead of passing data through queues.

SLIDE 72

Things to keep in mind

The real code is more complicated: ctrl-c, graceful shutdown, etc. Debugging is much harder; failures are non-deterministic. The more you stray from stdlib abstractions, the more likely you are to encounter race conditions. Can't use concurrent.futures map() because of the large number of files.
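The map() point can be made concrete: `executor.map()` (and a naive submit loop) eagerly creates one future per input, which is unworkable for a billion filenames. A bounded-submission sketch (`bounded_submit` is a made-up helper, not from the talk) keeps only a fixed window of futures in flight:

```python
from concurrent import futures
import itertools

def bounded_submit(executor, fn, args_iter, max_outstanding=100):
    # Yield results while never holding more than max_outstanding futures.
    # We only pull from the (possibly huge) iterator as capacity frees up.
    args_iter = iter(args_iter)
    outstanding = {executor.submit(fn, a)
                   for a in itertools.islice(args_iter, max_outstanding)}
    while outstanding:
        done, outstanding = futures.wait(
            outstanding, return_when=futures.FIRST_COMPLETED)
        for fut in done:
            yield fut.result()
        # Refill the window with as many new submissions as just completed.
        for a in itertools.islice(args_iter, len(done)):
            outstanding.add(executor.submit(fn, a))

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = sorted(bounded_submit(executor, lambda x: x * x, range(50),
                                    max_outstanding=8))
```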

SLIDE 73

Multiprocessing

SLIDE 74

Our Task (the details)

(Recap: one client machine, 16 cores, 64GB memory.)

SLIDES 75-78

Multiprocessing

Download one page at a time, in parallel across multiple processes (WorkerProcess-1, WorkerProcess-2, WorkerProcess-3, ...).

SLIDES 79-82

from concurrent import futures

def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    all_pages = iter_all_pages(list_url)
    downloader = Downloader(host, port, outdir)
    with futures.ProcessPoolExecutor() as executor:
        for page in all_pages:
            # Start parallel downloads.
            future_to_filename = {}
            for filename in page:
                future = executor.submit(downloader.download, filename)
                future_to_filename[future] = filename
            # Wait for downloads to finish.
            for future in futures.as_completed(future_to_filename):
                future.result()

SLIDE 83

def iter_all_pages(list_url):
    session = requests.Session()
    response = session.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        yield content['FileNames']
        if 'NextMarker' not in content:
            break
        response = session.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)

SLIDE 84

class Downloader:
    # ...
    def download(self, filename):
        remote_url = f'{self.get_url}/{filename}'
        response = self.session.get(remote_url)
        response.raise_for_status()
        outfile = os.path.join(self.outdir, filename)
        with open(outfile, 'wb') as f:
            f.write(response.content)

SLIDES 85-88

Multiprocessing Results - 16 processes

One request: 0.00032 seconds
One billion requests: 320,000 seconds = 88.88 hours = 3.7 days

SLIDE 89

Things to keep in mind

Speed improvements come from truly running in parallel. Debugging is much harder: non-deterministic, and pdb doesn't work out of the box. IPC overhead between processes is higher than between threads. There's a tradeoff between downloading entirely in parallel vs. in parallel chunks.

SLIDES 90-112

Asyncio

Create an asyncio.Task for each file; this immediately starts the download. Move on to the next page and keep creating tasks. Meanwhile, tasks from the first page finish downloading their files.

All in a single process. All in a single thread. Switch tasks when waiting for IO. Should keep the CPU busy.

SLIDES 113-117

import asyncio
from aiohttp import ClientSession
import uvloop

async def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    get_url = f'{hostname}/get'
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    task_queue = asyncio.Queue(MAX_SIZE)
    asyncio.create_task(results_worker(task_queue))
    async with ClientSession() as session:
        async for filename in iter_all_files(session, list_url):
            remote_url = f'{get_url}/{filename}'
            task = asyncio.create_task(
                download_file(session, semaphore, remote_url,
                              os.path.join(outdir, filename)))
            await task_queue.put(task)

SLIDES 118-119

async def iter_all_files(session, list_url):
    async with session.get(list_url) as response:
        if response.status != 200:
            raise RuntimeError(f'Bad status code: {response.status}')
        content = json.loads(await response.read())
    while True:
        for filename in content['FileNames']:
            yield filename
        if 'NextMarker' not in content:
            return
        next_page_url = f'{list_url}?next-marker={content["NextMarker"]}'
        async with session.get(next_page_url) as response:
            if response.status != 200:
                raise RuntimeError(f'Bad status code: {response.status}')
            content = json.loads(await response.read())

SLIDES 120-121

async def download_file(session, semaphore, remote_url, local_filename):
    async with semaphore:
        async with session.get(remote_url) as response:
            contents = await response.read()
    # Sync version.
    with open(local_filename, 'wb') as f:
        f.write(contents)
    return local_filename
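The "# Sync version." comment flags that the file write blocks the event loop. One option (a sketch, not the talk's code) is to push the blocking write onto the default thread pool with run_in_executor:

```python
import asyncio
import os
import tempfile

async def write_file(path, contents):
    # Run the blocking file write in the default thread pool executor so
    # the event loop can keep switching between in-flight downloads.
    loop = asyncio.get_running_loop()

    def _write():
        with open(path, 'wb') as f:
            f.write(contents)

    await loop.run_in_executor(None, _write)

# Demonstration: write 100 bytes (the case study's file size).
path = os.path.join(tempfile.gettempdir(), 'aio-write-demo.bin')
asyncio.run(write_file(path, b'x' * 100))
```

For 100-byte files the plain synchronous write may well be cheaper than the thread-pool hop; this matters more as file sizes grow.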

SLIDES 122-124

Asyncio Results

One request: 0.00056 seconds
One billion requests: 560,000 seconds = 155.55 hours = 6.48 days

SLIDE 125

Summary

Approach       Single Request Time (s)   Days
Synchronous    0.003                     34.7
Multithread    0.0036                    41.6
Multiprocess   0.00032                   3.7
Asyncio        0.00056                   6.5

SLIDES 126-127

Asyncio and Multiprocessing and Multithreading

SLIDES 128-143

Each worker process (WorkerProcess-1, WorkerProcess-2, ...) runs two threads: Thread-1 is the main thread, and Thread-2 runs an asyncio event loop, with an async Queue between them. The main process feeds the workers through an Input Queue and collects from an Output Queue.

The Input/Output queues contain pagination tokens. The main thread of a worker process is a bridge to the event loop running on the separate thread: it sends a pagination token (e.g. "foo") to the async Queue. The event loop makes the List call with the provided pagination token and gets back {"FileNames": [...], "NextMarker": "bar"}. The next pagination token, "bar", eventually makes its way back to the main process. While another worker process goes through the same steps, WorkerProcess-1 is downloading its page of 1000 files using asyncio.

1. We get to leverage all our cores.
2. We download individual files efficiently with asyncio.
3. Minimal IPC overhead: only pagination tokens are passed across processes, one per thousand files.
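The worker-process bridge described above can be sketched with asyncio.run_coroutine_threadsafe (all names here are illustrative; the talk's real code is more involved):

```python
import asyncio
import threading

def start_background_loop():
    # Run an event loop on a dedicated thread and hand the loop back.
    loop = asyncio.new_event_loop()
    t = threading.Thread(target=loop.run_forever, daemon=True)
    t.start()
    return loop

async def process_page(token):
    # Stand-in for: call /list with `token`, download that page's files,
    # and return the NextMarker for the main process.
    await asyncio.sleep(0)
    return f'next-after-{token}'

loop = start_background_loop()
# The worker's main thread bridges to the loop thread: submit the
# pagination token and block until the next token comes back.
future = asyncio.run_coroutine_threadsafe(process_page('foo'), loop)
next_token = future.result(timeout=5)
loop.call_soon_threadsafe(loop.stop)
```

run_coroutine_threadsafe is the standard way to hand work from a plain thread to an event loop running elsewhere; the returned concurrent.futures.Future lets the main thread block without touching the loop.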

SLIDES 144-147

Combo Results

One request: 0.0000303 seconds
One billion requests: 30,300 seconds = 8.42 hours

SLIDE 148

Summary

Approach       Single Request Time (s)   Days
Synchronous    0.003                     34.7
Multithread    0.0036                    41.6
Multiprocess   0.00032                   3.7
Asyncio        0.00056                   6.5
Combo          0.0000303                 0.35

SLIDE 149

Lessons Learned

There is a tradeoff between simplicity and speed. The approach used makes multiple orders of magnitude of difference. You need to have max bounds when using queueing or any task scheduling.
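The max-bounds lesson in action: a bounded queue.Queue gives backpressure for free, because put() blocks once the queue is full, instead of buffering up to a billion filenames in memory (the sizes here are illustrative):

```python
import queue
import threading

# maxsize=10: the producer (e.g. the List Files pager) blocks on put()
# once ten items are waiting, rather than growing without bound.
work_queue = queue.Queue(maxsize=10)
consumed = []

def consumer():
    while True:
        item = work_queue.get()
        if item is None:  # None as a stop sentinel
            return
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(100):
    work_queue.put(i)  # blocks whenever the consumer falls behind
work_queue.put(None)
t.join()
```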

SLIDE 150

Thanks!

James Saryerwinnie  @jsaryer