SLIDE 1
James Saryerwinnie
A case study in multi-threading, multi-processing, and asyncio
Downloading a Billion Files in Python
@jsaryer
SLIDE 2
Our Task
SLIDE 5
Our Task
There is a remote server that stores files.
The files can be accessed through a REST API.
Our task is to download all the files on the remote server to our client machine.
SLIDE 6
Our Task (the details)
SLIDE 14
Our Task (the details)
What client machine will this run on?
We have one machine we can use, 16 cores, 64GB memory.
What about the network between the client and server?
Our client machine is on the same network as the service with remote files.
How many files are on the remote server?
Approximately one billion files, 100 bytes per file.
When do you need this done?
Please have this done as soon as possible.
SLIDE 15
File Server REST API
SLIDE 20
File Server REST API
GET /list
{"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}
GET /list?next-marker={token}
{"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}
GET /get/{filename}
(File blob content)
SLIDE 22
Caveats
This is a simplified case study. The results shown here don't necessarily generalize.
Not an apples-to-apples comparison; each approach does things slightly differently.
Always profile and test for yourself.
Sometimes concrete examples can be helpful.
SLIDE 23 Synchronous Version
Simplest thing that could possibly work.
SLIDE 24
Synchronous
(animation: pages of files are listed and downloaded one at a time)
SLIDE 39
import json
import os

import requests

def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    get_url = f'{hostname}/get'
    response = requests.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        for filename in content['FileNames']:
            remote_url = f'{get_url}/{filename}'
            download_file(remote_url,
                          os.path.join(outdir, filename))
        if 'NextMarker' not in content:
            break
        response = requests.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)
SLIDE 45
def download_file(remote_url, local_filename):
    response = requests.get(remote_url)
    response.raise_for_status()
    with open(local_filename, 'wb') as f:
        f.write(response.content)
SLIDE 50
Synchronous Results
One request: 0.003 seconds
One billion requests: 3,000,000 seconds = 833.3 hours = 34.7 days
SLIDE 51
Multithreading
SLIDE 58 Multithreading
List Files can't be parallelized, but Get File can be parallelized.
One thread calls List Files and puts the filenames on a queue.Queue.
Worker threads (WorkerThread-1, WorkerThread-2, WorkerThread-3) pull filenames off the queue and download the files.
A result thread reads from a results queue; it prints progress and tracks overall results, failures, etc.
SLIDE 59
def download_files(host, port, outdir, num_threads):
    # ... same constants as before ...
    work_queue = queue.Queue(MAX_SIZE)
    result_queue = queue.Queue(MAX_SIZE)
    threads = []
    for i in range(num_threads):
        t = threading.Thread(
            target=worker_thread,
            args=(work_queue, result_queue))
        t.start()
        threads.append(t)
    result_thread = threading.Thread(
        target=result_poller,
        args=(result_queue,))
    result_thread.start()
    threads.append(result_thread)
    # ...
SLIDE 61
response = requests.get(list_url)
response.raise_for_status()
content = json.loads(response.content)
while True:
    for filename in content['FileNames']:
        remote_url = f'{get_url}/{filename}'
        outfile = os.path.join(outdir, filename)
        work_queue.put((remote_url, outfile))
    if 'NextMarker' not in content:
        break
    response = requests.get(
        f'{list_url}?next-marker={content["NextMarker"]}')
    response.raise_for_status()
    content = json.loads(response.content)
SLIDE 63
def worker_thread(work_queue, result_queue):
    while True:
        work = work_queue.get()
        if work is _SHUTDOWN:
            return
        remote_url, outfile = work
        download_file(remote_url, outfile)
        result_queue.put(_SUCCESS)
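The `_SHUTDOWN` sentinel above is what lets workers exit cleanly. A minimal runnable sketch of that pattern, with a doubling function as a stand-in for the real `download_file`:

```python
import queue
import threading

_SHUTDOWN = object()  # unique sentinel telling a worker to exit

def worker(work_queue, results):
    while True:
        item = work_queue.get()
        if item is _SHUTDOWN:
            return
        results.append(item * 2)  # stand-in for download_file()

work_queue = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(work_queue, results))
           for _ in range(4)]
for t in threads:
    t.start()
for i in range(10):
    work_queue.put(i)
for _ in threads:
    work_queue.put(_SHUTDOWN)  # one sentinel per worker thread
for t in threads:
    t.join()
print(sorted(results))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because the queue is FIFO, all real work is consumed before any worker sees a sentinel, and each worker consumes exactly one sentinel.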
SLIDE 67
Multithreaded Results - 10 threads
One request: 0.0036 seconds
One billion requests: 3,600,000 seconds = 1000.0 hours = 41.6 days
SLIDE 70
Multithreaded Results - 100 threads
One request: 0.0042 seconds
One billion requests: 4,200,000 seconds = 1166.67 hours = 48.6 days
SLIDE 71
Why?
Not necessarily IO bound, due to low latency and small file size.
GIL contention; overhead of passing data through queues.
SLIDE 72
Things to keep in mind
The real code is more complicated: ctrl-c handling, graceful shutdown, etc.
Debugging is much harder; failures are non-deterministic.
The more you stray from stdlib abstractions, the more likely you are to encounter race conditions.
Can't use concurrent.futures map() because of the large number of files.
SLIDE 73
Multiprocessing
SLIDE 75 Multiprocessing
Download one page at a time, in parallel across multiple processes (WorkerProcess-1, WorkerProcess-2, WorkerProcess-3).
SLIDE 79
from concurrent import futures

def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    all_pages = iter_all_pages(list_url)
    downloader = Downloader(host, port, outdir)
    with futures.ProcessPoolExecutor() as executor:
        for page in all_pages:
            # Start parallel downloads
            future_to_filename = {}
            for filename in page:
                future = executor.submit(downloader.download, filename)
                future_to_filename[future] = filename
            # Wait for downloads to finish
            for future in futures.as_completed(future_to_filename):
                future.result()
SLIDE 83
def iter_all_pages(list_url):
    session = requests.Session()
    response = session.get(list_url)
    response.raise_for_status()
    content = json.loads(response.content)
    while True:
        yield content['FileNames']
        if 'NextMarker' not in content:
            break
        response = session.get(
            f'{list_url}?next-marker={content["NextMarker"]}')
        response.raise_for_status()
        content = json.loads(response.content)
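The pagination loop can be exercised against a toy in-memory "server"; `PAGES` and `fake_list` are hypothetical stand-ins for the real /list endpoint, mimicking its FileNames/NextMarker shape:

```python
# Toy pagination: each page names its successor via NextMarker,
# and the last page simply omits the key.
PAGES = {
    None: {'FileNames': ['file1', 'file2'], 'NextMarker': 'p2'},
    'p2': {'FileNames': ['file3', 'file4'], 'NextMarker': 'p3'},
    'p3': {'FileNames': ['file5']},  # last page: no NextMarker
}

def fake_list(marker=None):
    return PAGES[marker]

def iter_all_pages():
    content = fake_list()
    while True:
        yield content['FileNames']
        if 'NextMarker' not in content:
            break
        content = fake_list(content['NextMarker'])

print([name for page in iter_all_pages() for name in page])
# ['file1', 'file2', 'file3', 'file4', 'file5']
```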
SLIDE 84
class Downloader:
    # ...
    def download(self, filename):
        remote_url = f'{self.get_url}/{filename}'
        response = self.session.get(remote_url)
        response.raise_for_status()
        outfile = os.path.join(self.outdir, filename)
        with open(outfile, 'wb') as f:
            f.write(response.content)
SLIDE 88
Multiprocessing Results - 16 processes
One request: 0.00032 seconds
One billion requests: 320,000 seconds = 88.88 hours = 3.7 days
SLIDE 89
Things to keep in mind
Speed improvements due to truly running in parallel.
Debugging is much harder: non-deterministic, and pdb doesn't work out of the box.
IPC overhead between processes is higher than between threads.
Tradeoff between running entirely in parallel vs. in parallel chunks.
SLIDE 90
Asyncio
SLIDE 112 Asyncio
Create an asyncio.Task for each file. This immediately starts the download.
Move on to the next page and start creating tasks.
Meanwhile, tasks from the first page will finish downloading their files.
All in a single process, all in a single thread; switch tasks when waiting for IO. Should keep the CPU busy.
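The task-per-file idea can be sketched in miniature. Here `fetch` is a hypothetical stand-in for the real download coroutine, with a Semaphore bounding how many "downloads" are in flight at once, as in the download_files code that follows:

```python
import asyncio

async def fetch(sem, i):
    async with sem:              # bound in-flight "downloads"
        await asyncio.sleep(0)   # stand-in for the real network call
        return i * 10

async def main():
    sem = asyncio.Semaphore(3)
    # Each create_task() schedules its download immediately;
    # the loop keeps creating tasks while earlier ones run.
    tasks = [asyncio.create_task(fetch(sem, i)) for i in range(5)]
    return await asyncio.gather(*tasks)

print(asyncio.run(main()))  # [0, 10, 20, 30, 40]
```

asyncio.gather returns results in submission order, so progress tracking stays simple even though completion order may vary.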
SLIDE 113
import asyncio
import os

from aiohttp import ClientSession
import uvloop

async def download_files(host, port, outdir):
    hostname = f'http://{host}:{port}'
    list_url = f'{hostname}/list'
    get_url = f'{hostname}/get'
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    task_queue = asyncio.Queue(MAX_SIZE)
    asyncio.create_task(results_worker(task_queue))
    async with ClientSession() as session:
        async for filename in iter_all_files(session, list_url):
            remote_url = f'{get_url}/{filename}'
            task = asyncio.create_task(
                download_file(session, semaphore, remote_url,
                              os.path.join(outdir, filename))
            )
            await task_queue.put(task)
SLIDE 118
async def iter_all_files(session, list_url):
    async with session.get(list_url) as response:
        if response.status != 200:
            raise RuntimeError(f"Bad status code: {response.status}")
        content = json.loads(await response.read())
    while True:
        for filename in content['FileNames']:
            yield filename
        if 'NextMarker' not in content:
            return
        next_page_url = f'{list_url}?next-marker={content["NextMarker"]}'
        async with session.get(next_page_url) as response:
            if response.status != 200:
                raise RuntimeError(f"Bad status code: {response.status}")
            content = json.loads(await response.read())
SLIDE 120
async def download_file(session, semaphore, remote_url, local_filename):
    async with semaphore:
        async with session.get(remote_url) as response:
            contents = await response.read()
        # Sync version.
        with open(local_filename, 'wb') as f:
            f.write(contents)
        return local_filename
SLIDE 124
Asyncio Results
One request: 0.00056 seconds
One billion requests: 560,000 seconds = 155.55 hours = 6.48 days
SLIDE 125
Summary

Approach       Single Request Time (s)   Days
Synchronous    0.003                     34.7
Multithread    0.0036                    41.6
Multiprocess   0.00032                   3.7
Asyncio        0.00056                   6.5
SLIDE 126
Asyncio and Multiprocessing
SLIDE 127
Asyncio and Multiprocessing and Multithreading
SLIDE 130
(diagram: WorkerProcess-1 runs an event loop on Thread-2; the main Thread-1 communicates with it through a Queue)
SLIDE 132
(diagram: the main process has Input/Output queues connected to WorkerProcess-1 and WorkerProcess-2; each worker runs an event loop on Thread-2, with Thread-1 bridging to it through a queue)
The Input/Output queues contain pagination tokens, e.g. "foo".
SLIDE 134
The main thread of the worker process is a bridge to the event loop running on a separate thread. It sends the pagination token to the async Queue.
SLIDE 136
The event loop makes the List call with the provided pagination token "foo". The server responds:
{"FileNames": [...], "NextMarker": "bar"}
SLIDE 137
The next pagination token, "bar", eventually makes its way back to the main process.
SLIDE 139
While another process goes through the same steps, WorkerProcess-1 is downloading 1000 files using asyncio.
SLIDE 143
1. We get to leverage all our cores.
2. We download individual files efficiently with asyncio.
3. Minimal IPC overhead: only pagination tokens are passed across processes.
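The "main thread bridges to an event loop on another thread" design can be sketched with asyncio.run_coroutine_threadsafe. Here `handle_token` is a hypothetical stand-in for the worker's async List/Get work:

```python
import asyncio
import threading

# The event loop runs on a background thread; the main thread is the bridge.
loop = asyncio.new_event_loop()
t = threading.Thread(target=loop.run_forever, daemon=True)
t.start()

async def handle_token(token):
    await asyncio.sleep(0)  # stand-in for the async List call
    return token + '-done'

# Submit work from the main thread to the loop thread and wait on a
# concurrent.futures.Future for the result.
fut = asyncio.run_coroutine_threadsafe(handle_token('foo'), loop)
result = fut.result(timeout=5)
print(result)  # foo-done

loop.call_soon_threadsafe(loop.stop)
t.join()
```

run_coroutine_threadsafe is the only safe way to schedule a coroutine onto a loop from another thread; calling loop methods directly from the main thread would race with the loop thread.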
SLIDE 147
Combo Results
One request: 0.0000303 seconds
One billion requests: 30,300 seconds = 8.42 hours
SLIDE 148
Summary

Approach       Single Request Time (s)   Days
Synchronous    0.003                     34.7
Multithread    0.0036                    41.6
Multiprocess   0.00032                   3.7
Asyncio        0.00056                   6.5
Combo          0.0000303                 0.35
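The Days column follows directly from the single-request time; a quick sanity check of the table's arithmetic:

```python
# days = per-request seconds x 1e9 requests / 86,400 seconds per day
timings = {
    'Synchronous': 0.003,
    'Multithread': 0.0036,
    'Multiprocess': 0.00032,
    'Asyncio': 0.00056,
    'Combo': 0.0000303,
}
days = {name: t * 1_000_000_000 / 86_400 for name, t in timings.items()}
for name, d in days.items():
    print(f'{name}: {d:.2f} days')
```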
SLIDE 149
Lessons Learned
Tradeoff between simplicity and speed.
Multiple orders of magnitude difference based on the approach used.
Need to have max bounds when using queueing or any task scheduling.
SLIDE 150
Thanks!
James Saryerwinnie
@jsaryer