Parallelism Made Simple: A Practical Deep Dive into Python's concurrent.futures
Learn how to run work in parallel with Python's concurrent.futures — when to reach for threads vs. processes, how to use ThreadPoolExecutor and ProcessPoolExecutor, and how to handle results, exceptions, and timeouts cleanly.
Sooner or later every Python program hits a wall where doing one thing at a time is simply too slow. Maybe you're downloading a hundred URLs, resizing a folder full of images, or crunching a batch of CPU-heavy calculations. The obvious question is: can't we just do several of these at once? The answer is yes, and the cleanest way to do it in the standard library is concurrent.futures.
The concurrent.futures module gives you a single, high-level API for running work in parallel — whether that work is best served by threads or by separate processes. You don't have to manually create thread objects, manage a queue, or join workers by hand. You submit callables, you get futures back, and you collect results when they're ready. This article walks through the whole model, when to choose threads versus processes, and the pitfalls that trip people up.
Threads vs. processes: the one decision that matters
Before writing any code, you need to understand the single most important design choice: are you I/O-bound or CPU-bound?
I/O-bound work spends most of its time waiting — for a network response, a disk read, a database query. The CPU sits idle during the wait. Here, threads shine. While one thread waits on a socket, another can run. Python's Global Interpreter Lock (GIL) prevents two threads from executing Python bytecode at the exact same instant, but that barely matters when your threads are mostly waiting rather than computing.
CPU-bound work keeps the processor busy: hashing, parsing, number crunching, image processing. Here threads rarely help for pure-Python code on a standard CPython build, because the GIL serializes the actual computation. You need separate processes, each with its own Python interpreter and its own GIL, so they can run truly in parallel across multiple cores.
The beauty of concurrent.futures is that both cases share almost identical code. You pick ThreadPoolExecutor or ProcessPoolExecutor and the rest looks the same.
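To make that symmetry concrete, here is a minimal sketch in which the executor class is just a parameter. The helper run_all and the toy task slow_square are hypothetical names chosen for illustration; they are not part of the library.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def slow_square(n):
    """Toy task (hypothetical): a short sleep standing in for real work."""
    time.sleep(0.01)
    return n * n


def run_all(executor_cls, func, items, max_workers=4):
    """Run func over items with whichever executor class is passed in."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


if __name__ == "__main__":
    # Same call, different executor: threads first, then processes.
    # The guard matters for process pools (see the ProcessPoolExecutor section).
    print(run_all(ThreadPoolExecutor, slow_square, range(5)))
    print(run_all(ProcessPoolExecutor, slow_square, range(5)))
```

Both calls should print the same list of squares; only the kind of worker behind the pool changes.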
ThreadPoolExecutor: the workhorse for I/O
An executor is a pool of workers you hand tasks to. The simplest way to use it is map(), which behaves like the built-in map but runs the calls concurrently and returns results in the original input order.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(n):
    time.sleep(0.05)  # pretend this is a network call
    return n * n

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch, range(6)))

print(results)  # [0, 1, 4, 9, 16, 25]
Run serially, six tasks that each sleep 50 ms would take about 300 ms. With four workers they overlap: four run in the first batch and two in the second, so the whole job finishes in roughly 100 ms. Using the executor as a context manager (with ... as executor) is idiomatic: it automatically waits for all pending work to finish and cleans up the pool when the block exits.
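If you want to check that arithmetic on your own machine, a rough timing sketch along these lines should do. The timed helper is a hypothetical convenience, and the exact numbers will vary with system load.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(n):
    time.sleep(0.05)  # stand-in for a network call
    return n * n


def timed(label, run):
    """Hypothetical helper: call a zero-argument callable and report elapsed time."""
    start = time.perf_counter()
    result = run()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


serial_results, serial_time = timed("serial", lambda: [fetch(n) for n in range(6)])

with ThreadPoolExecutor(max_workers=4) as executor:
    threaded_results, threaded_time = timed(
        "threaded", lambda: list(executor.map(fetch, range(6)))
    )

# Same answers either way; only the wall-clock time differs.
assert serial_results == threaded_results
```

On an otherwise idle machine the serial run should land near 0.3 s and the threaded run near 0.1 s.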
submit() and futures
While map is convenient, it couples you to input order and a single function. For more control, use submit(), which schedules one callable and immediately returns a Future — a handle representing a result that may not exist yet. You call .result() to block until it's ready.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(n):
    time.sleep(0.01 * n)
    return n, n ** 2

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each future back to its input so it can be identified later.
    futures = {executor.submit(work, n): n for n in range(5)}
    for future in as_completed(futures):
        n, squared = future.result()
        print(f"{n} squared is {squared}")
The key helper here is as_completed(). Instead of waiting for every task, it yields each future the moment it finishes, in completion order rather than submission order. That means fast tasks get processed while slow ones are still running — ideal when you want to stream results or update a progress bar as work lands.
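A related helper in the same module is wait(), which blocks until a condition is met and then hands back two sets: finished and unfinished futures. The sketch below, with a made-up work function and delays, uses return_when=FIRST_COMPLETED to react as soon as any one task lands.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def work(n):
    time.sleep(0.05 * n)  # larger n means a slower task
    return n


with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(work, n) for n in (3, 1, 2)]
    # Block until at least one future has finished.
    done, not_done = wait(futures, return_when=FIRST_COMPLETED)
    first_results = sorted(f.result() for f in done)
    still_running = len(not_done)

print(first_results, still_running)
```

as_completed() is usually the better fit for streaming every result; wait() is handy when you only care about the first finisher or want to poll with a timeout.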
Handling exceptions without losing your mind
A common surprise: if a worker function raises an exception, the pool doesn't crash and the traceback doesn't print. The exception is stored inside the future and re-raised only when you call .result(). This is actually a feature — it lets you handle failures per task instead of letting one bad input kill the whole batch.
from concurrent.futures import ThreadPoolExecutor

def risky(n):
    if n == 3:
        raise ValueError(f"cannot process {n}")
    return n * 10

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(risky, n) for n in range(5)]
    for future in futures:
        try:
            # The worker's exception, if any, is re-raised here.
            print("ok:", future.result())
        except ValueError as exc:
            print("failed:", exc)
Because the exception surfaces at .result(), wrap that call — not the submit() — in your try/except. This pattern lets you collect successes and failures side by side and decide what to retry.
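One way to build those side-by-side collections is Future.exception(), which returns the stored exception (or None) instead of raising it. The following is a small sketch of that idea, reusing the risky function from above; the successes and failures dictionaries are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def risky(n):
    if n == 3:
        raise ValueError(f"cannot process {n}")
    return n * 10


successes, failures = {}, {}

with ThreadPoolExecutor() as executor:
    futures = {executor.submit(risky, n): n for n in range(5)}

# The with block has exited, so every future is finished by now.
for future, n in futures.items():
    exc = future.exception()  # the stored exception, or None on success
    if exc is None:
        successes[n] = future.result()
    else:
        failures[n] = exc

print(successes)  # {0: 0, 1: 10, 2: 20, 4: 40}
print(failures)   # {3: ValueError('cannot process 3')}
```

The keys of failures are exactly the inputs you might want to retry.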
Timeouts
Both .result() and as_completed() accept a timeout in seconds. If the result isn't ready in time, a concurrent.futures.TimeoutError is raised (since Python 3.11 this is an alias of the built-in TimeoutError), giving you an escape hatch for tasks that hang.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow():
    time.sleep(1)
    return "done"

with ThreadPoolExecutor() as executor:
    future = executor.submit(slow)
    try:
        print(future.result(timeout=0.1))
    except TimeoutError:
        print("gave up waiting")
One caveat worth knowing: a timeout stops you from waiting, but it does not forcibly kill the running task. The worker keeps going in the background until it completes on its own, and leaving the with block still waits for it. There is no clean, universal way to cancel a thread mid-flight in Python, so design long tasks to check for a stop signal if you need real cancellation.
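A stop signal can be as simple as a threading.Event that the task polls between small units of work. The sketch below assumes a hypothetical long_task that can be broken into steps; real workloads need their own natural checkpoints.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

stop_requested = threading.Event()


def long_task(max_steps=1000):
    """Hypothetical long job that polls a stop flag between small steps."""
    steps_done = 0
    for _ in range(max_steps):
        if stop_requested.is_set():
            break  # cooperative exit: the task itself decides to stop
        time.sleep(0.01)  # one small unit of work
        steps_done += 1
    return steps_done


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(long_task)
    try:
        future.result(timeout=0.1)
    except TimeoutError:
        stop_requested.set()  # ask the worker to wind down
    steps = future.result()  # returns soon after the flag is noticed

print(f"stopped after {steps} steps")
```

Left alone, the task would run for about ten seconds; with the flag it exits shortly after the timeout fires, and the pool shuts down promptly.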
ProcessPoolExecutor: real parallelism for CPU work
When the bottleneck is computation, swap in ProcessPoolExecutor. The API is nearly identical, but tasks run in a pool of separate worker processes that the operating system can schedule on different cores, sidestepping the GIL.
import math
from concurrent.futures import ProcessPoolExecutor

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, math.isqrt(n) + 1):
        if n % i == 0:
            return False
    return True

NUMBERS = [112272535095293, 112582705942171, 115280095190773]

if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        for number, prime in zip(NUMBERS, executor.map(is_prime, NUMBERS)):
            print(f"{number}: prime={prime}")
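Notice the if __name__ == "__main__": guard. Treat it as mandatory for process pools. Under the spawn start method, which is the default on Windows and macOS, each child process starts a fresh interpreter and re-imports your main module; without the guard, that import would reach the pool-creating code again and try to start more workers, which typically ends in a RuntimeError or a broken pool. Code that relies on the fork start method can appear to work without the guard, but it will fail as soon as it runs on a spawn-based platform, so always put the code that launches a ProcessPoolExecutor behind it.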
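A common way to keep this tidy is to put the pool inside a main() function and call that from the guard. The layout below is one possible sketch; cube and main are illustrative names.

```python
from concurrent.futures import ProcessPoolExecutor


def cube(n):
    """Worker: defined at module level so child processes can import it."""
    return n ** 3


def main():
    # Pool creation lives inside a function that only the parent calls.
    with ProcessPoolExecutor(max_workers=2) as executor:
        return list(executor.map(cube, range(5)))


if __name__ == "__main__":
    # Children that re-import this module see a different __name__,
    # so they define cube() but never reach this line.
    print(main())  # [0, 1, 8, 27, 64]
```

Everything at module level is safe to import repeatedly (definitions only), while everything with side effects sits behind the guard.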
The picklability constraint
Because arguments and return values have to travel between processes, they are serialized with pickle. That means the function you submit and the data you pass must be picklable. Functions are pickled by reference (module plus name), so module-level functions work fine; lambdas, locally defined closures, and open file handles do not. If you see a PicklingError or a "Can't pickle ..." message, this is almost always why: move the function to module level.
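You can test picklability up front, without starting a pool at all, by calling pickle.dumps yourself. The is_picklable helper below is a hypothetical convenience, and functools.partial over an importable function is shown as one picklable stand-in for a lambda.

```python
import functools
import operator
import pickle


def is_picklable(obj):
    """Hypothetical helper: report whether pickle can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def make_local_function():
    def local_double(n):  # nested function: cannot be found by name on import
        return n * 2
    return local_double


# Picklable replacement for `lambda n: 10 * n`.
times_ten = functools.partial(operator.mul, 10)

print(is_picklable(operator.mul))           # True
print(is_picklable(times_ten))              # True
print(is_picklable(lambda n: 10 * n))       # False
print(is_picklable(make_local_function()))  # False
```

Running a check like this on a function and a sample argument is a cheap way to debug a failing ProcessPoolExecutor call.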
Common pitfalls
Reaching for processes when you're I/O-bound. Spawning processes has real overhead — memory, startup cost, and the pickling round-trip. If your tasks are network calls, threads are lighter and usually faster. Match the executor to the workload.
Sharing mutable state across threads. Threads share memory, so two threads updating the same list or counter can corrupt it. Prefer returning values from your worker and combining them in the main thread, or protect shared state with a threading.Lock.
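Both options can be sketched in a few lines. The counter, the function names, and the iteration counts below are invented for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0
counter_lock = threading.Lock()


def add_with_lock(times):
    """Increment the shared counter; the lock makes each update atomic."""
    global counter
    for _ in range(times):
        with counter_lock:
            counter += 1


def count_locally(times):
    """Alternative: touch no shared state, just return a value."""
    return sum(1 for _ in range(times))


with ThreadPoolExecutor(max_workers=4) as executor:
    # Option 1: shared counter guarded by a lock.
    list(executor.map(add_with_lock, [10_000] * 4))
    # Option 2: workers return values; the main thread combines them.
    total = sum(executor.map(count_locally, [10_000] * 4))

print(counter, total)  # 40000 40000
```

The second option is usually easier to reason about, because no worker ever writes to memory that another worker reads.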
Oversizing the pool. More workers is not always faster. For CPU work, a good default is the number of cores (which ProcessPoolExecutor uses automatically). For I/O work you can go higher, but past a point you'll just add context-switching and connection-limit pain. Measure before you tune.
Forgetting that map preserves order but hides errors until iteration. With executor.map, an exception in any task is re-raised when you iterate to that result, and the iterator stops there, so the results that would have come after it are never delivered. If you need per-task error isolation, use submit plus as_completed instead.
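The difference between the two approaches is easy to see with a function that fails on one input. This sketch uses a made-up risky function that rejects the value 2.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def risky(n):
    if n == 2:
        raise ValueError(f"cannot process {n}")
    return n * 10


with ThreadPoolExecutor(max_workers=2) as executor:
    # map: results arrive in order until the failing item, then iteration ends.
    from_map = []
    try:
        for value in executor.map(risky, range(5)):
            from_map.append(value)
    except ValueError as exc:
        map_error = str(exc)

    # submit + as_completed: each task succeeds or fails on its own.
    futures = {executor.submit(risky, n): n for n in range(5)}
    from_submit, failed_inputs = [], []
    for future in as_completed(futures):
        try:
            from_submit.append(future.result())
        except ValueError:
            failed_inputs.append(futures[future])

print(from_map, map_error)                 # [0, 10] cannot process 2
print(sorted(from_submit), failed_inputs)  # [0, 10, 30, 40] [2]
```

With map, the results for inputs 3 and 4 are lost to the caller; with submit, only the failing input is affected.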
Wrap-up and next steps
concurrent.futures gives you a remarkably small surface area for a genuinely powerful capability. Remember the core recipe: choose ThreadPoolExecutor for I/O-bound work and ProcessPoolExecutor for CPU-bound work; use map when you want ordered results from one function and submit + as_completed when you want streaming results and per-task error handling; and always retrieve results through .result() so exceptions surface where you can handle them.
From here, a few natural next steps: profile a real workload to confirm whether you're actually I/O- or CPU-bound before parallelizing; explore Executor.shutdown(wait=False, cancel_futures=True) for early exits (cancel_futures requires Python 3.9 or later and cancels only tasks that have not started yet); and if your work is I/O-bound and you're building something from scratch, compare this thread-based approach with asyncio to see which model fits your codebase better. Start small: wrap one slow loop in a ThreadPoolExecutor and measure the difference. The speedups are often immediate and satisfying.
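As a starting point for the shutdown experiment, here is a small sketch, assuming Python 3.9 or later. The job function and the first_task_started event are illustrative; the event only exists to make sure one job is running before the pool is shut down.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

first_task_started = threading.Event()


def job(n):
    first_task_started.set()
    time.sleep(0.05)
    return n


executor = ThreadPoolExecutor(max_workers=1)
futures = [executor.submit(job, n) for n in range(5)]

first_task_started.wait()  # make sure one job is actually running
# Drop the queued jobs and return without waiting for the running one.
executor.shutdown(wait=False, cancel_futures=True)

cancelled = [f.cancelled() for f in futures]
print(cancelled)            # [False, True, True, True, True]
print(futures[0].result())  # 0 (the running job still finishes)
```

Queued work is discarded, but the job that had already started runs to completion, which matches the earlier caveat about running tasks not being killable.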