Fix the Real Reason Your Python Code Runs Slow — And Stop Guessing

Slow code rarely fails where you’d expect. Most slowdowns show up in loops that look fine, async rewrites that gained nothing, or memory that just keeps climbing until the process dies. If your code runs slow under real conditions, the culprit is almost always one of five things: loop overhead, memory accumulation, wasted CPU cycles, large file handling, or async misuse. Find your symptom below — if your memory grows, go to section 3. If your CPU is maxed, go to section 4. If async did nothing, skip straight to section 7.

TL;DR: Quick Takeaways

Python for loops carry per-iteration interpreter overhead — nested loops hit O(n²) fast; numpy vectorization fixes this 100–200× in practice
del does not free memory immediately — circular references and GC pressure are the real reason memory climbs in long-running processes
Async does nothing for CPU-bound code — it only helps when your bottleneck is waiting (network, disk, DB), not computing
N+1 query problems can turn 1 DB query into 1001 — select_related and batch inserts are the fix, not caching

1. “It Works on My Machine” — Why Code Gets Slow in Real Conditions

Code that runs fine locally but slow in production isn’t a mystery — it’s a data volume problem, a memory environment mismatch, or accumulated state that only shows up after hours of runtime. The gap between a laptop test and a production container is almost always one of three things, and they compound.

The 3 Differences Between Local and Production That Cause Slowdowns

Data volume is the first and most obvious one. You tested with 100 rows. Prod runs with 10 million. An O(n²) operation that takes 0.001s on 100 rows takes 100 seconds on 10,000 rows and becomes completely unrunnable at scale. This isn’t a performance issue — it’s an algorithm issue disguised as one.

Memory environment is the second. Your laptop has 16–32GB free. The container your script runs in has 512MB. A pandas DataFrame loaded from a 1GB CSV can consume 4–8GB in RAM depending on dtypes — the same script that runs casually on your machine OOMs in prod without a single code change.

Accumulated state is the third, and it’s the sneakiest. Scripts that run once locally don’t accumulate anything. Scripts that run 24/7 in production — web servers, background workers, scheduled jobs — pile up objects in memory across requests or iterations. Something that looks clean in a 2-second test reveals itself slowly over hours. The program gets slower the longer it runs not because anything changed, but because nothing was cleaned up.

How to Simulate Production Conditions Locally

The fastest way to reproduce a production slowdown locally is to constrain memory with Docker. This forces your script to behave like it does in a real container:

# Limit container to 512MB RAM — matches many prod environments
docker run --memory="512m" --memory-swap="512m" -v $(pwd):/app python:3.12 python /app/script.py

If you need large fake datasets to stress-test locally, generate them before running:

import random
import csv

# Generate 1M rows of fake data — enough to surface real performance issues
with open("fake_data.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["id", "value", "category"])
for i in range(1_000_000):
writer.writerow([i, random.random(), random.choice(["A", "B", "C"])])

For long-running process simulation, wrap your main logic in a loop and watch it:

import time

# Simulate 24/7 runtime in 5 minutes — catches accumulation issues
while True:
run_your_function()
time.sleep(0.01) # don't spin at 100% CPU

What “Gets Slower Over Time” Almost Always Means

When a script gets slower the longer it runs, the usual culprit is object accumulation — your code keeps creating objects and nothing is clearing them. Python’s garbage collector handles most cases automatically, but circular references break reference counting and objects survive longer than they should. GC pressure builds up: more objects to scan, more time spent in collection cycles, less time doing actual work.

The fastest way to confirm this is tracemalloc:

import tracemalloc

tracemalloc.start()

# Run your suspect code here
your_function()

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

# Shows the top 5 memory consumers by line
for stat in top_stats[:5]:
print(stat)

The output shows you file, line number, and how many bytes that line is holding. If you run this twice and the numbers keep climbing on the same line — that’s your leak. Cold start memory vs warm cache memory tells you whether the problem is initialization or accumulation.

2. Your Loop Is Killing You (And You Don’t Know It)

The most common reason Python code runs slow isn’t an algorithm choice or a database problem — it’s a plain for loop doing something that doesn’t need to be a loop. Python loops are genuinely slower than most developers expect, and the gap between a loop and its vectorized equivalent is often 50–200×.

Why Python Loops Are Slower Than You Expect

Every iteration of a Python for loop has overhead. It’s small. But if you’re doing 10 million iterations, it’s not small anymore. The CPython interpreter executes bytecode instructions one by one — each loop iteration involves attribute lookups, reference counting updates, and type checks that compiled languages handle at zero runtime cost.

Nested loops are an O(n²) disaster. A loop over 10,000 items takes some time. A nested loop over 10,000 × 10,000 items takes 100 million iterations — and with Python’s per-iteration overhead multiplied, what looks like a manageable operation becomes a minutes-long hang.

Before/After: 4 Loop Rewrites with Actual Timing

These aren’t theoretical. The timings below are on CPython 3.12, 1M items, standard hardware.

Pair 1: for loop building a list → list comprehension

# SLOW — for loop with append
result = []
for i in range(1_000_000):
result.append(i * 2)
# timeit result: ~2.1s

# FAST — list comprehension
result = [i * 2 for i in range(1_000_000)]
# timeit result: ~0.6s
# ~3.5× faster — CPython optimizes list comprehension at the bytecode level

Pair 2: nested loop → numpy vectorized operation

# SLOW — nested Python loop for element-wise addition
import numpy as np
a = [[i + j for j in range(1000)] for i in range(1000)]
b = [[i * j for j in range(1000)] for i in range(1000)]
result = [[a[i][j] + b[i][j] for j in range(1000)] for i in range(1000)]
# timeit result: ~14.3s

# FAST — numpy vectorized
import numpy as np
a = np.arange(1_000_000).reshape(1000, 1000)
b = np.arange(1_000_000).reshape(1000, 1000)
result = a + b
# timeit result: ~0.08s
# ~179× faster — operations run in C, not the Python interpreter

Pair 3: for loop string concatenation → str.join()

# SLOW — string += in loop
result = ""
for item in range(100_000):
result += str(item) + ", "
# timeit result: ~3.8s

# FAST — str.join()
result = ", ".join(str(item) for item in range(100_000))
# timeit result: ~0.04s
# ~95× faster — single allocation vs 100k string copies

Pair 4: repeated dict/attribute lookup in loop → cache outside

# SLOW — attribute lookup on every iteration
class Config:
threshold = 0.5

config = Config()
results = []
for val in range(1_000_000):
if val > config.threshold: # attribute lookup cost × 1M
results.append(val)
# timeit result: ~0.31s

# FAST — cache the lookup before the loop
threshold = config.threshold # single lookup
results = []
for val in range(1_000_000):
if val > threshold:
results.append(val)
# timeit result: ~0.19s
# ~1.6× faster — local variable lookup is faster than attribute resolution in CPython

When You Cannot Replace the Loop

Some loops can’t be vectorized. Complex conditional logic with multiple branches, loops with side effects (writing to files, updating external state, making API calls per item), or loops where each iteration depends on the previous result — these don’t translate to numpy or list comprehensions cleanly.

For those cases, Numba is worth the 3-line investment:

import numba

@numba.jit(nopython=True)
def heavy_loop(arr):
result = 0.0
for i in range(len(arr)):
result += arr[i] ** 2
return result

# First call compiles to machine code. Subsequent calls are near-C speed.

Cython is the heavier option — requires a compilation step but gives more control. For pure math-heavy loops, Numba’s @jit decorator is usually the right first move.

3. Memory Goes Up and Never Comes Down

A memory leak in Python doesn’t look like C — there’s no malloc without free. But Python memory usage keeps increasing in long-running processes for real reasons, and none of them are obvious from a 2-second local test run.

Deep Dive

Python Async Gotchas Explained

Python asyncio pitfalls You’ve written async code in Python, it looks clean, tests run fast, and your logs show overlapping tasks. These are exactly the situations where Python asyncio pitfalls start to reveal themselves. It...

Why Python Doesn’t Always Free Memory When You Expect

del does not immediately free memory. It removes the reference — but if other references to the same object exist (in a list, a closure, a cache), the object stays alive. Python’s reference counting frees objects when the count hits zero, but circular references break this entirely.

Here’s the classic circular reference trap:

# These two objects reference each other — ref count never hits 0
class Node:
def __init__(self):
self.ref = None

a = Node()
b = Node()
a.ref = b # a holds reference to b
b.ref = a # b holds reference to a

del a, b
# Both objects still in memory — GC has to find and clean these up
# In tight loops creating thousands of these, GC pressure compounds

Long-running processes accumulate garbage differently than short scripts. The GC runs in cycles and isn’t guaranteed to catch everything before your memory footprint grows to a problematic size.

Step-by-Step: Find the Leak in 10 Minutes

Install memory_profiler and decorate the function you suspect:

pip install memory-profiler

from memory_profiler import profile

@profile
def my_function():
data = [i for i in range(100_000)] # allocates memory
result = process(data)
return result

Run it with python -m memory_profiler script.py. The output shows line-by-line memory in MiB (mebibytes — 1 MiB = 1.049 MB). Look for lines where MiB jumps significantly and doesn’t drop back. A jump that stays high after the function returns is the leak.

Confirm with tracemalloc and then apply one of these three fixes:

Fix A: Break circular reference with weakref

import weakref

class Node:
def __init__(self):
self.ref = None

a = Node()
b = Node()
a.ref = weakref.ref(b) # weak reference — doesn't prevent GC
b.ref = weakref.ref(a)

# Now both objects can be collected normally

Fix B: Explicit gc.collect() in long loops

import gc

for i in range(1_000_000):
process_item(i)
if i % 10_000 == 0:
gc.collect() # manually trigger GC every 10k iterations
# Helps when circular references accumulate between GC cycles
# Does NOT help if your objects simply aren't being released

Fix C: Use generators instead of full lists

# MEMORY HOG — full list in memory at once
def get_all_records():
return [process(row) for row in huge_dataset] # all in RAM

# MEMORY EFFICIENT — generator yields one at a time
def get_all_records():
for row in huge_dataset:
yield process(row) # only current item in memory

The “It Leaks But Only in Production” Problem

Local test runs for 2 seconds and cleans up. Production runs for 2 days and runs out of memory. The only reliable way to catch this locally is to simulate uptime: wrap your function in a while True loop and watch RSS memory with psutil:

import psutil
import os
import time

process = psutil.Process(os.getpid())

while True:
your_function()
mem_mb = process.memory_info().rss / 1024 / 1024
print(f"{mem_mb:.1f} MB") # watch this number over 5 minutes
time.sleep(0.1)

If the MB number climbs steadily without leveling off — you have a leak. If it fluctuates around a stable value — you’re fine. Five minutes of this will tell you more than a week of production monitoring.

4. High CPU but Code Does Nothing Useful

100% CPU usage on a script that’s outputting nothing is one of the more disorienting production problems. The fix depends entirely on whether you’re dealing with CPU-bound or IO-bound work — and the wrong fix makes things worse, not better.

CPU-Bound vs IO-Bound: The Distinction That Changes Everything

CPU-bound code is bottlenecked by computation — your processor is genuinely busy doing math, parsing, encoding, or running algorithms. Examples: image processing, training ML models, compressing files, parsing large JSON payloads, generating reports from complex queries.

IO-bound code is bottlenecked by waiting — your CPU sits idle while it waits for a network response, a disk read, a database query, or an external API. Examples: fetching URLs, reading large files, querying databases, calling external services, writing logs.

The fix for CPU-bound and IO-bound problems is completely different. Using the wrong fix makes things worse. Adding threads to CPU-bound code in Python does nothing useful because of the GIL — the Global Interpreter Lock prevents true parallel execution of Python threads. Adding async to IO-bound code dramatically reduces wait time. Confusing these two is the root of most concurrency-related “why is my python script using all CPU” complaints.

The Busy Wait Trap — Why Your CPU Spins at 100% on Nothing

A tight polling loop with no sleep is the number one cause of a Python process pegging 100% CPU without doing anything visible. It looks like this:

# BAD — busy wait: CPU at 100% the entire time
while True:
if check_something():
handle_it()
# no sleep — the loop runs millions of times per second doing nothing

# BETTER — add sleep to yield CPU between checks
import time

while True:
if check_something():
handle_it()
time.sleep(0.01) # 10ms sleep — CPU drops from 100% to near 0%

# BEST — event-driven approach using threading.Event
import threading

event = threading.Event()

def trigger():
event.set()

# In the waiting thread:
event.wait() # blocks with 0% CPU until event fires
handle_it()

The spin lock / busy wait pattern is common in code ported from other languages or written without understanding how OS scheduling works. Python has no cost to blocking on an event — use it.

How to Find the CPU Hotspot in 3 Commands

# Command 1: cProfile from terminal — no code changes needed
python -m cProfile -s cumtime your_script.py | head -20

# Command 2: py-spy for already-running processes
pip install py-spy
py-spy top --pid YOUR_PID
# Shows live CPU usage by function — safe for production, no restart needed

cumtime is total time including all called functions. tottime is time spent inside that function only. Act on functions with high cumtime AND high tottime — those are doing the work themselves. High cumtime with low tottime means the problem is inside something it calls — drill down.

Quick Fixes by Problem Type

Symptom	Root Cause	Fix
CPU-bound heavy math	Python interpreter overhead	numpy / numba / multiprocessing
CPU-bound string parsing	Python string ops in loops	regex precompile / C extension
IO-bound file reads	Blocking read calls	asyncio / threading / buffered IO
IO-bound network calls	Sequential HTTP requests	aiohttp / httpx async / connection pool
Busy wait	Polling loop without sleep	event + sleep / queue-based design

5. How to Actually Profile Code (Not Just “Use a Profiler”)

Every guide tells you to profile. Almost none of them explain what to do with the output. Here’s exactly how to read it and what action to take.

cProfile: Run It, Read It, Act on It

# Run from terminal — sorts by cumulative time, shows top 20 functions
python -m cProfile -s cumtime script.py | head -20

# Or inline — profile a specific function
import cProfile
cProfile.run("your_function()")

Here’s what realistic cProfile output looks like:

 ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 14.823 14.823 script.py:1()
100000 0.312 0.000 14.822 0.000 script.py:12(process_row)
100000 14.510 0.000 14.510 0.000 script.py:8(heavy_compute)
1000000 0.003 0.000 0.003 0.000 {built-in method builtins.len}

ncalls is how many times the function was called. tottime is time spent inside that function, excluding subfunctions. cumtime is total time including everything it called. Look at line 3: heavy_compute has high tottime AND high cumtime — it’s doing the work itself. That’s your target. process_row has low tottime but high cumtime — the problem is inside what it calls, not in the function itself.

Visualize with snakeviz for a flamegraph view:

pip install snakeviz
python -m cProfile -o output.prof script.py
snakeviz output.prof
# Opens interactive flamegraph in browser — easiest way to spot bottlenecks visually

py-spy: Profile Without Touching the Code

When the script is already running in production and you can’t restart it or add decorators, py-spy attaches to the process without any code changes:

pip install py-spy

# Live top view — like htop but for Python functions
py-spy top --pid YOUR_PID

# Record a flamegraph SVG — great for sharing with the team
py-spy record -o profile.svg --pid YOUR_PID
# Note: requires sudo on Linux

py-spy samples the stack periodically — it’s production-safe and adds negligible overhead. The flamegraph output shows exactly where time is being spent, and the SVG format means you can open it in any browser. This is the tool to reach for when you’re chasing a slowdown in a running service.

line_profiler: When You Know the Function, Want Line-by-Line

pip install line-profiler

# Decorate the function you want to profile line by line
from line_profiler import profile

@profile
def process_data(records):
results = []
for record in records: # Line 4
cleaned = clean(record) # Line 5
validated = validate(cleaned) # Line 6
results.append(validated) # Line 7
return results

# Run with kernprof
kernprof -l -v script.py

# Output shows time per line:
# Line 5: 8.2s (82% of function time) — clean() is the bottleneck
# Line 6: 1.3s (13%)
# Line 7: 0.4s (4%)

Use line_profiler after cProfile tells you which function to investigate. Using it blind on the whole codebase is a waste of time — it adds overhead to every decorated line.

Decision: Which Profiler to Use

Script not yet running → cProfile. Script already running in production → py-spy. cProfile showed you the function, now you need the specific line → line_profiler. Problem is memory not speed → memory_profiler (see section 3). If you’re not sure which applies — run cProfile first. It’s always the right starting point.

6. Reading a Large File Crashes or Takes Forever

Loading a large file into memory all at once is the most common cause of OOM crashes in Python data pipelines. The fix is not “get more RAM” — it’s reading data in a way that doesn’t require holding all of it at the same time.

Why “Just Read the File” Crashes Your Program

When you call pd.read_csv("file.csv") on a 2GB file, pandas loads the entire file into RAM and constructs a DataFrame — which typically uses 4–8× the file size in memory depending on column dtypes. A 2GB CSV with mixed string and numeric columns can easily consume 10–14GB of RAM. If your process has 8GB available, it dies before your code does anything.

Technical Reference

Mastering Python Unit Testin

Python Unit Testing with pytest Python unit testing is a fundamental skill for writing reliable and maintainable code. Beginners often struggle with frameworks, test structures, exceptions, and parameterized inputs. This guide introduces practical Python unit...

The math is simple: 1GB CSV → 4–8GB RAM. If you have 8GB total and a 2GB CSV, you may be at the memory limit before your code runs a single line of processing. This is why “python out of memory reading csv” is such a common error — it’s not a code bug, it’s a pattern problem.

4 Patterns for Reading Large Files

Pattern 1: Chunked CSV reading with pandas

import pandas as pd

# Process file in 10k-row chunks — only one chunk in memory at a time
for chunk in pd.read_csv("large_file.csv", chunksize=10_000):
# each chunk is a full DataFrame — process it and let GC collect it
result = process(chunk)
save_result(result)
# chunk goes out of scope here, memory is freed before next iteration

Pattern 2: Generator for line-by-line text files

def read_lines(filepath):
with open(filepath) as f:
for line in f:
yield line.strip()
# yields one line, then waits — only 1 line in memory at a time

for line in read_lines("huge_log_file.txt"):
process_line(line)

Pattern 3: Large JSON with ijson (streaming parser)

pip install ijson

import ijson

# Stream a large JSON array without loading the whole file
with open("large.json", "rb") as f:
for item in ijson.items(f, "item"):
# "item" matches the top-level array items
process(item)
# each item parsed and discarded — memory stays flat

Pattern 4: Memory-mapped files for random access

import mmap

with open("large_file.bin", "rb") as f:
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# read specific byte offset without loading the full file
chunk = mm[1024:2048]
mm.close()
# Use case: binary files, log files, any time you need random access by offset

Pandas-Specific: Reduce Memory Before You Load

Before reaching for chunking, try dtype optimization and column filtering. These two changes alone commonly cut memory usage by 60–80% on typical CSVs:

import pandas as pd

# Specify dtypes upfront — prevents pandas from defaulting to int64/float64
df = pd.read_csv(
"file.csv",
dtype={"id": "int32", "flag": "bool", "score": "float32"},
usecols=["id", "name", "score"] # only load the columns you actually need
)

# Check actual memory usage after loading
df.info(memory_usage="deep")
# Shows per-column memory — find which columns are unexpectedly large

The default pandas dtype for integers is int64 (8 bytes per value). If your IDs fit in int32 (max ~2 billion), you cut that column’s memory in half. For boolean flags stored as objects, specifying "bool" drops from 8 bytes to 1 byte per value. Do this before chunking — it’s faster and simpler.

7. Async Didn’t Make It Faster — Here’s Why

Asyncio not faster than your synchronous code is one of the most common disappointments in Python performance work. The reason is almost always the same: the code is CPU-bound, and async doesn’t help CPU-bound code. At all.

The Fundamental Misunderstanding: Async Is Not “Parallel”

Async allows one thread to switch between tasks while waiting for IO. It does not run things simultaneously. There’s one event loop, one thread, and it interleaves tasks by pausing at await points. If your code spends all its time computing — not waiting — there are no pause points, nothing to interleave, and async adds only overhead.

“If your code is slow because it’s computing, async does nothing. If it’s slow because it’s waiting — network, disk, database — async helps enormously.” That’s not an approximation. It’s a hard architectural boundary. Async video processing is still slow because it still runs on one core. Async HTTP fetching can be 50–100× faster because it fires requests and handles responses without blocking.

The 3 Ways to Accidentally Break Async Performance

Issue 1: Calling a blocking function inside async

import asyncio

# BAD — blocking call inside async function freezes the entire event loop
async def handler():
result = slow_sync_function() # blocks everything — no other coroutine runs
return result

import asyncio

# FIXED — run blocking function in a thread pool, don't block the event loop
async def handler():
loop = asyncio.get_running_loop()
result = await loop.run_in_executor(None, slow_sync_function)
return result

Issue 2: Using asyncio for CPU-heavy work

from multiprocessing import Pool

# For CPU-bound work — use multiprocessing, not asyncio
# asyncio won't help; multiprocessing actually uses multiple cores
with Pool(processes=4) as pool:
results = pool.map(heavy_compute_function, data_chunks)

Issue 3: Not actually awaiting coroutines

import asyncio

async def fetch_data():
return "data"

async def main():
# WRONG — this returns a coroutine object, not the result
result = fetch_data() # no await!
# Python will give: RuntimeWarning: coroutine 'fetch_data' was never awaited

# CORRECT
result = await fetch_data()

Decision Table: Which Concurrency Tool to Use

Situation	Right Tool
Waiting for network / API responses	asyncio + aiohttp
Waiting for DB queries	asyncio + asyncpg / databases lib
Waiting for disk IO	asyncio or threading
Heavy math / data processing	multiprocessing.Pool
Mixed IO with some CPU	ThreadPoolExecutor
Simple parallel tasks, no shared state	concurrent.futures

If you’re not sure which category your code is in — profile it first (section 5). Don’t guess. Running cProfile for 30 seconds will tell you whether you’re waiting or computing.

8. String Concatenation in a Loop (The Silent Killer)

String concatenation in a loop is one of the cleanest examples of an operation that looks fine and performs catastrophically. The reason is Python’s string immutability — and the math that follows from it.

Why += on Strings in a Loop Is O(n²)

Python strings are immutable. Every += creates a new string object and copies everything from the previous string plus the new addition. Concatenating 10,000 strings of average 100 characters each means copying approximately 5 billion characters total across all iterations. It’s not slow — it’s algorithmically broken for this use case.

Before/After with Real Timing Numbers

# SLOW — string += in loop
result = ""
large_list = [str(i) for i in range(100_000)]
for item in large_list:
result += item + ", "
# timeit result: ~3.8s for 100,000 items

# FAST — str.join()
result = ", ".join(large_list)
# timeit result: ~0.004s for 100,000 items
# ~950× faster — single allocation, one pass

# ALSO FAST — list append + join (when transformation needed before joining)
parts = []
for item in large_list:
parts.append(item.upper()) # transform first, join after
result = "".join(parts)
# Use this when you need to process each item before combining

# ALSO FAST — StringIO for complex multi-part string building
from io import StringIO

buf = StringIO()
for item in large_list:
buf.write(item)
buf.write("\n")
result = buf.getvalue()
# Best for cases where you're writing formatted output with mixed content

When f-Strings Are and Aren’t the Answer

f-strings are fast for single string formatting — they’re compiled to efficient bytecode and don’t have the immutability overhead of concatenation when used once. But f-strings inside a loop still create a new string object every iteration. The rule: if you’re building up a string piece by piece across iterations, use join(). If you’re formatting a single string with variable values in one shot, f-strings are the right choice.

9. N+1 Query Problem: Why Your DB Calls Are Killing Performance

The N+1 query problem is where an application fires one query to get a list of records, then fires one additional query per record to fetch related data. With 1000 records, that’s 1001 database round trips instead of 1. This is one of the most common causes of “too many database queries python” in Django and SQLAlchemy apps.

What N+1 Looks Like in Real Code

# N+1 BAD — Django ORM
# 1 query to get all orders, then 1 query per order to get user.name
orders = Order.objects.all()
for order in orders:
print(order.user.name) # hits DB every single iteration
# With 1000 orders: 1001 queries. With 10,000 orders: 10,001 queries.

# FIXED with select_related — 1 JOIN query, done
orders = Order.objects.select_related("user").all()
for order in orders:
print(order.user.name) # no additional DB hit — data already loaded
# With 1000 orders: 1 query. Period.

SQLAlchemy version of the same problem:

from sqlalchemy.orm import joinedload

# N+1 BAD
orders = session.query(Order).all()
for order in orders:
print(order.user.name) # lazy load fires per row

# FIXED with eager loading
orders = session.query(Order).options(joinedload(Order.user)).all()
for order in orders:
print(order.user.name) # already loaded in the initial query

How to Detect N+1 Without Guessing

# Method: SQLAlchemy query counter
from sqlalchemy import event

query_count = 0

@event.listens_for(engine, "before_cursor_execute")
def count_queries(conn, cursor, statement, parameters, context, executemany):
global query_count
query_count += 1

# Run your code, then print query_count
# If it's suspiciously high relative to record count — N+1 confirmed

For Django, install Django Debug Toolbar — it shows the SQL panel with every query, its execution time, and duplicate detection. It’s the fastest way to spot N+1 visually. For logging all queries without the toolbar:

# Django settings.py — log all SQL queries to console
LOGGING = {
"version": 1,
"handlers": {"console": {"class": "logging.StreamHandler"}},
"loggers": {
"django.db.backends": {
"handlers": ["console"],
"level": "DEBUG",
}
},
}

Fix Patterns Beyond select_related

# prefetch_related for ManyToMany relationships
orders = Order.objects.prefetch_related("tags").all()

# Bulk insert vs individual saves — 1000× difference at scale
# SLOW: 1000 INSERT queries
for item in items:
MyModel.objects.create(**item)

# FAST: 1 INSERT query
MyModel.objects.bulk_create([MyModel(**item) for item in items])

When the ORM generates a monster query with 12 JOINs that’s slower than the N+1 it replaced, raw SQL is the right move. Django’s raw() and SQLAlchemy’s text() exist for exactly this case. Connection pooling with SQLAlchemy’s pool_size parameter matters too — if your query count is reasonable but latency is high, you may be paying connection overhead on every request.

10. Quick Wins — 10 Changes That Actually Speed Up Python Code

These are concrete, mechanism-explained optimizations. Not “write better code” advice — actual changes with measurable effects.

Worth Reading

Python Observability

Python Observability Gaps That Kill Your Microservices at Scale When your Uvicorn workers start choking on 5000 req/s, you don't want dashboards full of uptime pings and memory RSS graphs. You want to know which...

1. Use slots on Classes You Instantiate Thousands of Times

By default, every Python object instance stores its attributes in a __dict__ — a hash table. For classes you create thousands of times, that’s thousands of hash tables. __slots__ replaces the dict with a fixed-size array, reducing memory per instance by 40–50% and speeding up attribute access.

class WithoutSlots:
def __init__(self, x, y):
self.x = x
self.y = y
# Each instance: ~232 bytes (includes __dict__ overhead)

class WithSlots:
__slots__ = ["x", "y"]
def __init__(self, x, y):
self.x = x
self.y = y
# Each instance: ~56 bytes — ~4× less memory at scale

2. Cache Expensive Function Results with @lru_cache

lru_cache memoizes function calls — if you call the same function with the same arguments twice, the second call returns the cached result instantly. Critical for recursive algorithms and repeated DB lookup simulations.

from functools import lru_cache

@lru_cache(maxsize=512) # cache up to 512 unique argument combinations
def expensive_computation(n):
# This runs once per unique n — subsequent calls hit the cache
return sum(range(n))

3. Use Local Variables Inside Hot Loops

In CPython, local variable lookup is a single array index operation. Global variable and attribute lookups require dictionary traversal. In a loop running millions of times, this overhead adds up to measurable slowdowns.

# SLOW — attribute lookup on every iteration
for item in big_list:
self.process(item) # self.process resolved every time

# FAST — cache to local before loop
process = self.process # one lookup
for item in big_list:
process(item) # local var lookup — faster in CPython

4. Use Sets for Membership Testing, Not Lists

List in operator is O(n) — it scans every element until it finds a match or exhausts the list. Set in is O(1) — hash lookup, constant time regardless of size. At 1M items, the difference is milliseconds vs seconds.

items_list = list(range(1_000_000))
items_set = set(range(1_000_000))

# SLOW — O(n) scan
if 999_999 in items_list: # scans up to 1M elements
pass

# FAST — O(1) hash lookup
if 999_999 in items_set: # one hash computation, done
pass

5. Precompile Regular Expressions

Calling re.search(pattern, text) inside a loop recompiles the pattern on every call. re.compile() does the compilation once and returns a reusable pattern object. For loops doing thousands of regex operations, this is a free performance gain.

import re

# SLOW — recompiles pattern every iteration
for line in lines:
if re.search(r"\d{4}-\d{2}-\d{2}", line):
process(line)

# FAST — compile once, use many times
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
for line in lines:
if DATE_PATTERN.search(line):
process(line)

6. Use dict.get() Instead of try/except for Missing Keys

Exception handling in Python has overhead — building the exception object, unwinding the stack, matching the except clause. For missing dict keys that are a common case (not an exceptional one), dict.get() is cleaner and faster. Reserve try/except for genuinely rare failure paths.

# SLOW for frequent misses — exception overhead adds up
try:
value = my_dict["key"]
except KeyError:
value = default_value

# FAST — no exception machinery for missing keys
value = my_dict.get("key", default_value)

7. Avoid Importing Inside Functions or Loops

Every import statement triggers a sys.modules lookup. Even for cached modules, that’s unnecessary work inside a hot loop. Beyond performance, imports inside functions are usually a sign of circular dependency workarounds or lazy loading gone wrong — both worth fixing at the architecture level.

# BAD — import inside a function called thousands of times
def process_item(item):
import json # sys.modules check every call
return json.dumps(item)

# GOOD — import at module top level
import json

def process_item(item):
return json.dumps(item)

8. Use enumerate() and zip() Instead of range(len())

range(len(lst)) creates an extra range object and requires an index lookup (lst[i]) on every iteration — two operations where one would do. enumerate() yields the index and value together in a single pass. It’s cleaner and marginally faster at scale.

# OLD WAY
for i in range(len(items)):
print(i, items[i]) # extra indexing operation per iteration

# BETTER
for i, item in enumerate(items):
print(i, item) # cleaner, one operation

# For two lists
for a, b in zip(list_a, list_b):
process(a, b) # no indexing at all

9. Replace pandas apply() with Vectorized Operations

apply() executes a Python function row by row — it’s essentially a for loop over the DataFrame in Python. Vectorized operations call into numpy’s C-level implementation, which processes entire arrays at once. The gap is typically 10–100× for numeric operations.

import pandas as pd

df = pd.DataFrame({"value": range(1_000_000)})

# SLOW — apply() loops in Python
df["doubled"] = df["value"].apply(lambda x: x * 2)
# timeit: ~0.8s

# FAST — vectorized operation
df["doubled"] = df["value"] * 2
# timeit: ~0.004s — ~200× faster

10. Use orjson Instead of json for Large JSON Serialization

Python’s stdlib json module is pure Python. orjson is a Rust-based JSON library that serializes large payloads 5–10× faster and handles datetime, numpy arrays, and UUIDs natively. For services doing heavy JSON work, this is a one-line change with a real throughput impact.

pip install orjson

import json
import orjson

large_data = {"records": list(range(100_000))}

# stdlib json
json.dumps(large_data) # ~45ms

# orjson
orjson.dumps(large_data) # ~8ms — ~5× faster, returns bytes
# orjson.loads() for deserialization — equally faster

FAQ

Why is my Python code slow even with a small dataset?

Small datasets expose startup costs, not data-volume costs. Look for heavy imports, unnecessary object initialization, synchronous HTTP calls on startup, or database connection overhead. Run cProfile on a cold start — if the slowness is front-loaded before your main logic runs, it’s initialization. Also check for accidental apply() usage or regex recompilation inside function calls that get hit repeatedly regardless of data size.

Does adding more RAM make Python code run faster?

Sometimes, but rarely in the way developers expect. More RAM prevents OOM crashes and swap usage (which is catastrophically slow), but it doesn’t speed up CPU-bound operations. If your code is slow because it’s thrashing swap on a 1GB CSV load, more RAM helps. If your code is slow because of O(n²) nested loops, more RAM does nothing. Profile before buying hardware.

Is asyncio worth using if I’m not a backend developer?

Yes, in specific cases — particularly data pipelines that make multiple API calls or scrape multiple URLs. If your script does 100 sequential HTTP requests and each takes 200ms, that’s 20 seconds synchronously. With asyncio + aiohttp, those 100 requests fire concurrently and complete in ~0.5–1s. But if your bottleneck is data processing rather than waiting for external responses, asyncio adds complexity with no benefit.

Why does Python use 100% CPU on only one core?

The GIL — Global Interpreter Lock — prevents true multi-threading in CPython. Even with multiple threads, only one thread runs Python bytecode at a time. If you need true parallelism for CPU-bound work, use multiprocessing — each process gets its own GIL and its own core. Threading in Python is appropriate for IO-bound work where threads spend most of their time waiting, not computing. This is a CPython implementation detail; PyPy and GraalPy handle this differently.

How much faster is NumPy than pure Python loops?

For numeric operations on arrays, typically 50–200× faster in practice. The gap comes from three factors: numpy operations run in C, they process entire arrays at once (SIMD-friendly), and they avoid Python’s per-iteration interpreter overhead. The exact speedup depends on operation type — simple element-wise math hits the high end; operations with complex conditionals or mixed types see less gain. The 14.3s → 0.08s benchmark in section 2 (matrix addition) is representative of the ceiling.

When should you NOT optimize your code?

When you haven’t profiled and don’t know the actual bottleneck. Premature optimization routinely makes code slower by optimizing the wrong thing — adding complexity without gains, or introducing bugs into non-critical paths. Also: when the code runs once, when the current performance is good enough for the use case, or when readability trade-offs aren’t worth the marginal speed improvement. If your script runs in 2 seconds and it’s a daily batch job, spending 3 days optimizing it to 0.5 seconds is the wrong call.

My code got slower after I “optimized” it — what happened?

Most likely: you optimized something that wasn’t the bottleneck, and the added complexity (extra function calls, data structure conversion overhead, unnecessary caching) made things worse. The second most common cause is introducing an unintended O(n²) operation — a common trap when converting a generator to a list to “optimize” a membership check that was already fast. Measure before and after every change. One change at a time. If it’s slower, revert.

Profile first. Fix the biggest single bottleneck. Measure again. Repeat. That loop — boring as it sounds — is the only reliable way to make code faster without making it worse. The one thing actually worth remembering: every optimization that isn’t backed by a measurement is just a guess, and guesses compound into systems nobody wants to maintain.

Written by:

Bart.F Burek

Related Articles