Profiling and Performance Optimization¶
Never optimize without measuring. The 80/20 rule: 80% of runtime is often in 20% of the code. Profile first, then optimize algorithmically, then cache, then consider low-level tricks.
Key Facts¶
- `cProfile` is the built-in CPU profiler; `line_profiler` for line-by-line analysis
- `memory_profiler` tracks memory usage per line
- `@functools.lru_cache` provides built-in memoization
- String join is O(n) vs `+=` concatenation in a loop, which is O(n^2)
- dict/set lookup is O(1); list search is O(n)
- Generator expressions use constant memory vs list comprehensions
Patterns¶
cProfile¶
import cProfile, pstats
cProfile.run('my_function()', 'output.prof')
stats = pstats.Stats('output.prof')
stats.sort_stats('cumulative')
stats.print_stats(20)
Output columns: ncalls (call count), tottime (time excluding subcalls), cumtime (including subcalls).
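The same profiling can be driven in-process without writing a `.prof` file; a minimal sketch using `cProfile.Profile` directly (the recursive `fib` is just a toy workload):

```python
import cProfile, pstats, io

def fib(n):
    # Deliberately naive recursion so the profile has something to show
    return n if n < 2 else fib(n - 1) + fib(n - 2)

pr = cProfile.Profile()
pr.enable()
fib(20)
pr.disable()

# Stats can read a live Profile object, not just a saved file
out = io.StringIO()
stats = pstats.Stats(pr, stream=out)
stats.sort_stats('cumulative').print_stats(5)
print(out.getvalue())  # ncalls/tottime/cumtime table, top 5 entries
```

Profiling in-process like this is handy in tests or notebooks, where writing and reloading an output file is unnecessary overhead.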
line_profiler¶
@profile  # decorate target function; run with: kernprof -l -v script.py
def slow_function():
    result = []
    for i in range(10000):
        result.append(i ** 2)
    return sum(result)
memory_profiler¶
from memory_profiler import profile

@profile
def memory_heavy():
    big_list = [i for i in range(1_000_000)]
    del big_list
Optimization Methodology¶
- Measure - profile to find actual bottleneck
- Algorithmic - better data structures/algorithms (biggest impact)
- Caching - avoid redundant computation/IO
- I/O optimization - batch queries, connection pooling, async
- Python tricks - comprehensions, generators, `__slots__`
- C extensions - NumPy, Cython (last resort)
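The algorithmic step usually dwarfs the others. A sketch contrasting an O(n^2) duplicate check against its O(n) set-based replacement (toy data, hypothetical function names):

```python
def has_duplicates_quadratic(items):
    # O(n^2): list membership scan inside a loop
    seen = []
    for x in items:
        if x in seen:      # O(n) list search
            return True
        seen.append(x)
    return False

def has_duplicates_linear(items):
    # O(n): set membership is O(1) on average
    seen = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False

data = list(range(10_000)) + [0]
assert has_duplicates_quadratic(data) == has_duplicates_linear(data) == True
```

No amount of caching or micro-tuning makes the quadratic version competitive at scale; swapping the data structure does.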
Common Performance Tips¶
# List comprehension > loop with append
squares = [x**2 for x in range(1000)]
# Generator for large sequences
total = sum(x**2 for x in range(1_000_000))
# Set for membership testing
items = set(large_list) # convert once
if x in items:  # O(1)
    ...
# String join > concatenation
result = ''.join(parts) # O(n) vs O(n^2) for +=
# __slots__ for memory
class Point:
    __slots__ = ['x', 'y']
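The tips above can be checked with `timeit`, the right tool for micro-benchmarks. A sketch comparing `+=` with `''.join()` (timings vary by machine, and CPython sometimes optimizes `+=` on strings in place, so treat the numbers as indicative, not guaranteed):

```python
import timeit

parts = [str(i) for i in range(1000)]

def concat():
    s = ''
    for p in parts:
        s += p           # worst case quadratic: repeated copies
    return s

def join():
    return ''.join(parts)  # single allocation pass

assert concat() == join()  # same result either way
t_concat = timeit.timeit(concat, number=200)
t_join = timeit.timeit(join, number=200)
print(f'+=: {t_concat:.4f}s  join: {t_join:.4f}s')
```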
In-Memory Caching with TTL¶
import datetime
from functools import lru_cache
_cache = {}
def get_weather(city):
    cached = _cache.get(city)
    # total_seconds(), not .seconds: .seconds wraps at one day and would
    # make very stale entries look fresh again
    if cached and (datetime.datetime.now() - cached['time']).total_seconds() < 300:
        return cached['data']
    data = call_weather_api(city)
    _cache[city] = {'data': data, 'time': datetime.datetime.now()}
    return data
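The pattern above generalizes into a small decorator. A sketch using `time.monotonic()` (the name `ttl_cache` and the `expensive_lookup` example are illustrative; single-threaded, no eviction):

```python
import time
import functools

def ttl_cache(ttl_seconds):
    def decorator(fn):
        cache = {}  # args tuple -> (value, timestamp)
        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()  # immune to wall-clock adjustments
            hit = cache.get(args)
            if hit is not None and now - hit[1] < ttl_seconds:
                return hit[0]
            value = fn(*args)
            cache[args] = (value, now)
            return value
        return wrapper
    return decorator

calls = 0

@ttl_cache(ttl_seconds=300)
def expensive_lookup(key):
    global calls
    calls += 1
    return key.upper()

expensive_lookup('paris')
expensive_lookup('paris')  # served from cache; underlying function ran once
```

`time.monotonic()` is preferable to `datetime.now()` for expiry checks because it cannot jump backwards when the system clock is adjusted.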
# Built-in memoization
@lru_cache(maxsize=128)
def expensive(n):
    return sum(i**2 for i in range(n))
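`lru_cache` also exposes introspection and invalidation hooks, which the dict-based cache above lacks:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive(n):
    return sum(i**2 for i in range(n))

expensive(1000)
expensive(1000)          # second call with the same argument is a cache hit
info = expensive.cache_info()
print(info)              # CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
expensive.cache_clear()  # reset when the underlying data changes
```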
Redis External Caching¶
import redis, json
r = redis.Redis(host='localhost', port=6379)
def get_data(key):
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    data = fetch_from_db(key)
    r.setex(key, 300, json.dumps(data))  # TTL 300 seconds
    return data
Load Balancing¶
# Multiple Uvicorn workers
uvicorn app:app --workers 4
# Gunicorn with Uvicorn workers
gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker
Architecture: Client -> Nginx (reverse proxy) -> Gunicorn -> [Uvicorn workers]
Gotchas¶
- Premature optimization is the root of all evil - measure first
- `@lru_cache` arguments must be hashable (no lists/dicts as args)
- Redis adds infrastructure complexity but wins for multi-process/multi-server caching
- Generator expressions cannot be iterated twice - convert to list if needed
- `cProfile` overhead can skew results for very fast functions - use `timeit` for micro-benchmarks
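The single-pass gotcha is easy to demonstrate:

```python
squares = (x * x for x in range(3))
assert list(squares) == [0, 1, 4]
assert list(squares) == []       # exhausted: a second pass yields nothing

# Materialize into a list when you need multiple passes
squares_list = [x * x for x in range(3)]
assert sum(squares_list) == 5 and max(squares_list) == 4
```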
See Also¶
- memory and internals - CPython memory model, GC
- async programming - async I/O for performance
- concurrency - parallel execution for CPU-bound work
- fastapi caching and tasks - caching in web applications