Regular Expressions
The re module provides Perl-style regular expressions for pattern matching, searching, and text manipulation. Always use raw strings r'...' for patterns to avoid double-escaping.
Key Facts
- re.search() finds the first match anywhere in the string; re.match() matches only at the start; re.fullmatch() must match the entire string
- re.findall() returns a list of all matches; re.finditer() returns an iterator of Match objects
- re.sub() replaces matches; re.split() splits by pattern
- re.compile() pre-compiles a pattern for reuse (faster with repeated use)
- Greedy (*, +) matches as much as possible; lazy (*?, +?) as little as possible
- Match object: .group() full match, .group(1) first capture group, .groups() all groups
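A minimal sketch of the two functions from the list above that have no example elsewhere in these notes, re.compile() and re.finditer(). The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "order 17 shipped, order 204 pending"

# Compile once, then reuse the pattern object for several calls.
number = re.compile(r"\d+")

# findall() on a compiled pattern returns plain strings.
values = number.findall(text)
print(values)  # ['17', '204']

# finditer() yields Match objects, so positions are available as well.
found = [(m.group(), m.start()) for m in number.finditer(text)]
print(found)  # [('17', 6), ('204', 24)]
```

finditer() is usually the better fit when the position of each match matters or when the text is large, since it does not build the whole list up front.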
Patterns
Core Functions
import re
re.search(r'\d+', 'abc 123 def') # Match at '123'
re.match(r'\d+', '123 abc') # Match at '123' (start only)
re.fullmatch(r'\d+', '123') # Match (entire string)
re.findall(r'\d+', 'a1 b22 c333') # ['1', '22', '333']
re.sub(r'\s+', ' ', 'a   b    c') # 'a b c'
re.split(r'[,;.]', 'a,b;c.d') # ['a', 'b', 'c', 'd']
Match Object
m = re.search(r'(\d+)-(\d+)', 'tel: 555-1234')
m.group() # '555-1234' (entire match)
m.group(1) # '555' (first group)
m.group(2) # '1234' (second group)
m.groups() # ('555', '1234')
m.start() # 5
m.end() # 13
m.span() # (5, 13)
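re.search() returns None when nothing matches, so calling .group() on the result without a check raises AttributeError. A small sketch of the usual guard follows; the helper name extract_range is hypothetical.

```python
import re

def extract_range(text):
    """Return the two numbers of an 'N-M' pair as ints, or None if absent.

    extract_range is a hypothetical helper, not part of the re module.
    """
    m = re.search(r"(\d+)-(\d+)", text)
    if m is None:  # search() returns None when there is no match
        return None
    return int(m.group(1)), int(m.group(2))

hit = extract_range("tel: 555-1234")
miss = extract_range("no digits here")
print(hit)   # (555, 1234)
print(miss)  # None
```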
Character Classes
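| Pattern | Matches |
|---------|---------|
| . | Any char except newline (any char at all with re.S) |
| \d / \D | Digit / non-digit |
| \w / \W | Word char / non-word ([a-zA-Z0-9_] with re.ASCII; for str patterns \w also matches Unicode letters and digits by default) |
| \s / \S | Whitespace / non-whitespace |
| [abc] | Any of a, b, c |
| [a-z] | Range |
| [^abc] | NOT a, b, c |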
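A short sketch of how the classes in the table behave on a concrete string. For str patterns \w is Unicode-aware by default, and re.ASCII restricts it to [a-zA-Z0-9_]. The sample strings are made up for illustration.

```python
import re

# Hypothetical sample: "naive" with i-diaeresis, "cafe" with e-acute.
sample = "na\u00efve caf\u00e9_42"

# Default (Unicode) matching: accented letters count as word characters.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", sample, re.ASCII)
print(ascii_words)  # ['na', 've', 'caf', '_42']

# Negated class: every character that is neither a vowel nor whitespace.
consonants = re.findall(r"[^aeiou\s]", "regex ok")
print(consonants)  # ['r', 'g', 'x', 'k']
```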
Quantifiers
| Pattern | Meaning |
|---------|---------|
| * / *? | 0+ (greedy / lazy) |
| + / +? | 1+ (greedy / lazy) |
| ? / ?? | 0 or 1 (greedy / lazy) |
| {n} | Exactly n |
| {n,m} | Between n and m (inclusive) |
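The counted quantifiers can be illustrated with a quick sketch. The \b anchors are there so each run of letters is matched as a whole word; the open-ended form {n,} (n or more) is not in the table but is also supported by re. The sample text is made up.

```python
import re

# Hypothetical sample: runs of 'a' of length 1 to 4.
text = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", text)
two_to_three = re.findall(r"\ba{2,3}\b", text)
two_or_more = re.findall(r"\ba{2,}\b", text)
print(exactly_two)   # ['aa']
print(two_to_three)  # ['aa', 'aaa']
print(two_or_more)   # ['aa', 'aaa', 'aaaa']

# '?' makes the preceding item optional.
spellings = re.findall(r"colou?r", "color colour")
print(spellings)  # ['color', 'colour']
```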
Greedy vs Lazy
re.findall(r'<B>.*</B>', '<B>a</B> and <B>b</B>') # ['<B>a</B> and <B>b</B>']
re.findall(r'<B>.*?</B>', '<B>a</B> and <B>b</B>') # ['<B>a</B>', '<B>b</B>']
Anchors and Boundaries
# ^ start, $ end, \b word boundary
re.findall(r'\bcat\b', 'the cat scattered cats') # ['cat']
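A sketch of ^ and $ on multi-line input, with and without re.M. The log text is invented for illustration.

```python
import re

# Hypothetical three-line log.
log = "ERROR disk full\nINFO ok\nERROR timeout"

# Without re.M, ^ matches only at the very start of the string.
starts_default = re.findall(r"^ERROR", log)
# With re.M, ^ also matches right after each newline.
starts_multiline = re.findall(r"^ERROR", log, re.M)
print(starts_default)    # ['ERROR']
print(starts_multiline)  # ['ERROR', 'ERROR']

# Same idea for $: end of string only, or end of every line with re.M.
last_default = re.findall(r"\w+$", log)
last_multiline = re.findall(r"\w+$", log, re.M)
print(last_default)    # ['timeout']
print(last_multiline)  # ['full', 'ok', 'timeout']
```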
Groups and Backreferences
# Capturing groups
m = re.search(r'((ab)(cd))', 'abcd')
m.group(1) # 'abcd' (outer group)
m.group(2) # 'ab'
m.group(3) # 'cd'
# Non-capturing group
re.findall(r'(?:ab)+', 'ababab') # ['ababab']
# Backreference (find repeated words)
re.findall(r'\b(\w+)\s+\1\b', 'the the cat') # ['the']
# Sub with backreference
re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world') # 'world hello'
Flags
re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
# re.I - case insensitive
# re.M - ^ and $ match line boundaries
# re.S - . matches newline
# re.X - verbose mode (comments, whitespace in pattern)
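A sketch of the flags in use, with invented sample data. re.X lets a pattern be spread over several lines with comments, and re.S combined with re.I shows flags being OR-ed together.

```python
import re

# Verbose mode: whitespace and '#' comments inside the pattern are ignored.
date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)

parts = date.search("released 2024-03-15").groups()
print(parts)  # ('2024', '03', '15')

# re.I ignores case (<p> matches <P>); re.S lets '.' cross the line break.
html = "<P>line one\nline two</p>"
body = re.search(r"<p>(.*)</p>", html, re.I | re.S).group(1)
print(repr(body))  # 'line one\nline two'
```

To match a literal space or '#' inside a verbose pattern, escape it or put it in a character class.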
Common Practical Patterns
# Email (simplified)
r'[\w.-]+@[\w.-]+\.\w+'
# Phone (Russian format: +7-XXX-XXX-XX-XX)
r'\+7-\d{3}-\d{3}-\d{2}-\d{2}'
# IP address (simplified; does not check that each octet is 0-255)
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
# URL
r'https?://[\w./\-?=&#]+'
# Normalize whitespace
re.sub(r'\s+', ' ', text).strip()
# Extract all numbers
numbers = [int(x) for x in re.findall(r'\d+', text)]
# Split by multiple delimiters
re.split(r'[,;.\s]+', text)
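The IP pattern above checks only the shape of the string, so it also accepts something like 999.1.1.1. One possible way to tighten it is to check the shape with the regex and the numeric range in plain Python. This is a sketch under that assumption; is_ipv4 and IP_SHAPE are hypothetical names, and the check still accepts leading zeros.

```python
import re

# Same shape-only pattern as in the notes above.
IP_SHAPE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def is_ipv4(text):
    """True if text is four dot-separated numbers, each at most 255.

    Hypothetical helper: the regex checks the shape, Python checks the range.
    """
    if IP_SHAPE.fullmatch(text) is None:
        return False
    return all(int(part) <= 255 for part in text.split("."))

print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("999.1.1.1"))    # False
print(is_ipv4("1.2.3"))        # False
```

For production use, the standard-library ipaddress module is usually a safer choice than a hand-written pattern.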
Gotchas
- Always use raw strings r'...' so you can write \d instead of \\d
- re.match() only matches at the start of the string; use re.search() to match anywhere
- findall() with capturing groups returns the groups, not the full match; use non-capturing (?:...) if you need the full match
- re.split() with capturing groups includes the captured text in the result
- re.escape(string) escapes all metacharacters so the string can be used as a literal pattern
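The gotchas above are easy to reproduce in a few lines. The sample strings are made up for illustration.

```python
import re

# match() anchors at the start; search() scans the whole string.
at_start = re.match(r"\d+", "abc 123")
anywhere = re.search(r"\d+", "abc 123")
print(at_start)          # None
print(anywhere.group())  # 123

text = "2024-03-15 and 2025-01-02"

# A capturing group makes findall() return only the group...
months = re.findall(r"\d{4}-(\d{2})-\d{2}", text)
# ...while a non-capturing group returns the full match.
dates = re.findall(r"\d{4}-(?:\d{2})-\d{2}", text)
print(months)  # ['03', '01']
print(dates)   # ['2024-03-15', '2025-01-02']

# A capturing group in split() keeps the delimiters in the result.
plain_split = re.split(r",", "a,b,c")
kept_split = re.split(r"(,)", "a,b,c")
print(plain_split)  # ['a', 'b', 'c']
print(kept_split)   # ['a', ',', 'b', ',', 'c']

# re.escape() makes '$' and '.' literal instead of special.
price = "$5.00"
escaped = re.escape(price)
found_price = re.search(escaped, "total: $5.00") is not None
print(escaped)      # \$5\.00
print(found_price)  # True
```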
See Also