Regular Expressions¶

The re module provides Perl-style regular expressions for pattern matching, searching, and text manipulation. Always use raw strings r'...' for patterns to avoid double-escaping.

Key Facts¶

re.search() finds first match anywhere; re.match() only at start; re.fullmatch() entire string
re.findall() returns list of all matches; re.finditer() returns iterator of Match objects
re.sub() replaces matches; re.split() splits by pattern
re.compile() pre-compiles pattern for reuse (faster with repeated use)
Greedy (*, +) matches as much as possible; lazy (*?, +?) as little as possible
Match object: .group() full match, .group(1) first capture group, .groups() all groups

Patterns¶

Core Functions¶

import re

re.search(r'\d+', 'abc 123 def')     # Match at '123'
re.match(r'\d+', '123 abc')          # Match at '123' (start only)
re.fullmatch(r'\d+', '123')          # Match (entire string)
re.findall(r'\d+', 'a1 b22 c333')    # ['1', '22', '333']
re.sub(r'\s+', ' ', 'a  b   c')      # 'a b c'
re.split(r'[,;.]', 'a,b;c.d')        # ['a', 'b', 'c', 'd']

Match Object¶

m = re.search(r'(\d+)-(\d+)', 'tel: 555-1234')
m.group()     # '555-1234' (entire match)
m.group(1)    # '555' (first group)
m.group(2)    # '1234' (second group)
m.groups()    # ('555', '1234')
m.start()     # 5
m.end()       # 13
m.span()      # (5, 13)

Character Classes¶

Pattern	Matches
`.`	Any char except newline
`\d` / `\D`	Digit / non-digit
`\w` / `\W`	Word char `[a-zA-Z0-9_]` / non-word
`\s` / `\S`	Whitespace / non-whitespace
`[abc]`	Any of a, b, c
`[a-z]`	Range
`[^abc]`	NOT a, b, c

Quantifiers¶

Pattern	Meaning
`` / `?`	0+ (greedy / lazy)
`+` / `+?`	1+ (greedy / lazy)
`?` / `??`	0 or 1 (greedy / lazy)
`{n}`	Exactly n
`{n,m}`	Between n and m

Greedy vs Lazy¶

re.findall(r'<B>.*</B>', '<B>a</B> and <B>b</B>')   # ['<B>a</B> and <B>b</B>']
re.findall(r'<B>.*?</B>', '<B>a</B> and <B>b</B>')  # ['<B>a</B>', '<B>b</B>']

Anchors and Boundaries¶

# ^ start, $ end, \b word boundary
re.findall(r'\bcat\b', 'the cat scattered cats')  # ['cat']

Groups and Backreferences¶

# Capturing groups
m = re.search(r'((ab)(cd))', 'abcd')
m.group(1)  # 'abcd', m.group(2)  # 'ab', m.group(3)  # 'cd'

# Non-capturing group
re.findall(r'(?:ab)+', 'ababab')  # ['ababab']

# Backreference (find repeated words)
re.findall(r'\b(\w+)\s+\1\b', 'the the cat')  # ['the']

# Sub with backreference
re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')  # 'world hello'

Flags¶

re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
# re.I - case insensitive
# re.M - ^ and $ match line boundaries
# re.S - . matches newline
# re.X - verbose mode (comments, whitespace in pattern)

Common Practical Patterns¶

# Email (simplified)
r'[\w.-]+@[\w.-]+\.\w+'

# Phone
r'\+7-\d{3}-\d{3}-\d{2}-\d{2}'

# IP address
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# URL
r'https?://[\w./\-?=&#]+'

# Normalize whitespace
re.sub(r'\s+', ' ', text).strip()

# Extract all numbers
numbers = [int(x) for x in re.findall(r'\d+', text)]

# Split by multiple delimiters
re.split(r'[,;.\s]+', text)

Gotchas¶

Always use raw strings r'...' to avoid \\d instead of \d
re.match() only matches at string start - use re.search() for anywhere
findall() with groups returns groups, not full match - use non-capturing (?:...) if needed
re.split() with capturing groups includes the groups in the result
re.escape(string) escapes all metacharacters for use as literal pattern

Regular Expressions¶

Key Facts¶

Patterns¶

Core Functions¶

Match Object¶

Character Classes¶

Quantifiers¶

Greedy vs Lazy¶

Anchors and Boundaries¶

Groups and Backreferences¶

Flags¶

Common Practical Patterns¶

Gotchas¶

See Also¶