Mastering Python's re Module: A Practical Deep Dive into Regular Expressions

Learn Python's re module from the ground up: search vs. match, groups and named captures, substitution with functions, lookarounds, flags, and the pitfalls that trip up almost everyone.

Anton Jacker

Jun 29, 2026 — 5 min read

Sooner or later every Python developer reaches for a regular expression — to validate an email, pull numbers out of a log line, or reshape some messy text. And sooner or later, that regex misbehaves in a way that feels like dark magic. The truth is that the re module is small, learnable, and predictable once you understand a handful of core ideas. This guide walks through the parts you will actually use, with runnable examples and the pitfalls that catch almost everyone.

search, match, and fullmatch: three ways to look

The first source of confusion is that Python gives you three closely related functions, and they are not interchangeable. re.search scans the whole string for the first place the pattern matches. re.match only succeeds if the pattern matches at the very start of the string. re.fullmatch requires the pattern to consume the entire string.

import re

text = "Order 12345 shipped"

re.search(r"\d+", text).group()   # '12345' — found anywhere
re.match(r"\d+", text)            # None — string doesn't start with a digit
re.match(r"Order", text).group()  # 'Order' — matches at the start

re.fullmatch(r"\d+", "98765").group()  # '98765'
re.fullmatch(r"\d+", "98765x")         # None — trailing 'x' isn't consumed

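A practical rule: use search when hunting inside text, and fullmatch when validating that an entire value conforms to a format. match is the one people reach for by habit and then wonder why it returns None. When in doubt, prefer search, or anchor your pattern explicitly with ^ and $. Keep in mind that $ also matches just before a trailing newline, which fullmatch does not allow, so fullmatch remains the stricter choice for validation.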
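To make the validation rule concrete, here is a minimal sketch. The is_order_id helper and the five-digit format are assumptions made up for illustration; the point is only how the three approaches treat trailing text.

```python
import re

def is_order_id(value):
    """Hypothetical validator: True only if value is exactly five digits."""
    return re.fullmatch(r"\d{5}", value) is not None

# fullmatch rejects anything the pattern does not consume
strict_ok = is_order_id("12345")           # True
strict_extra = is_order_id("12345x")       # False
strict_newline = is_order_id("12345\n")    # False

# match anchors only the start, so trailing text slips through
loose_extra = re.match(r"\d{5}", "12345x") is not None           # True

# ^...$ is closer, but $ also matches just before a trailing newline
anchored_newline = re.search(r"^\d{5}$", "12345\n") is not None  # True
```

If the value comes from a file or a form, the trailing-newline case is the one most likely to surprise you, which is why fullmatch is the safer default for validation.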

Finding every match: findall and finditer

When you want all matches rather than just the first, re.findall returns a list, and re.finditer returns an iterator of rich match objects (so you keep positions and groups).

prices = "Apples $3, pears $5, plums $12"

re.findall(r"\$(\d+)", prices)   # ['3', '5', '12'] — the captured group only

for m in re.finditer(r"\$\d+", prices):
    print(m.start(), m.group())
# 7 $3
# 17 $5
# 27 $12

There is a subtle trap in findall: its return value depends on how many capturing groups your pattern has. With no groups it returns the full matches; with one group it returns that group's text; with several groups it returns a list of tuples. If you only added parentheses for grouping and not capturing, use a non-capturing group (?:...) to keep the output predictable.
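A short sketch of that behavior, using made-up sample strings, shows how the same kind of pattern returns three different shapes depending on its groups:

```python
import re

dates = "2026-06-29 and 2027-01-15"

# No groups: the full matches
no_groups = re.findall(r"\d{4}-\d{2}-\d{2}", dates)
# ['2026-06-29', '2027-01-15']

# One group: only that group's text
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", dates)
# ['2026', '2027']

# Several groups: a list of tuples
two_groups = re.findall(r"(\d{4})-(\d{2})-\d{2}", dates)
# [('2026', '06'), ('2027', '01')]

files = "a.jpg b.png c.txt"

# Parentheses used only for alternation still capture...
capturing = re.findall(r"\w+\.(jpg|png)", files)
# ['jpg', 'png']

# ...unless the group is made non-capturing
non_capturing = re.findall(r"\w+\.(?:jpg|png)", files)
# ['a.jpg', 'b.png']
```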

Capturing data with groups and named groups

Parentheses capture sub-matches. You can grab them by index, but named groups — (?P<name>...) — make patterns far more readable and your code far more robust to edits.

log = "2026-06-29 14:03:55 ERROR disk full"

m = re.search(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+)",
    log,
)

m.group("level")   # 'ERROR'
m.group("date")    # '2026-06-29'
m.groupdict()      # {'date': '2026-06-29', 'time': '14:03:55', 'level': 'ERROR'}

groupdict() is especially handy: it hands you a clean dictionary you can drop straight into a dataclass or a database row.
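As a rough sketch of that idea, the following feeds groupdict() into a dataclass. The LogEntry class, the parse_line helper, and the extra message group are assumptions added for illustration, not part of the example above.

```python
import re
from dataclasses import dataclass

# Same log format as above, plus a hypothetical "message" group
LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<message>.*)"
)

@dataclass
class LogEntry:
    date: str
    time: str
    level: str
    message: str

def parse_line(line):
    """Return a LogEntry, or None if the line does not fit the format."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    # Group names match the field names, so the dict unpacks directly
    return LogEntry(**m.groupdict())

entry = parse_line("2026-06-29 14:03:55 ERROR disk full")
# LogEntry(date='2026-06-29', time='14:03:55', level='ERROR', message='disk full')
missing = parse_line("not a log line")
# None
```

This only works because the group names and the dataclass fields are kept identical; rename one and the unpacking raises a TypeError, which is arguably a useful early warning.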

Compile once, use many times

If a pattern is used repeatedly — in a loop, or across many calls — compile it once with re.compile. The module-level functions cache compiled patterns internally, but an explicit compiled object reads better, gives you methods like .findall and .sub directly, and documents intent.

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

EMAIL.findall("Contact a@x.com or b.c@sub.example.org")
# ['a@x.com', 'b.c@sub.example.org']

One honest caveat: validating email with a regex is genuinely hard, and no short pattern is fully correct. The one above is fine for catching typos in a form; for real verification, send a confirmation message instead.
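For the "compile once" habit itself, a minimal sketch might look like this. The STATUS pattern and the sample lines are invented for illustration; note the None check, since search returns None when a line does not match.

```python
import re

# Hypothetical pattern for lines such as "user=alice status=200"
STATUS = re.compile(r"status=(\d{3})")

lines = [
    "user=alice status=200",
    "user=bob status=404",
    "malformed line",
]

codes = []
for line in lines:
    m = STATUS.search(line)    # compiled patterns expose search directly
    if m:                      # skip lines with no match instead of crashing
        codes.append(int(m.group(1)))
# codes is now [200, 404]

# The same compiled object also offers sub, here limited to one replacement
masked = STATUS.sub("status=***", "status=200 status=500", count=1)
# 'status=*** status=500'
```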

Rewriting text with sub

re.sub replaces matches. Backreferences like \1 or \g<name> let you reuse captured text in the replacement, which makes reshaping data a one-liner.

# Collapse runs of whitespace
re.sub(r"\s+", " ", "too    many   spaces")   # 'too many spaces'

# Reorder "Last, First" into "First Last"
re.sub(r"(\w+),\s*(\w+)", r"\2 \1", "Doe, Jane")   # 'Jane Doe'

When a replacement needs real logic, pass a function instead of a string. It receives each match object and returns the replacement text — perfect for computed substitutions.

def cents_to_dollars(m):
    return f"${int(m.group(1)) / 100:.2f}"

re.sub(r"(\d+)c", cents_to_dollars, "price 250c")   # 'price $2.50'
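The \g<name> form mentioned earlier works the same way as \1, and it pairs naturally with named groups. Here is a small sketch; the ISO_DATE pattern, the MONTHS lookup table, and the month_name helper are assumptions made up for the example.

```python
import re

ISO_DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named backreferences in the replacement string
european = ISO_DATE.sub(r"\g<day>/\g<month>/\g<year>", "due 2026-06-29")
# 'due 29/06/2026'

# \g<1> avoids ambiguity when a digit follows the reference:
# r"\10" would mean group 10, while r"\g<1>0" means group 1 followed by "0"
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
# 'a10b20'

# A replacement function can read named groups from the match object
MONTHS = {"01": "Jan", "06": "Jun", "12": "Dec"}   # deliberately partial

def month_name(m):
    month = MONTHS.get(m["month"], m["month"])     # fall back to the number
    return f"{m['day']} {month} {m['year']}"

named = ISO_DATE.sub(month_name, "due 2026-06-29")
# 'due 29 Jun 2026'
```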

Greedy vs. non-greedy matching

By default, quantifiers like + and * are greedy: they grab as much as possible. Add a ? to make them lazy and stop at the first opportunity. This is the single most common source of "why did my regex eat the whole line?"

re.search(r"<(.+)>", "<a><b>").group(1)    # 'a><b'  — greedy
re.search(r"<(.+?)>", "<a><b>").group(1)   # 'a'      — non-greedy
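Running the same patterns through findall makes the difference easier to see, and it also shows a third option: a negated character class such as [^>]+, which often states the intent more directly than a lazy quantifier. The sample strings below are made up for illustration.

```python
import re

tags = "<a><b>"

greedy = re.findall(r"<(.+)>", tags)       # ['a><b']   one overlong match
lazy = re.findall(r"<(.+?)>", tags)        # ['a', 'b']
negated = re.findall(r"<([^>]+)>", tags)   # ['a', 'b'] "anything except >"

quote = 'say "hi" and "bye"'

greedy_quotes = re.findall(r'"(.*)"', quote)   # ['hi" and "bye']
lazy_quotes = re.findall(r'"(.*?)"', quote)    # ['hi', 'bye']
```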

Flags that change the rules

Flags adjust how a pattern is interpreted. The three you will reach for most are re.IGNORECASE, re.MULTILINE (so ^ and $ match at line boundaries), and re.VERBOSE (which lets you space out a pattern and add comments).

re.findall(r"^\w+", "alpha\nbeta\ngamma", re.MULTILINE)
# ['alpha', 'beta', 'gamma']

phone = re.compile(r"""
    (\d{3})    # area code
    [-.\s]?
    (\d{3})    # prefix
    [-.\s]?
    (\d{4})    # line number
""", re.VERBOSE)

phone.search("call 415-555-2671 now").groups()
# ('415', '555', '2671')

Verbose mode is underused and well worth the habit: a phone or date pattern that would otherwise be an unreadable wall of symbols becomes self-documenting.
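Flags can also be combined with the | operator, or written inline at the start of the pattern. A brief sketch, using an invented three-line log:

```python
import re

log = "error: disk full\nWarning: low memory\nERROR: fan failure"

# IGNORECASE on its own
errors = re.findall(r"error", log, re.IGNORECASE)
# ['error', 'ERROR']

# Combine flags with | : "error" at the start of any line, in any case
messages = re.findall(r"^error: (.*)$", log, re.IGNORECASE | re.MULTILINE)
# ['disk full', 'fan failure']

# The same two flags written inline as (?im) at the start of the pattern
inline = re.findall(r"(?im)^error: (.*)$", log)
# ['disk full', 'fan failure']
```

The inline form is handy when the pattern lives in a config file or is passed to code that does not accept a flags argument.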

Lookahead and lookbehind

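Sometimes you need to assert that something is (or isn't) nearby without including it in the match. Lookarounds do exactly that. A lookahead (?=...) checks what follows; a lookbehind (?<=...) checks what precedes. The negative forms, (?!...) and (?<!...), assert that the text is absent. All of them match a position, not characters, which makes them ideal for extraction. One restriction in re: a lookbehind must match text of a fixed length, so (?<=\$) is fine but (?<=\$+) raises an error.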

# Grab the number after a $ without capturing the $ itself
re.findall(r"(?<=\$)\d+", "$10 and $20")   # ['10', '20']

# Assert that a digit appears somewhere ahead (one building block of a password rule)
bool(re.search(r"(?=.*\d)", "abc1"))   # True
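A single lookahead like the one above could be replaced by a plain search for \d; lookaheads earn their keep when several conditions must hold at once. The sketch below assumes a made-up password rule (at least 8 characters, one digit, one uppercase letter) and also shows a negative lookbehind.

```python
import re

# Hypothetical rule: 8+ characters, at least one digit, one uppercase letter.
# Each lookahead checks one condition from the start without consuming text.
PASSWORD = re.compile(r"(?=.*\d)(?=.*[A-Z]).{8,}")

def is_strong(candidate):
    return PASSWORD.fullmatch(candidate) is not None

strong = is_strong("Secret123")      # True
no_upper = is_strong("secret123")    # False, no uppercase letter
too_short = is_strong("Sec1")        # False, fewer than 8 characters

# Negative lookbehind: numbers NOT preceded by "$" (or by another digit,
# which stops the engine from matching the tail of a priced number)
plain_numbers = re.findall(r"(?<![$\d])\d+", "$10 and 20 items, $300")
# ['20']
```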

Splitting and escaping

re.split splits on a pattern rather than a fixed delimiter, which is great for inconsistent separators. And whenever you build a pattern from untrusted or arbitrary text, pass that text through re.escape first so characters like . or + are treated literally instead of as metacharacters.

re.split(r"[,;]\s*", "a, b; c,d")   # ['a', 'b', 'c', 'd']

user_input = "a.b+c"
re.search(re.escape(user_input), "x a.b+c y").group()   # 'a.b+c'
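Two details of re.split are worth a quick sketch: a capturing group in the pattern keeps the separators in the result, and maxsplit caps the number of splits. The last part shows re.escape used to build an alternation from a list of literal terms; the terms themselves are arbitrary examples.

```python
import re

# A capturing group in the pattern keeps the separators in the output
parts = re.split(r"([,;])\s*", "a, b; c")
# ['a', ',', 'b', ';', 'c']

# maxsplit limits how many splits are made
head_tail = re.split(r"\s*=\s*", "key = a=b", maxsplit=1)
# ['key', 'a=b']

# Build one pattern from several literal terms, escaping each one
terms = ["C++", "a.b", "x|y"]
literal_any = re.compile("|".join(re.escape(t) for t in terms))
found = literal_any.findall("C++ beats a.b, axb, and x|y")
# ['C++', 'a.b', 'x|y']  ("axb" is not matched because the dot is literal)
```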

Common pitfalls to remember

A few traps account for most regex bugs in the wild. Always use raw strings (r"...") for patterns so backslashes reach the regex engine intact. Watch greediness — reach for the lazy ? when a match runs too far. Mind findall's group behavior, using (?:...) when you group without wanting to capture. And know when not to use regex at all: parsing HTML or JSON is a job for a real parser, and simple checks like "@" in s or str.startswith are clearer and faster than a pattern.
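Two of these pitfalls are easy to demonstrate. In the sketch below, the sample strings and the JSON payload are invented for illustration.

```python
import json
import re

# Without the r prefix, "\b" is a backspace character, not a word boundary
broken = re.findall("\bcat\b", "cat concat cat")    # []
fixed = re.findall(r"\bcat\b", "cat concat cat")    # ['cat', 'cat']

# Simple checks read better without a pattern
is_csv = "report.csv".endswith(".csv")   # True
has_at = "@" in "a@x.com"                # True

# Structured formats deserve a real parser
payload = json.loads('{"level": "ERROR", "code": 507}')
level = payload["level"]                 # 'ERROR'
```

The broken pattern fails silently, with no error and no warning, which is exactly why the raw-string habit is worth forming early.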

Wrap-up and next steps

The re module rewards a small investment. Learn the difference between search, match, and fullmatch; lean on named groups and groupdict(); use sub with a function for anything non-trivial; and keep greediness and raw strings in mind. From here, explore the standard library docs for the full character-class reference, try the third-party regex package if you need advanced features like overlapping matches, and practice against your own log files — there is no faster way to internalize patterns than solving a real text-wrangling problem you actually have.