Q1. What does re.match(r"\d+", "abc 123") return?
Regular Expressions re — Pattern Search and Replacement
Learn Python's re module from the ground up. Covers when to reach for re.match / re.search / re.findall, combining metacharacters \d / \w / \s / * / + / ?, capturing groups with ( ), substitution with re.sub, and reusing patterns with re.compile — with runnable practice exercises.
This article walks through Python's re module for regular expressions: extracting and replacing the substrings that match a specific pattern. Tasks that come up constantly in real projects, such as parsing phone numbers, emails, log lines, and URLs, often shrink to a line or two.
A tool for trying regex live
Regular expressions have a lot of moving parts and are hard to reason about purely in your head. To check whether your pattern matches what you intended, the Regex Extractor runs entirely in the browser — type a pattern and some text and see the matches in real time. Keeping it open alongside this article makes it much easier to follow along.
match, search, and findall — Three Search Functions and When to Use Each
The re module exposes several search functions, and you pick among three depending on what you need. The names are descriptive — match matches at the start, search searches for one match anywhere, and findall finds them all. The exact search range, return type, and behavior on no-match are summarized in the next table.
| Function | Search range | Return | On no match |
|---|---|---|---|
| re.match | Start of the string only | Match object | None |
| re.search | First match anywhere | Match object | None |
| re.findall | All non-overlapping matches | List of strings (group contents, or tuples of them, if the pattern has capture groups) | Empty list [] |
From the Match object that re.match and re.search return (an object holding the match position, matched string, and group info), you read the matched string by calling its `.group()` method — m.group() or m.group(0) for the whole match, and (with the capture groups introduced later) m.group(1) for just what was inside the parentheses. Only re.findall returns a list directly, so you don't call .group() on it.
| Metacharacter | Meaning | Example |
|---|---|---|
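| \d | A single digit (0-9; on str patterns also other Unicode decimal digits unless re.ASCII is set) | \d+ → one or more digits |
| \w | A word character (letters, digits, and underscore; Unicode-aware on str patterns) | \w+ → IDs and keywords |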
| \s | A whitespace character (space / tab / newline) | Separators |
| . | Any single character except newline | Wildcard |
| * | Zero or more of the previous | a* → empty is OK too |
| + | One or more of the previous | a+ → at least one |
| ? | Zero or one of the previous | Optional |
| [abc] | One of a / b / c | Choice |
| ^ / $ | Start / end of string | Anchors |
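To make the table concrete, here is a small sketch that runs a few of these metacharacters through re.findall and re.search. The sample strings and variable names are made up for illustration.

```python
import re

sample = "id_42 costs 1500 yen"  # illustrative text, not from a real dataset

digits = re.findall(r"\d+", sample)        # runs of digits
words = re.findall(r"\w+", sample)         # runs of word characters
spaces = re.findall(r"\s", sample)         # each whitespace character
optional = re.findall(r"colou?r", "color colour")  # 'u' is optional
choice = re.findall(r"[abc]+", "cab fare")         # runs of a, b, or c
all_digits = bool(re.search(r"^\d+$", "12345"))    # anchored at both ends
not_all_digits = bool(re.search(r"^\d+$", "123a5"))

print(digits)          # ['42', '1500']
print(words)           # ['id_42', 'costs', '1500', 'yen']
print(spaces)          # [' ', ' ', ' ']
print(optional)        # ['color', 'colour']
print(choice)          # ['cab', 'a']
print(all_digits)      # True
print(not_all_digits)  # False
```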
import re
text = "user_id: 12345, age: 30"
# match: from the start (\w+ is a run of word characters)
m = re.match(r"\w+", text)
print(m.group()) # user_id
# search: first run of digits anywhere
s = re.search(r"\d+", text)
print(s.group()) # 12345
# findall: every run of digits
nums = re.findall(r"\d+", text)
print(nums) # ['12345', '30']
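The no-match behavior from the table is easy to check directly. The sketch below uses a string whose digits are not at the start, so re.match fails while re.search succeeds. The variable names are illustrative.

```python
import re

text = "abc 123"

at_start = re.match(r"\d+", text)    # digits are not at position 0
anywhere = re.search(r"\d+", text)   # scans forward until it finds digits
all_runs = re.findall(r"\d+", "no digits here")

print(at_start)          # None
print(anywhere.group())  # 123
print(all_runs)          # []
```

Because re.findall returns an empty list rather than None, you can loop over its result without a prior check.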
Write regex as a raw string r"..."
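Backslashes show up everywhere inside regex. In a regular string literal, Python processes escape sequences before re ever sees the pattern: "\b" becomes a backspace character instead of the regex word boundary, and an unrecognized escape such as "\d" survives but triggers a warning in recent Python versions (a SyntaxWarning since 3.12). It's safer to write the raw string `r"\d"` with the leading r, which passes the backslashes through untouched. Some editors also highlight raw strings as regex, which improves readability.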
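A short sketch of the difference, using \b (the regex word boundary, which this article does not otherwise cover) because a regular string turns it into a backspace character. The sample text is made up.

```python
import re

plain_len = len("\b")   # 1: the string layer turned it into a backspace
raw_len = len(r"\b")    # 2: a backslash followed by 'b'

text = "cat concat"
with_raw = re.findall(r"\bcat\b", text)    # word boundaries reach the regex engine
without_raw = re.findall("\bcat\b", text)  # pattern contains literal backspaces

print(plain_len, raw_len)  # 1 2
print(with_raw)            # ['cat']
print(without_raw)         # []
```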
Capture Groups — Pulling Specific Parts Out of a Pattern
Anything you put in `( )` inside a regex becomes a capture group. Instead of just the whole match, you can pull each piece out separately. A pattern like r"#(\d+) placed on (\d{4})-(\d{2})-(\d{2})" lets you split an order number and a date out of a log line in one shot.
Call `.group(N)` on the Match object to read the N-th group (numbered from 1). .group(0) (or .group() with no argument) returns the whole match.
import re
text = "Order #1234 placed on 2024-03-15"
# What the pattern means:
# # → literal '#'
# (\d+) → one or more digits → group(1) order number
# placed on → literal 'placed on'
# (\d{4}) → 4 digits → group(2) year
# (\d{2}) → 2 digits → group(3) month
# (\d{2}) → 2 digits → group(4) day
m = re.search(r"#(\d+) placed on (\d{4})-(\d{2})-(\d{2})", text)
if m:
    print("whole:", m.group(0))    # #1234 placed on 2024-03-15
    print("order #:", m.group(1))  # 1234
    print("year:", m.group(2))     # 2024
    print("month:", m.group(3))    # 03
    print("day:", m.group(4))      # 15
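As a complement to reading groups one at a time, the sketch below uses Match.groups() to unpack every group at once, and shows that re.findall returns tuples when the pattern contains several groups. The second sample string is invented for illustration.

```python
import re

text = "Order #1234 placed on 2024-03-15"
m = re.search(r"#(\d+) placed on (\d{4})-(\d{2})-(\d{2})", text)
if m:
    # groups() returns all captured groups as a tuple, in order
    order, year, month, day = m.groups()
    print(order, year, month, day)  # 1234 2024 03 15

# With more than one group, findall yields one tuple per match
dates = re.findall(r"(\d{4})-(\d{2})-(\d{2})", "2024-03-15 and 2025-01-02")
print(dates)  # [('2024', '03', '15'), ('2025', '01', '02')]
```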
Calling .group() when the match is None raises an error
When re.search doesn't find the pattern it returns None. Calling m.group() on that raises AttributeError: 'NoneType' object has no attribute 'group'. Always check with `if m:` before calling .group(), or do both in one step using the walrus operator (Python 3.8+): if m := re.search(...): ....
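One way to package the None check is a small helper. The function name extract_order_id and the sample lines are hypothetical, and the walrus form assumes Python 3.8 or later.

```python
import re

def extract_order_id(line):
    """Return the order number as a string, or None if the line has none.

    Hypothetical helper for illustration only.
    """
    if m := re.search(r"#(\d+)", line):  # m is bound only when a match exists
        return m.group(1)
    return None

print(extract_order_id("Order #1234 placed"))  # 1234
print(extract_order_id("No order here"))       # None
```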
re.sub — Replacing Pattern Matches
"Mask PII out of a log", "strip HTML tags and keep the body text", "normalize a mix of full-width and half-width whitespace" — they all come down to "rewrite anything matching a pattern into something else". The string replace only handles fixed substrings, but re.sub does it by pattern.
`re.sub(pattern, replacement, original)` returns a new string with each match replaced by the replacement. The original string is unchanged (Python strings are immutable, so you always work with the return value).
import re
# Mask digits in a phone number (replace each \d with one *)
text = "Tel: 03-1234-5678"
masked = re.sub(r"\d", "*", text)
print(masked)
# Tel: **-****-****
# Strip HTML tags to keep just the body text
html = "<p>Hello <b>World</b></p>"
plain = re.sub(r"<[^>]+>", "", html)
print(plain)
# Hello World
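Two more uses mentioned earlier, whitespace normalization and PII masking, can be sketched the same way. The email pattern here is a deliberately simplified assumption, not a full address validator, and the sample strings are invented.

```python
import re

# \u3000 is the full-width (ideographic) space; on str patterns \s matches it too
raw = "Tokyo\u3000 Station \t East  Exit"
normalized = re.sub(r"\s+", " ", raw)  # collapse each whitespace run to one space
print(normalized)  # Tokyo Station East Exit
print(raw == normalized)  # False: the original string is left untouched

# Simplified email pattern (assumption): word chars, '@', dotted word chars
log = "login ok: alice@example.com from 10.0.0.5"
masked = re.sub(r"\w+@\w+(\.\w+)+", "<email>", log)
print(masked)  # login ok: <email> from 10.0.0.5
```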
re.compile — Reusing a Pattern
When you use the same regex repeatedly, `re.compile(pattern)` builds a compiled pattern object once, and you call methods on it like pattern.search(...) / pattern.findall(...) / pattern.sub(...). The module-level functions such as re.search(r"...", text) keep an internal cache of recently compiled patterns, so the pattern is not re-parsed on every call, but each call still pays for a cache lookup. A named, precompiled pattern reads better and is usually slightly faster in tight loops.
import re
# Reuse the same phone-number pattern
phone_re = re.compile(r"\d{2,4}-\d{4}-\d{4}")
print(phone_re.findall("03-1234-5678 or 080-1111-2222"))
# ['03-1234-5678', '080-1111-2222']
print(phone_re.search("my phone is 03-9999-0000").group())
# 03-9999-0000
print(phone_re.sub("<phone>", "Contact: 03-1234-5678"))
# Contact: <phone>
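A typical place where a compiled pattern pays off is a loop over many lines. The sketch below assumes a made-up log format (date, level, message), and the name LOG_LINE is illustrative.

```python
import re

# Hypothetical log format: "YYYY-MM-DD LEVEL: message"
LOG_LINE = re.compile(r"(\d{4}-\d{2}-\d{2}) (\w+): (.*)")

lines = [
    "2024-03-15 ERROR: disk full",
    "not a log line",
    "2024-03-16 INFO: backup done",
]

parsed = []
for line in lines:
    # Pattern.match anchors at the start, like re.match
    if m := LOG_LINE.match(line):
        parsed.append(m.groups())

print(parsed)
# [('2024-03-15', 'ERROR', 'disk full'), ('2024-03-16', 'INFO', 'backup done')]
```

Lines that do not fit the format are skipped because match returns None for them.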
Knowledge Check
Answer each question one by one.
Q2. Which is the correct regex for one or more digits in a row?
Q3. From re.search(r"(\w+)@(\w+)", "alice@example"), which call returns just the domain?
Q4. What's the main reason to use a raw string `r"..."` when writing regex in Python?