
Regular Expressions re — Pattern Search and Replacement

Learn Python's re module from the ground up. Covers when to reach for re.match / re.search / re.findall, combining metacharacters \d / \w / \s / * / + / ?, capturing groups with ( ), substitution with re.sub, and reusing patterns with re.compile — with runnable practice exercises.

This article walks through the re module for regular expressions: "extracting and replacing the substrings that match a specific pattern". Things you do constantly in real projects (parsing phone numbers, emails, log lines, and URLs) often become one-liners.

A tool for trying regex live

Regular expressions have a lot of moving parts and are hard to reason about purely in your head. To check whether your pattern matches what you intended, the Regex Extractor runs entirely in the browser — type a pattern and some text and see the matches in real time. Keeping it open alongside this article makes it much easier to follow along.

match, search, and findall — Three Search Functions and When to Use Each

The re module exposes several search functions, and you pick among three depending on what you need. The names are descriptive — match matches at the start, search searches for one match anywhere, and findall finds them all. The exact search range, return type, and behavior on no-match are summarized in the next table.

| Function | Search range | Return | On no match |
| --- | --- | --- | --- |
| re.match | Start of the string only | Match object | None |
| re.search | First match anywhere | Match object | None |
| re.findall | All matches | List of strings | Empty list [] |

re.match and re.search return a Match object (an object holding the match position, matched string, and group info). You read the matched string by calling its `.group()` method: m.group() or m.group(0) for the whole match, and (with the capture groups introduced later) m.group(1) for just what was inside the first pair of parentheses. Only re.findall returns a list directly, so you don't call .group() on it.
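As a small illustration of what a Match object holds, the sketch below reads both the matched text and its position. The sample string is made up for this example; `.start()`, `.end()`, and `.span()` are standard Match methods that the article does not otherwise use.

```python
import re

m = re.search(r"\d+", "age: 30")

# The whole match, with or without the explicit 0
print(m.group())            # 30
print(m.group(0))           # 30

# Where the match sits inside the original string
print(m.start(), m.end())   # 5 7
print(m.span())             # (5, 7)
```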

How re.match / search / findall Differ
re.match: match from start; if the start doesn't match → None
re.search: first match anywhere; Match if found, None otherwise
re.findall: all matches; list of matched strings ([] if none)
match only checks whether the pattern appears at the start of the string. search returns the first match at any position. findall returns every match in a list.
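To see the no-match behavior side by side, here is a minimal sketch using an invented sample string; the variable names are only for illustration.

```python
import re

text = "id 42"

at_start = re.match(r"\d+", text)    # text starts with "id", so no match
anywhere = re.search(r"\d+", text)   # finds "42" later in the string
nothing = re.findall(r"\d+", "no digits here")

print(at_start)           # None
print(anywhere.group())   # 42
print(nothing)            # []
```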
| Metacharacter | Meaning | Example |
| --- | --- | --- |
| `\d` | A single digit (0-9) | `\d+` → one or more digits |
| `\w` | A word character (alphanumeric + underscore) | `\w+` → IDs and keywords |
| `\s` | A whitespace character (space / tab / newline) | Separators |
| `.` | Any single character except newline | Wildcard |
| `*` | Zero or more of the previous | `a*` → empty is OK too |
| `+` | One or more of the previous | `a+` → at least one |
| `?` | Zero or one of the previous | Optional |
| `[abc]` | One of a / b / c | Choice |
| `^` / `$` | Start / end of string | Anchors |

import re

text = "user_id: 12345, age: 30"

# match: from the start (\w+ is a run of word characters)
m = re.match(r"\w+", text)
print(m.group())            # user_id

# search: first run of digits anywhere
s = re.search(r"\d+", text)
print(s.group())            # 12345

# findall: every run of digits
nums = re.findall(r"\d+", text)
print(nums)                 # ['12345', '30']
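The example above only exercises `\w+` and `\d+`. The following sketch tries the remaining metacharacters from the table on small made-up strings, so you can see each one in isolation.

```python
import re

# ? makes the previous character optional
print(re.findall(r"colou?r", "color colour"))     # ['color', 'colour']

# * allows zero or more of the previous character
print(re.findall(r"ab*", "a ab abbb"))            # ['a', 'ab', 'abbb']

# [bc] matches exactly one of the listed characters
print(re.findall(r"[bc]at", "bat cat hat"))       # ['bat', 'cat']

# . matches any single character except a newline
print(re.findall(r"h.t", "hat hot hit h\nt"))     # ['hat', 'hot', 'hit']

# ^ and $ anchor the pattern to the start and end of the string
print(re.search(r"^\d+$", "12345") is not None)   # True
print(re.search(r"^\d+$", "123a5") is not None)   # False
```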

Write regex as a raw string r"..."

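Backslashes show up everywhere inside regex. In a regular string, Python processes backslash escapes before re ever sees the pattern: "\b" becomes a backspace character instead of a word boundary, and unrecognized escapes such as "\d" pass through unchanged but trigger a warning in recent Python versions. Writing the raw string `r"\d"` with the leading r avoids both problems. Some editors also highlight raw strings as regex, which improves readability.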
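A quick way to see the difference is an escape that means something to both layers. The sketch below uses `\b` (a word boundary in regex, which this article does not otherwise cover) purely because it collides with the string escape for backspace.

```python
import re

text = "cat concat"

# In a regular string "\b" is one character (backspace);
# in a raw string it stays as backslash + "b".
print(len("\b"), len(r"\b"))           # 1 2

# Without r, the pattern looks for backspace characters and finds nothing.
print(re.findall("\bcat\b", text))     # []

# With r, re receives \b and treats it as a word boundary.
print(re.findall(r"\bcat\b", text))    # ['cat']
```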

Pull an ID and numbers out of a log line. Try re.match / re.search / re.findall against the same string and observe how the results differ.

① Import re.

② Set text = "order_id: 9876, qty: 3, price: 1500".

③ Pull a run of word characters from the start of the string and print it as match: ◯◯.

④ Pull the first run of digits out of the string and print it as search: ◯◯.

⑤ Pull every run of digits out as a list and print it as findall: ◯◯.
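If you want something to compare against after trying the steps yourself, one possible solution looks like this. It is a sketch, not the only correct answer.

```python
import re

text = "order_id: 9876, qty: 3, price: 1500"

# match: run of word characters at the start
m = re.match(r"\w+", text)
if m:
    print("match:", m.group())      # match: order_id

# search: first run of digits anywhere
s = re.search(r"\d+", text)
if s:
    print("search:", s.group())     # search: 9876

# findall: every run of digits, as a list
nums = re.findall(r"\d+", text)
print("findall:", nums)             # findall: ['9876', '3', '1500']
```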


Capture Groups — Pulling Specific Parts Out of a Pattern

Anything you put in `( )` inside a regex becomes a capture group: instead of just the whole match, you can pull each piece out separately. Patterns like r"#(\d+) placed on (\d{4})-(\d{2})-(\d{2})" let you split an order number and a date out of a log line in one shot.

Call `.group(N)` on the Match object to read the N-th group (numbered from 1). .group(0) (or .group() with no argument) returns the whole match.

How Capture Groups Work
Pattern: #(\d+) placed on (\d{4})-(\d{2})-(\d{2})
.group(0): whole match
.group(1): order number
.group(2): year
.group(3): month
.group(4): day
Each `( )` in the regex becomes a group, addressable by 1-based index like .group(1) / .group(2). .group(0) is the whole match.
import re

text = "Order #1234 placed on 2024-03-15"

# What the pattern means:
#   #         → literal '#'
#   (\d+)     → one or more digits → group(1) order number
#   placed on → literal 'placed on'
#   (\d{4})   → 4 digits → group(2) year
#   (\d{2})   → 2 digits → group(3) month
#   (\d{2})   → 2 digits → group(4) day
m = re.search(r"#(\d+) placed on (\d{4})-(\d{2})-(\d{2})", text)
if m:
    print("whole:", m.group(0))    # #1234 placed on 2024-03-15
    print("order #:", m.group(1))   # 1234
    print("year:", m.group(2))      # 2024
    print("month:", m.group(3))     # 03
    print("day:", m.group(4))       # 15
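Two related behaviors are worth knowing once a pattern has groups, although the article does not cover them: `.groups()` returns all captured groups as a tuple, and re.findall returns a list of tuples instead of a list of strings. A minimal sketch, with a second date added to the sample text for illustration:

```python
import re

text = "Order #1234 placed on 2024-03-15"
m = re.search(r"#(\d+) placed on (\d{4})-(\d{2})-(\d{2})", text)
if m:
    # All groups at once, in order
    print(m.groups())                  # ('1234', '2024', '03', '15')
    order, year, month, day = m.groups()
    print(year, month, day)            # 2024 03 15

# With groups in the pattern, findall yields one tuple per match
dates = re.findall(r"(\d{4})-(\d{2})-(\d{2})", "2024-03-15 and 2025-01-02")
print(dates)   # [('2024', '03', '15'), ('2025', '01', '02')]
```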

Calling .group() on None raises an error

When re.search doesn't find the pattern it returns None. Calling m.group() on that raises AttributeError: 'NoneType' object has no attribute 'group'. Always check with `if m:` before calling .group(), or do both in one step using the walrus operator (Python 3.8+): if m := re.search(...): ....
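Both guard styles can be sketched as follows. The helper name `first_number` is hypothetical and exists only for this example.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    if m:                      # guard against None before calling .group()
        return m.group()
    return None

print(first_number("qty: 3"))      # 3
print(first_number("no digits"))   # None

# The same check in one step with the walrus operator (Python 3.8+)
if (m := re.search(r"\d+", "price: 1500")):
    print(m.group())               # 1500
```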

Split an email address into username and domain. Use capture groups to pull both parts out in a single search.

① Import re.

② Set text = "contact us at alice@example.com".

③ Write an email pattern that captures what's on each side of the @ sign as separate groups.

- Left: word characters plus ., +, -, one or more

- Right: same kind of characters, ending with a domain like .com

④ When the match is found, print `username: ◯◯` and `domain: ◯◯`.
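One possible solution to compare against after your own attempt is sketched below. The pattern follows the hints in step ③ and is deliberately simple; it is not a complete email validator.

```python
import re

text = "contact us at alice@example.com"

# Group 1: word characters plus . + - before the @
# Group 2: word characters plus . - after the @, ending in .something
m = re.search(r"([\w.+-]+)@([\w.-]+\.\w+)", text)
if m:
    print("username:", m.group(1))   # username: alice
    print("domain:", m.group(2))     # domain: example.com
```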


re.sub — Replacing Pattern Matches

"Mask PII out of a log", "strip HTML tags and keep the body text", "normalize a mix of full-width and half-width whitespace": they all come down to "rewrite anything matching a pattern into something else". The string method str.replace only handles fixed substrings, but re.sub does it by pattern.

`re.sub(pattern, replacement, original)` returns a new string with each match replaced by the replacement. The original string is unchanged (Python strings are immutable, so you always work with the return value).

How re.sub Works
Original string: "Tel: 03-1234-5678" → re.sub(r"\d", "*", ...) → New string: "Tel: **-****-****"
Returns a new string with each match replaced by the replacement string. The original is immutable; receive the result via the return value.
import re

# Mask digits in a phone number (replace each \d with one *)
text = "Tel: 03-1234-5678"
masked = re.sub(r"\d", "*", text)
print(masked)
# Tel: **-****-****

# Strip HTML tags to keep just the body text
html = "<p>Hello <b>World</b></p>"
plain = re.sub(r"<[^>]+>", "", html)
print(plain)
# Hello World
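The whitespace case mentioned at the start of this section, plus two re.sub features the article does not cover (group references in the replacement and the `count` argument), can be sketched like this. All sample strings are invented for the example.

```python
import re

# Collapse any run of whitespace into one space.
# For str patterns, \s also matches the full-width space U+3000.
messy = "a \t b\u3000\u3000c"
print(re.sub(r"\s+", " ", messy))    # a b c

# \1, \2, \3 in the replacement refer to the captured groups
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "Date: 2024-03-15")
print(reordered)                     # Date: 15/03/2024

# count limits how many matches are replaced
print(re.sub(r"\d", "*", "03-1234", count=2))   # **-1234
```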

Mask out the digits in any phone number that appears in a log line by replacing them with `*`.

① Import re.

② Set text = "Contact: 03-1234-5678 or 090-9999-8888".

③ Use re.sub to replace each `\d` digit with a single `*`, and print the result as masked: ◯◯.

④ Print the original text again as original: ◯◯ to confirm it's unchanged (re.sub only returns a new string).
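For comparison after your own attempt, one possible solution is:

```python
import re

text = "Contact: 03-1234-5678 or 090-9999-8888"

# Each digit becomes one *, so hyphens and lengths are preserved
masked = re.sub(r"\d", "*", text)
print("masked:", masked)
# masked: Contact: **-****-**** or ***-****-****

# re.sub returned a new string; text itself is untouched
print("original:", text)
# original: Contact: 03-1234-5678 or 090-9999-8888
```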


re.compile — Reusing a Pattern

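When you use the same regex repeatedly, writing re.search(r"...", text) over and over scatters the pattern string through your code and makes every call look the pattern up again. (The re module caches recently compiled patterns, so the pattern is not literally re-parsed on each call, and the speed gain from compiling is usually modest.) `re.compile(pattern)` builds a compiled pattern object once, and you call methods on it like pattern.search(...) / pattern.findall(...) / pattern.sub(...). The pattern gets a name, the code reads better, and tight loops skip the repeated lookup.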

How re.compile Is Used
r"\d{2,4}-\d{4}-\d{4}" → re.compile(...) → phone_re (pattern object) → phone_re.search(text) / phone_re.findall(text) / phone_re.sub("*", text)
`re.compile(pattern)` gives you a pattern object you can call .search / .findall / .sub on as many times as you need. Compile when you reuse the same pattern.
import re

# Reuse the same phone-number pattern
phone_re = re.compile(r"\d{2,4}-\d{4}-\d{4}")

print(phone_re.findall("03-1234-5678 or 080-1111-2222"))
# ['03-1234-5678', '080-1111-2222']

print(phone_re.search("my phone is 03-9999-0000").group())
# 03-9999-0000

print(phone_re.sub("<phone>", "Contact: 03-1234-5678"))
# Contact: <phone>
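A typical place where a compiled pattern pays off is a loop over many lines. The sketch below assumes a small made-up list of log lines; the names `lines` and `found` are illustrative.

```python
import re

phone_re = re.compile(r"\d{2,4}-\d{4}-\d{4}")

lines = [
    "call from 03-1234-5678",
    "no number here",
    "callback to 080-1111-2222",
]

found = []
for line in lines:
    m = phone_re.search(line)   # same compiled object on every iteration
    if m:
        found.append(m.group())

print(found)              # ['03-1234-5678', '080-1111-2222']
print(phone_re.pattern)   # the original pattern string is kept on the object
```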

Build the phone-number pattern once with `re.compile`, then count and substitute against the same text in a row.

① Import re.

② Set text = "Contact: 03-1234-5678 or 090-9999-8888".

③ Compile a phone-number pattern that's 2-4 digits + 4 digits + 4 digits with re.compile, and store it as phone_re.

④ Use phone_re.findall(text) to count how many phone numbers there are, and print it as phone count: ◯.

⑤ Use phone_re.sub to replace each entire phone number with `<phone>`, and print the result as replaced: ◯◯.
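One possible solution, for comparison once you have tried the steps yourself:

```python
import re

text = "Contact: 03-1234-5678 or 090-9999-8888"

# Compile once, then reuse for both counting and substitution
phone_re = re.compile(r"\d{2,4}-\d{4}-\d{4}")

phones = phone_re.findall(text)
print("phone count:", len(phones))   # phone count: 2

replaced = phone_re.sub("<phone>", text)
print("replaced:", replaced)         # replaced: Contact: <phone> or <phone>
```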

QUIZ

Knowledge Check

Answer each question one by one.

Q1. What does re.match(r"\d+", "abc 123") return?

Q2. Which is the correct regex for one or more digits in a row?

Q3. From re.search(r"(\w+)@(\w+)", "alice@example"), which call returns just the domain?

Q4. What's the main reason to use a raw string `r"..."` when writing regex in Python?