re

The re module of the Python standard library also contains functions for working with character strings. It offers more sophisticated options for pattern extraction and replacement than the str type.

>>> import re
>>> re.sub("\n", "", welcome)
'Hello pythonistas!'

Here, the regular expression is first compiled and then its re.Pattern.sub() method is called on the passed text. You can also compile the expression yourself with re.compile() to create a reusable regex object, which saves CPU cycles when the same pattern is applied to many different strings:

>>> regex = re.compile("\n")
>>> regex.sub("", welcome)
'Hello pythonistas!'

If you instead want a list of all substrings that match the pattern, you can use the re.Pattern.findall() method:

>>> regex.findall(welcome)
['\n']

Note

To avoid the awkward escaping with \ in a regular expression, you can use raw string literals such as r'C:\PATH\TO\FILE' instead of the corresponding 'C:\\PATH\\TO\\FILE'.
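The difference between plain and raw string literals can be checked directly. The following is a minimal sketch; the sample strings are assumptions chosen for illustration and do not come from the tutorial:

```python
import re

# A raw literal keeps the backslash, so both spellings are the same string.
same_pattern = "\\d+" == r"\d+"

# "\n" is a single newline character, r"\n" is a backslash followed by "n".
plain_len = len("\n")
raw_len = len(r"\n")

# The regex engine understands both forms as "match a newline".
text = "Hello pythonistas!\n"  # assumed sample text
without_plain = re.sub("\n", "", text)
without_raw = re.sub(r"\n", "", text)

# For \b the difference matters: in a plain literal "\b" is a backspace
# character, in a raw literal it is the regex word boundary.
found_raw = re.findall(r"\bcusy\b", "cusy GmbH")
found_plain = re.findall("\bcusy\b", "cusy GmbH")

print(same_pattern, plain_len, raw_len)
print(found_raw, found_plain)
```

Writing every pattern as a raw literal avoids having to remember which escape sequences Python interprets before the regex engine sees them.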

re.Pattern.match() and re.Pattern.search() are closely related to re.Pattern.findall(). While findall returns all matches in a string, search returns only the first match, and match only matches at the beginning of the string. As a less trivial example, consider a block of text and a regular expression that can identify most email addresses:

>>> addresses = """Veit <veit@cusy.io>
... Veit Schiele <veit.schiele@cusy.io>
... cusy GmbH <info@cusy.io>
... """
>>> pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
>>> regex.findall(addresses)
['veit@cusy.io', 'veit.schiele@cusy.io', 'info@cusy.io']
>>> regex.search(addresses)
<re.Match object; span=(6, 18), match='veit@cusy.io'>
>>> print(regex.match(addresses))
None

regex.match returns None because match only succeeds if the pattern matches at the beginning of the string, and addresses starts with a name rather than an email address.

Suppose you want to find email addresses and at the same time split each address into its three components:

  1. username

  2. domain name

  3. domain suffix

To do this, you first place parentheses () around the parts of the pattern to be segmented:

>>> pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
>>> match = regex.match("veit@cusy.io")
>>> match.groups()
('veit', 'cusy', 'io')

re.Match.groups() returns a tuple containing all subgroups of the match.
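Individual groups can also be read one at a time with re.Match.group(), and groups can be given names with the (?P<name>…) syntax. The following is a minimal sketch; the group names user, domain and suffix and the variable named_regex are chosen for illustration only:

```python
import re

# Same email pattern as above, but each group carries an illustrative name.
named_pattern = (
    r"(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})"
)
named_regex = re.compile(named_pattern, flags=re.IGNORECASE)
named_match = named_regex.match("veit@cusy.io")

whole = named_match.group(0)  # group 0 is the entire match
first = named_match.group(1)  # groups are numbered from 1
domain = named_match.group("domain")  # access by name
parts = named_match.groupdict()  # all named groups as a dict

print(whole, first, domain)
print(parts)
```

Named groups tend to keep longer patterns readable, because the calling code no longer depends on the position of each pair of parentheses.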

re.Pattern.findall() returns a list of tuples if the pattern contains groups:

>>> regex.findall(addresses)
[('veit', 'cusy', 'io'), ('veit.schiele', 'cusy', 'io'), ('info', 'cusy', 'io')]

Groups can also be used in re.Pattern.sub() where \1 stands for the first matching group, \2 for the second and so on:

>>> regex.findall(addresses)
[('veit', 'cusy', 'io'), ('veit.schiele', 'cusy', 'io'), ('info', 'cusy', 'io')]
>>> print(regex.sub(r"Username: \1, Domain: \2, Suffix: \3", addresses))
Veit <Username: veit, Domain: cusy, Suffix: io>
Veit Schiele <Username: veit.schiele, Domain: cusy, Suffix: io>
cusy GmbH <Username: info, Domain: cusy, Suffix: io>
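Instead of a replacement string, sub() also accepts a function that receives each match object and returns the replacement text. The following is a minimal sketch that hides most of the username; the helper name mask_user and the masking format are assumptions made for this illustration:

```python
import re

addresses = """Veit <veit@cusy.io>
Veit Schiele <veit.schiele@cusy.io>
cusy GmbH <info@cusy.io>
"""
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)


def mask_user(match):
    """Keep the first character of the username and mask the rest."""
    user, domain, suffix = match.groups()
    return f"{user[0]}***@{domain}.{suffix}"


# sub() calls mask_user once for every non-overlapping match.
masked = regex.sub(mask_user, addresses)
print(masked)
```

A function is useful whenever the replacement depends on the matched text in a way that the \1, \2, … references cannot express.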

The following table contains a brief overview of the functions for regular expressions:

Function             Description
-------------------  ----------------------------------------------------------
re.findall()         returns all non-overlapping matches of the pattern in a
                     string as a list.
re.finditer()        like findall, but returns an iterator of match objects.
re.match()           matches the pattern at the beginning of the string and
                     optionally segments the pattern components into groups;
                     if the pattern matches, a match object is returned,
                     otherwise None.
re.search()          searches the string for the first match of the pattern
                     and returns a match object if one is found, otherwise
                     None; unlike match, the match can be anywhere in the
                     string and not just at the beginning.
re.split()           splits the string into parts each time the pattern
                     occurs.
re.sub(), re.subn()  replace all occurrences of the pattern in the string
                     with a replacement expression, or at most count
                     occurrences if count is given; subn additionally returns
                     the number of replacements made; the symbols \1, \2, …
                     in the replacement refer to the groups of the match.
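The functions from this overview that have not been demonstrated so far can be tried out as follows. This is a minimal sketch; the short sample strings passed to split and subn are assumptions chosen for illustration:

```python
import re

addresses = """Veit <veit@cusy.io>
Veit Schiele <veit.schiele@cusy.io>
cusy GmbH <info@cusy.io>
"""
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)

# finditer yields match objects, so the position of each match is available.
found = [(m.group(), m.span()) for m in regex.finditer(addresses)]

# split cuts the string wherever the pattern occurs.
fields = re.split(r"\s*[,;]\s*", "a, b;c ,d")

# count limits the number of replacements made by sub.
first_only = re.sub(r"\n", " ", "a\nb\nc", count=1)

# subn returns the new string together with the number of replacements.
replaced, number = re.subn(r"\n", " ", "a\nb\nc")

print(found[0])
print(fields)
print(repr(first_only), repr(replaced), number)
```

Note that subn does not limit the number of replacements by itself; as with sub, that is the job of the count argument.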

Checks

  • Which regular expression would you use to find strings that represent the numbers between -3 and +3?

  • Which regular expression would you use to find hexadecimal values?