re
The Python standard library module re also contains functions for working with
character strings. However, re offers more sophisticated options for pattern
extraction and replacement than the str type.
>>> import re
>>> re.sub("\n", "", welcome)
'Hello pythonistas!'
Here, the regular expression is first compiled and then its re.Pattern.sub()
method is called on the passed text. You can also compile the expression
yourself with re.compile() to create a reusable regex object, which saves work
when the same expression is applied to many different strings:
>>> regex = re.compile("\n")
>>> regex.sub("", welcome)
'Hello pythonistas!'
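As a minimal sketch of this reuse, the following compiles the pattern once and applies it to several strings. The list greetings is hypothetical sample data and not part of the example above.

```python
import re

# Compile the pattern once and reuse it for every string.
newline = re.compile(r"\n")

# Hypothetical sample data for illustration only.
greetings = ["Hello\n pythonistas!", "Hello\nworld!", "No line break"]

cleaned = [newline.sub("", text) for text in greetings]
print(cleaned)  # ['Hello pythonistas!', 'Helloworld!', 'No line break']
```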
If you instead want a list of all substrings that match the regex object, you
can use the re.Pattern.findall() method:
>>> regex.findall(welcome)
['\n']
Note
To avoid awkward escaping with \ in a regular expression, you can use raw
string literals such as r'C:\PATH\TO\FILE' instead of the corresponding
'C:\\PATH\\TO\\FILE'.
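The following short sketch illustrates the note: both literals describe the same pattern, but the raw string is easier to read. The sample sentence is made up for illustration.

```python
import re

escaped = "\\d+\\.\\d+"  # ordinary literal: every backslash is doubled
raw = r"\d+\.\d+"        # raw literal: backslashes are written once

# Both spellings produce exactly the same pattern string.
print(escaped == raw)  # True

versions = re.findall(raw, "Python 3.12 and 3.13")
print(versions)  # ['3.12', '3.13']
```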
re.Pattern.match() and re.Pattern.search() are closely related to
re.Pattern.findall(). While findall returns all matches in a string, search
returns only the first match, and match only matches at the beginning of the
string. As a less trivial example, consider a block of text and a regular
expression that can identify most email addresses:
>>> addresses = """Veit <veit@cusy.io>
... Veit Schiele <veit.schiele@cusy.io>
... cusy GmbH <info@cusy.io>
... """
>>> pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
>>> regex.findall(addresses)
['veit@cusy.io', 'veit.schiele@cusy.io', 'info@cusy.io']
>>> regex.search(addresses)
<re.Match object; span=(6, 18), match='veit@cusy.io'>
>>> print(regex.match(addresses))
None
regex.match returns None because it only succeeds if the pattern occurs at the
beginning of the string.
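Because search and match return None when nothing is found, the result should be checked before it is used. A minimal sketch, assuming a hypothetical helper first_address() that is not part of the re module:

```python
import re

# Same pattern as in the example above.
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)


def first_address(text):
    """Return the first email address in text, or None if there is none."""
    found = regex.search(text)
    return found.group(0) if found else None


print(first_address("cusy GmbH <info@cusy.io>"))  # info@cusy.io
print(first_address("no address here"))           # None

# match succeeds only when the address is at the very beginning.
at_start = regex.match("info@cusy.io, cusy GmbH")
print(at_start.group(0))  # info@cusy.io
```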
Suppose you want to find email addresses and at the same time split each address into its three components:
user name
domain name
domain suffix
To do this, first place parentheses () around the parts of the pattern to be
segmented:
>>> pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
>>> regex = re.compile(pattern, flags=re.IGNORECASE)
>>> match = regex.match("veit@cusy.io")
>>> match.groups()
('veit', 'cusy', 'io')
re.Match.groups() returns a tuple containing all subgroups of the match.
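Individual groups can also be read one at a time with re.Match.group(). The sketch below additionally uses named groups, (?P<name>...), which the text above does not cover; the group names user, domain and suffix are chosen here for illustration only.

```python
import re

# Same pattern as above, but each group is given an illustrative name.
pattern = (
    r"(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})"
)
regex = re.compile(pattern, flags=re.IGNORECASE)

match = regex.match("veit.schiele@cusy.io")
print(match.group(1))         # veit.schiele
print(match.group("domain"))  # cusy
print(match.groupdict())
# {'user': 'veit.schiele', 'domain': 'cusy', 'suffix': 'io'}
```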
re.Pattern.findall() returns a list of tuples if the pattern contains groups:
>>> regex.findall(addresses)
[('veit', 'cusy', 'io'), ('veit.schiele', 'cusy', 'io'), ('info', 'cusy', 'io')]
Groups can also be used in re.Pattern.sub(), where \1 stands for the first
matching group, \2 for the second, and so on:
>>> print(regex.sub(r"Username: \1, Domain: \2, Suffix: \3", addresses))
Veit <Username: veit, Domain: cusy, Suffix: io>
Veit Schiele <Username: veit.schiele, Domain: cusy, Suffix: io>
cusy GmbH <Username: info, Domain: cusy, Suffix: io>
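Besides a replacement string, re.Pattern.sub() also accepts a function that receives each match object and returns the replacement text. A minimal sketch, in which mask_user() is a hypothetical helper; the second call shows the unambiguous \g<1> spelling of a group reference:

```python
import re

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)


def mask_user(match):
    """Hide the user name but keep the domain name and suffix."""
    return f"***@{match.group(2)}.{match.group(3)}"


masked = regex.sub(mask_user, "Veit <veit@cusy.io>")
print(masked)  # Veit <***@cusy.io>

# \g<1> is equivalent to \1, but stays readable next to digits.
spelled = regex.sub(r"\g<1> at \g<2>.\g<3>", "info@cusy.io")
print(spelled)  # info at cusy.io
```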
The following table contains a brief overview of methods for regular expressions:
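Method | Description |
---|---|
re.Pattern.findall() | returns all non-overlapping matches of the pattern in a string as a list. |
re.Pattern.finditer() | like re.Pattern.findall(), but returns an iterator over re.Match objects. |
re.Pattern.match() | matches the pattern at the beginning of the string and optionally segments the pattern components into groups; if the pattern matches, a re.Match object is returned, otherwise None. |
re.Pattern.search() | searches the string for matches to the pattern; in this case, returns a re.Match object for the first match, which can be anywhere in the string, otherwise None. |
re.Pattern.split() | splits the string into parts each time the pattern occurs. |
re.Pattern.sub(), re.Pattern.subn() | replaces all (or, with the count argument, the first count) occurrences of the pattern in the string with a replacement expression; subn additionally returns the number of replacements made. |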
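The methods re.Pattern.finditer(), re.Pattern.split() and re.Pattern.subn() are not demonstrated in the examples above. The following is a minimal sketch of all three; the sample string and the separator pattern are assumptions made for illustration.

```python
import re

# Hypothetical sample data: addresses separated by commas or semicolons.
addresses = "veit@cusy.io, veit.schiele@cusy.io; info@cusy.io"
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)

# finditer yields one match object per hit, so positions are available too.
spans = [m.span() for m in regex.finditer(addresses)]
print(spans)

# split cuts the string wherever the separator pattern occurs.
separator = re.compile(r"[,;]\s*")
parts = separator.split(addresses)
print(parts)  # ['veit@cusy.io', 'veit.schiele@cusy.io', 'info@cusy.io']

# subn returns the new string together with the number of replacements.
hidden, count = regex.subn("<hidden>", addresses)
print(hidden, count)  # <hidden>, <hidden>; <hidden> 3
```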
Checks
Which regular expression would you use to find strings that represent the numbers between -3 and +3?
Which regular expression would you use to find hexadecimal values?