Python is a convenient language that’s often used for scripting, data science, and web development.
In this article, we’ll look at how to use regex with Python to make finding text easier.
Finding Patterns of Text with Regular Expressions
Regular expressions, or regexes, are descriptions for a pattern of text.
For instance, \d represents a single digit. We can combine such characters into regexes that search text.
To use regexes to search for text, we have to import the re module and then create a regex object from a regex string as follows:
import re
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')
The regex above matches a North American phone number written as three digits, a hyphen, three digits, a hyphen, and four digits.
Then if we have the following string:
msg = 'Joe\'s phone number is 555-555-1212'
We can look for the phone number inside msg with the regex object's search method as follows:
import re
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')
msg = 'Joe\'s phone number is 555-555-1212'
match = phone_regex.search(msg)
When we inspect the match object, we see something like:
<re.Match object; span=(22, 34), match='555-555-1212'>
Then we can get the matched text as a string by calling the group method:
phone = match.group()
phone has the value '555-555-1212'.
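One detail worth keeping in mind is that search returns None when the pattern isn't found anywhere in the string, so calling group on the result directly would raise an AttributeError. Here's a minimal sketch of guarding against that. find_phone is a hypothetical helper name used only for illustration, and the second message is made up for the example:

```python
import re

phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')


def find_phone(text):
    """Return the first phone number in text, or None if there isn't one.

    find_phone is a hypothetical helper, not part of the re module.
    """
    match = phone_regex.search(text)
    # search returns None when nothing matches, so check before calling group
    if match is None:
        return None
    return match.group()


found = find_phone("Joe's phone number is 555-555-1212")
missing = find_phone('Joe has no phone')
print(found)
print(missing)
```

found holds the phone number, while missing is None because the second string has no digits that fit the pattern.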
Grouping with Parentheses
We can use parentheses to split the pattern into groups, so that each part of the match can be retrieved on its own.
To do that with our phone number regex, we can write:
phone_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
Then when we call search, we can get either the whole matched string or the individual groups.
group takes an integer that selects the part matched by the corresponding pair of parentheses, counting from 1.
Therefore, we can rewrite our program to get the whole match and the individual parts of the phone number as follows:
import re
phone_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
msg = 'Joe\'s phone number is 123-456-7890'
match = phone_regex.search(msg)
phone = match.group()
area_code = match.group(1)
exchange_code = match.group(2)
station_code = match.group(3)
In the code above, phone should be '123-456-7890' since we passed nothing to group. Passing in 0 returns the same thing.
area_code should be '123' since we passed 1 to group, which returns the first group's match.
exchange_code should be '456' since we passed 2 to group, which returns the second group's match.
Finally, station_code should be '7890' since we passed 3 to group, which returns the third group's match.
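As a small check of the numbering described above, the sketch below compares group() with group(0). It also uses the match object's groups method, which the examples in this article don't otherwise use, to collect all the groups at once as a tuple:

```python
import re

phone_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
match = phone_regex.search("Joe's phone number is 123-456-7890")

# group() and group(0) both return the whole match
whole = match.group(0)

# groups() returns every parenthesized group as a tuple, in order
parts = match.groups()
area_code, exchange_code, station_code = parts

print(whole)
print(parts)
```

Unpacking the tuple this way is handy when we want all the parts and don't need to call group once per number.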
If we want to match parentheses or any other special character literally, rather than have it act as regex syntax, then we have to put a \ before it.
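To illustrate escaping, here's a sketch that matches a phone number whose area code is written in parentheses. The message format is an assumption made up for this example. The escaped \( and \) match literal parentheses, while the unescaped ones still create groups:

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) still create groups
phone_regex = re.compile(r'\((\d{3})\) (\d{3}-\d{4})')
msg = "Joe's phone number is (555) 555-1212"
match = phone_regex.search(msg)

phone = match.group()          # whole match, including the literal parentheses
area_code = match.group(1)     # the digits inside the literal parentheses
local_number = match.group(2)  # the rest of the number

print(phone)
print(area_code)
print(local_number)
```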
Matching Multiple Groups with the Pipe
We can use the | symbol, which is called a pipe, to match one of several expressions.
For instance, we can write the following:
import re
name_regex = re.compile('Jane|Joe')
msg = 'Jane and Joe'
match = name_regex.search(msg)
match = match.group()
match should be 'Jane' since that's the first text in msg, reading from left to right, that matches the regex.
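Which name is returned depends on where the names appear in the searched string, not on the order of the alternatives in the pattern. A short sketch, where the second message is made up for illustration:

```python
import re

name_regex = re.compile('Jane|Joe')

# search scans the string from left to right and stops at the first match
first = name_regex.search('Jane and Joe').group()
swapped = name_regex.search('Joe and Jane').group()

print(first)
print(swapped)
```

With the names swapped in the message, the same regex returns 'Joe' instead, even though 'Jane' is listed first in the pattern.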
We can combine pipes and parentheses to find a part of a string. For example, we can write the following code:
import re
snow_regex = re.compile(r'snow(man|mobile|shoe)')
msg = 'I am walking on a snowshoe'
snow_match = snow_regex.search(msg)
match = snow_match.group()
group_match = snow_match.group(1)
This gets the whole match with match, which has the value 'snowshoe'.
group_match should have the text matched by the group in parentheses, which is 'shoe'.
Optional Matching with the Question Mark
We can add a question mark to the end of a group, which makes the group optional for matching purposes.
For example, we can write:
import re
snow_regex = re.compile(r'snow(shoe)?')
msg = 'I am walking on a snowshoe'
msg_2 = 'I am walking on snow'
snow_match = snow_regex.search(msg)
snow_match_2 = snow_regex.search(msg_2)
Then snow_match.group() returns 'snowshoe' and snow_match.group(1) returns 'shoe'.
Since the (shoe) group is optional, snow_match_2.group() returns 'snow' and snow_match_2.group(1) returns None.
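The same idea can be applied to the earlier phone number pattern. As a sketch, assuming we want the area code to be optional, we can wrap it and its hyphen in a group followed by a question mark. The messages here are made up for illustration:

```python
import re

# the area code and its trailing hyphen form one optional group
phone_regex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')

with_area = phone_regex.search('Call 555-555-1212 today')
without_area = phone_regex.search('Call 555-1212 today')

print(with_area.group(), with_area.group(1))
print(without_area.group(), without_area.group(1))
```

When the area code is missing, the regex still matches the seven-digit number, and group(1) returns None just like in the snow example.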
Conclusion
We can use regexes to find patterns in strings. A regex is a string of characters that describes a pattern of text.
In Python, we can use the re module to create a regex object from a string.
Then we can use it to do searches with the search method.
We can define groups with parentheses. Once we've done that, we can call group on the match object returned by search.
Passing an integer to group returns the text matched by the group at that position.
We can make groups optional with a question mark appended after the group.