Introduction to regular expressions in Python (2020 tutorial)


 

1. Introduction to RegEx

If you have been surfing the web for a while, you have probably signed up for a service with your email address. Most websites validate the address first, and if you enter an invalid one, your registration is rejected. So here is the question: how does the web service validate the email address you just entered?

The answer is a regular expression. The web service matches your email address against a regular expression pattern; if it matches, you are allowed to register, otherwise you are rejected. In pseudocode:

if "foo@google.com" match "pattern":
    allow_register()
else:
    deny_register()
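
As a rough illustration, the pseudocode above could be made runnable along these lines. The pattern is a deliberately simplified assumption (real services use stricter checks, often plus a confirmation email), and is_plausible_email is a hypothetical helper name, not something from a library.

```python
import re

# Simplified on purpose: some non-space, non-@ characters, an "@",
# a domain part, a dot, and a suffix. This is an assumption for the demo.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_plausible_email(address):
    """Return True if the whole address fits the simplified pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None

for address in ["foo@google.com", "not-an-email"]:
    if is_plausible_email(address):
        print(f"{address}: allow registration")
    else:
        print(f"{address}: deny registration")
```

fullmatch() is used so that the entire address has to fit the pattern, not just its beginning.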

A regular expression is actually a tiny programming language embedded in Python. You use this tiny language through the re module. So let's dive in.

2. A simple match

>>> import re
>>> re.match(r'hello', 'hello world!')
<re.Match object; span=(0, 5), match='hello'>

Just import the re module and call its match() function. It takes two arguments: the first is the pattern we want to match against, and the second is the string itself. The pattern is usually written as a raw string, with an "r" before the opening quote mark.

match() always starts at the beginning of the string: here the string "hello world!" starts with "hello", so it finds a match. If "hello" appears somewhere inside the string but not at the start, match() finds no match.

>>> re.match(r'hello', 'hi, hello world!')

The string "hi, hello world!" does not start with "hello", so match() returns None (the interactive shell prints nothing for None).

2.1 Match a single character

Most characters simply match themselves; for example, the regular expression r"food" matches "food". However, if we could only match exact characters, regular expressions would be useless.

Regular expressions are powerful enough to solve most matching problems. For instance, we can match any lowercase English letter:

>>> re.match(r'[abcdefghijklmnopqrstuvwxyz]', 'b')
<re.Match object; span=(0, 1), match='b'>
>>> re.match(r'[abcdefghijklmnopqrstuvwxyz]', 'l')
<re.Match object; span=(0, 1), match='l'>
>>> 

We use the "[]" bracket metacharacters, which match exactly one character out of those listed between the brackets. There are other metacharacters, such as +, $ and ?, which will be introduced later.

r'[abcdefghijklmnopqrstuvwxyz]'

We put all the lowercase letters inside the brackets, and if the string in the second argument starts with a lowercase letter, match() returns a re.Match object.

Writing out every lowercase letter in the brackets is a waste of time, so we can use a range as a shortcut:

r'[a-z]'

Likewise, [A-Z] matches uppercase letters and [0-9] matches digits. To match a lowercase letter, an uppercase letter or a digit, we can write:

r'[a-zA-Z0-9]'

>>> re.match(r'[a-zA-Z0-9]', 'c')
<re.Match object; span=(0, 1), match='c'>
>>> re.match(r'[a-zA-Z0-9]', 'F')
<re.Match object; span=(0, 1), match='F'>
>>> re.match(r'[a-zA-Z0-9]', '8')
<re.Match object; span=(0, 1), match='8'>
>>> 

If we want to match a character that is not a digit or a letter, we can use the ^ metacharacter inside the brackets; the ^ must appear right after the left bracket.

r'[^a-zA-Z0-9]'

>>> re.match(r'[^a-zA-Z0-9]', '8')
>>> re.match(r'[^a-zA-Z0-9]', '#')
<re.Match object; span=(0, 1), match='#'>
>>> re.match(r'[^a-zA-Z0-9]', 'G')
>>> re.match(r'[^a-zA-Z0-9]', '!')
<re.Match object; span=(0, 1), match='!'>
>>> 

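Inside brackets, a leading ^ negates the set of characters that follows it. As you can see, only characters that are not letters or digits are matched.

Regular expressions also define some special sequences, each consisting of a "\" and a letter, which represent predefined sets of characters. The equivalent classes listed below are exact for ASCII text (or when the re.ASCII flag is passed); with Python 3 str patterns, \d, \s and \w also match the corresponding Unicode digits, whitespace and word characters.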

\d Matches any decimal digit; this is equivalent to the class [0-9].

\D Matches any non-digit character; this is equivalent to the class [^0-9].

\s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

>>> re.match(r'\d', '9')
<re.Match object; span=(0, 1), match='9'>
>>> re.match(r'\D', 'w')
<re.Match object; span=(0, 1), match='w'>
>>> re.match(r'\s', ' ')
<re.Match object; span=(0, 1), match=' '>
>>> re.match(r'\S', 'j')
<re.Match object; span=(0, 1), match='j'>
>>> re.match(r'\w', '0')
<re.Match object; span=(0, 1), match='0'>
>>> re.match(r'\W', '*')
<re.Match object; span=(0, 1), match='*'>
>>> 
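
The following sketch illustrates one subtlety of these sequences: with ordinary str patterns in Python 3, \d is Unicode-aware, so it is only exactly equal to [0-9] when the re.ASCII flag is passed. The sample character is just one arbitrary non-ASCII digit picked for the demonstration.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# str patterns are Unicode-aware by default, so \d accepts this digit
unicode_hit = re.match(r"\d", arabic_three)

# an explicit class, or \d combined with re.ASCII, only accepts 0-9
class_hit = re.match(r"[0-9]", arabic_three)
ascii_hit = re.match(r"\d", arabic_three, re.ASCII)

print(unicode_hit is not None)  # True
print(class_hit is not None)    # False
print(ascii_hit is not None)    # False
```

If your input must be plain ASCII digits, writing [0-9] or passing re.ASCII keeps the pattern strict.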

2.2 Match multiple characters

To match multiple characters, we just combine the pieces introduced so far.

>>> re.match(r'\d\d\d-\d\d\d\d\d\d\d', '124-5698463')
<re.Match object; span=(0, 11), match='124-5698463'>

This RegEx matches a phone number. The pattern above is lengthy and hard to manage, so you can use {num} to specify how many times the preceding item repeats.

>>> re.match(r'\d{3}-\d{7}', '124-5698463')
<re.Match object; span=(0, 11), match='124-5698463'>

Sometimes you type your phone number without the "-", as in "1245698463".

>>> re.match(r'\d{3}-{0,1}\d{7}', '1245698463')
<re.Match object; span=(0, 10), match='1245698463'>

The hyphen "-" is followed by {0,1}, indicating that the hyphen may be absent or appear just once. Similarly, r"\d{2,8}" matches at least 2 and at most 8 digits.

The "*" character is a metacharacter when it is not used inside brackets "[]". It does not match a literal "*"; instead it indicates that the item before it may be matched zero or more times.

>>> re.match(r'abc*', 'ab')
<re.Match object; span=(0, 2), match='ab'>
>>> re.match(r'abc*', 'abc')
<re.Match object; span=(0, 3), match='abc'>
>>> re.match(r'abc*', 'abcc')
<re.Match object; span=(0, 4), match='abcc'>
>>> re.match(r'abc*', 'abccccc')
<re.Match object; span=(0, 7), match='abccccc'>

The RegEx r"abc*" indicates that after matching "ab", the "c" character may be matched zero or more times. So "ab", "abc", "abcc", and "abccccc" are all matches.

The * metacharacter can also be written as {0,}, giving only the lower bound 0 and leaving the upper bound open.

>>> re.match(r'abc{0,}', 'ab')
<re.Match object; span=(0, 2), match='ab'>
>>> re.match(r'abc{0,}', 'abc')
<re.Match object; span=(0, 3), match='abc'>
>>> re.match(r'abc{0,}', 'abcc')
<re.Match object; span=(0, 4), match='abcc'>

The "+" character is also a metacharacter when it is not used inside brackets "[]". It does not match a literal "+"; instead it indicates that the item before it must be matched one or more times.

>>> re.match(r'abc+', 'ab')
>>> re.match(r'abc+', 'abc')
<re.Match object; span=(0, 3), match='abc'>
>>> re.match(r'abc+', 'abcc')
<re.Match object; span=(0, 4), match='abcc'>

The "+" metacharacter can also be replaced with {1,}.

>>> re.match(r'abc{1,}', 'ab')
>>> re.match(r'abc{1,}', 'abc')
<re.Match object; span=(0, 3), match='abc'>
>>> re.match(r'abc{1,}', 'abcc')
<re.Match object; span=(0, 4), match='abcc'>

Another metacharacter is "?". It indicates that the item before it may be matched zero times or exactly once.

>>> re.match(r'abc?', 'ab')
<re.Match object; span=(0, 2), match='ab'>
>>> re.match(r'abc?', 'abc')
<re.Match object; span=(0, 3), match='abc'>

The equivalent pattern for "?" is {0,1}, written without a space after the comma.
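
As a quick sanity check, the equivalences between the shorthand quantifiers and their brace forms can be verified with a small loop. This is only a sketch; matched_text is a hypothetical helper introduced for the comparison.

```python
import re

# Each shorthand quantifier paired with its brace form
EQUIVALENT = [
    (r"abc*", r"abc{0,}"),
    (r"abc+", r"abc{1,}"),
    (r"abc?", r"abc{0,1}"),
]
SAMPLES = ["ab", "abc", "abcc", "abccccc"]

def matched_text(pattern, text):
    """Return the text matched at the start of the string, or None."""
    res = re.match(pattern, text)
    return res.group() if res else None

for short, braces in EQUIVALENT:
    for text in SAMPLES:
        # both spellings must match exactly the same text
        assert matched_text(short, text) == matched_text(braces, text)

print("shorthand and brace forms agree on all samples")
```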

2.3 Indicating the start and end of a pattern

Look at the following variable names:

abc, 2pi, area#, pe&, HelloWorld, __good_, _____

According to Python's naming rules (ignoring non-ASCII identifiers), a variable name must start with an English letter or an underscore "_", and may contain only digits, letters and underscores. Let's write a regular expression to validate the variable names above.

>>> var_names = ['abc','2pi', 'area#', 'pe&', 'HelloWorld', "__good_", "_____"]
>>> for name in var_names:
...     res = re.match(r"[a-zA-Z_][a-zA-Z0-9_]*", name)
...     if res:
...         print(f"{name} is valid")
...     else:
...         print(f"{name} is not valid")
...
abc is valid
2pi is not valid
area# is valid
pe& is valid
HelloWorld is valid
__good_ is valid
_____ is valid
>>> 

As you can see in the output, area# and pe& are reported as valid, which is not what we expect. The cause is how the matching engine works: once "area" has matched, the pattern is satisfied, so the engine never checks the trailing "#" and returns a re.Match object. We can use another metacharacter, "$", to require that the pattern match all the way to the end of the string.

>>> for name in var_names:
...     res = re.match(r"[a-zA-Z_][a-zA-Z0-9_]*$", name)
...     if res:
...         print(f"{name} is valid")
...     else:
...         print(f"{name} is not valid")
...
abc is valid
2pi is not valid
area# is not valid
pe& is not valid
HelloWorld is valid
__good_ is valid
_____ is valid
>>> 

Now our output is correct.
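
An alternative worth knowing is re.fullmatch(), which requires the whole string to match without writing "$". The sketch below wraps it in a hypothetical helper, is_valid_name, and also shows one edge case where "$" and fullmatch() differ: "$" still matches just before a trailing newline.

```python
import re

IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"

def is_valid_name(name):
    """Hypothetical helper: True if the whole string is an ASCII identifier."""
    return re.fullmatch(IDENT, name) is not None

var_names = ['abc', '2pi', 'area#', 'pe&', 'HelloWorld', '__good_', '_____']
for name in var_names:
    print(name, "is valid" if is_valid_name(name) else "is not valid")

# "$" also matches just before a newline at the end of the string;
# fullmatch() does not let the newline slip through
print(re.match(IDENT + "$", "abc\n") is not None)  # True
print(is_valid_name("abc\n"))                      # False
```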

Another metacharacter, "^", requires the match to be at the start of the string when it is used outside of brackets "[]".

>>> re.match(r"^[Hh]ello", "Hello World!")
<re.Match object; span=(0, 5), match='Hello'>

Because match() always starts from the beginning of the string, we usually omit the leading "^".

Suppose we want to match picture file names ending in ".jpg", ".png", or ".gif". We could write this:

>>> re.match(r".+\.(jpg|png|gif)", "new_pic.jpg")
<re.Match object; span=(0, 11), match='new_pic.jpg'>
>>> re.match(r".+\.(jpg|png|gif)", "new_pic.png")
<re.Match object; span=(0, 11), match='new_pic.png'>
>>> re.match(r".+\.(jpg|png|gif)", "new_pic.gif")
<re.Match object; span=(0, 11), match='new_pic.gif'>

We use parentheses to enclose all the possible formats, separated by the "|" character. As you can see, all three formats match.

The last metacharacter is ".", which matches any character except "\n".
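
One caveat with the picture pattern above: match() is satisfied as soon as the extension has been seen, so a name with extra text after it is accepted too. A minimal sketch of a stricter check, using re.fullmatch() and a hypothetical helper called is_image_name:

```python
import re

IMAGE_NAME = r".+\.(jpg|png|gif)"

def is_image_name(filename):
    """Hypothetical helper: True only if the whole name ends in a listed extension."""
    return re.fullmatch(IMAGE_NAME, filename) is not None

print(is_image_name("new_pic.jpg"))      # True
print(is_image_name("new_pic.gif"))      # True
print(is_image_name("new_pic.jpg.exe"))  # False

# match() alone stops after ".jpg" and therefore accepts the last name
loose = re.match(IMAGE_NAME, "new_pic.jpg.exe")
print(loose.group())                     # new_pic.jpg
```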

>>> text = "hello\nworld"
>>> re.match(r".+", text)
<re.Match object; span=(0, 5), match='hello'>
>>> 

As you can see, the "." metacharacter does not match the "\n" character.

If we want it to match the "\n" character, we can write:

>>> re.match(r".+", text, re.S)
<re.Match object; span=(0, 11), match='hello\nworld'>

We pass re.S (short for re.DOTALL) as the third argument. As you can see, "\n" is now matched.

Below is a list of metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

If you want to match one of those metacharacters literally within a reg-ex, you have to use a backslash "\" to escape it.

>>> re.match(r".+\.py$", "foo.py")
<re.Match object; span=(0, 6), match='foo.py'>
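
When the text to match comes from somewhere else and may contain metacharacters, escaping by hand is error-prone. The standard library's re.escape() adds the backslashes for you; the sample text below is only an illustration.

```python
import re

literal = "1+1=2"

# Used directly as a pattern, "+" acts as a quantifier,
# so the text does not even match itself
unescaped = re.fullmatch(literal, literal)
print(unescaped)                 # None

# re.escape() backslash-escapes the metacharacters in the text
escaped_pattern = re.escape(literal)
escaped = re.fullmatch(escaped_pattern, literal)
print(escaped is not None)       # True
```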

2.4 The backslash problem

If you write the regular expression as a raw string, the problem is easy to handle. However, if you use a normal string, your reg-ex can get ugly quickly.

>>> file_path = 'C:\\documents\\user\\abc.py'
>>> re.match(r'C:\\documents\\user\\.+\.py', file_path)
<re.Match object; span=(0, 24), match='C:\\documents\\user\\abc.py'>

As you can see, to match a backslash in a regular expression you have to use "\" to escape it, so each literal backslash is written as "\\"; in a raw string that is all you need. If your reg-ex is a normal string, every one of those backslashes must itself be escaped for Python, so you have to write this:

>>> re.match('C:\\\\documents\\\\user\\\\.+\\.py', file_path)
<re.Match object; span=(0, 24), match='C:\\documents\\user\\abc.py'>

Wow, this is ugly and hard to manage; so, as I said before, always use a raw string to avoid such a mess.
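
To convince yourself that the two spellings really are the same pattern, you can compare the strings directly. A small sketch using the same path and patterns as above:

```python
import re

file_path = 'C:\\documents\\user\\abc.py'

raw_pattern = r'C:\\documents\\user\\.+\.py'
normal_pattern = 'C:\\\\documents\\\\user\\\\.+\\.py'

# Both literals produce exactly the same pattern text
print(raw_pattern == normal_pattern)           # True

# In a raw string each backslash is kept as typed;
# in a normal string two typed backslashes collapse into one character
print(len(r'\\'), len('\\'))                   # 2 1

print(re.match(raw_pattern, file_path).group() == file_path)  # True
```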

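2.5 Other metacharacters

"\b" matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). Note that match() still only tries the start of the string, so the last example below finds nothing even though "boy" is a whole word there.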

>>> re.match(r"\bboy\b", 'boy')
<re.Match object; span=(0, 3), match='boy'>
>>> re.match(r"\bboy\b", 'boyfriend')
>>> re.match(r"\bboy\b", 'hello, little boy!')

"\B" is the opposite of "\b": it matches only at a position that is not a word boundary.

>>> re.match(r"boy\B", 'boyfriend')
<re.Match object; span=(0, 3), match='boy'>
>>> re.match(r"boy\B", 'hello, little boy!')

"\A" matches only at the very beginning of the string. When not in MULTILINE mode, it behaves the same as "^". In MULTILINE mode, "^" also matches at the beginning of each line, while "\A" still matches only at the start of the string.

2.6 Fetch the matched string

The re.match() function returns a re.Match object, whose group() method returns the matched string.

>>> res = re.match(r'C:\\documents\\user\\.+\.py', file_path)
>>> res.group()
'C:\\documents\\user\\abc.py'
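
Besides group(), a Match object can report where the match was found, and because match() returns None on failure it is worth checking the result before calling any method on it. A minimal sketch; leading_digits is a hypothetical helper and the sample strings are made up for the demonstration.

```python
import re

def leading_digits(text):
    """Hypothetical helper: return the run of digits at the start of text, or None."""
    res = re.match(r"\d+", text)
    if res is None:      # no match: res.group() would raise AttributeError
        return None
    return res.group()

res = re.match(r"\d+", "2020 tutorial")
print(res.group())                      # 2020
print(res.start(), res.end())           # 0 4
print(res.span())                       # (0, 4)
print(leading_digits("tutorial 2020"))  # None
```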

3. Grouping

Sometimes you need more than just the whole matched string. For example, we may want to get the file name from the matched string "C:\\documents\\user\\abc.py".

We can use parentheses to group the file name part and fetch it using the group() method.

>>> res = re.match(r'C:\\documents\\user\\(.+\.py)', file_path)
>>> res.group()
'C:\\documents\\user\\abc.py'
>>> res.group(1)
'abc.py'
>>> 

Group 0 always exists and holds the whole match, even if your regular expression contains no "()" groups; calling group() without an argument is the same as group(0). Each "()" group in your reg-ex is numbered from 1, in the order of its opening parenthesis. You can then use these index numbers to access each group's content.

>>> res = re.match(r'(C:)\\documents\\user\\(.+\.py)', file_path)
>>> res.group(0)
'C:\\documents\\user\\abc.py'
>>> res.group(1)
'C:'
>>> res.group(2)
'abc.py'
>>> 

You can also repeat a group using the *, +, ?, {num} and {m,n} repetition syntax.
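
Here is a short sketch of repeating a group. The IPv4-like pattern is an illustrative assumption: it only checks the shape of the text (four dot-separated runs of one to three digits) and does not check that each number is at most 255.

```python
import re

# (ab)+ repeats the whole group, not just the last character
res = re.match(r"(ab)+", "ababab")
print(res.group())    # ababab
print(res.group(1))   # ab  (a repeated group keeps its last repetition)

# Three "digits plus dot" groups followed by a final run of digits
ipv4_like = r"(\d{1,3}\.){3}\d{1,3}"
full = re.match(ipv4_like, "192.168.0.1")
short = re.match(ipv4_like, "192.168.0")
print(full.group())   # 192.168.0.1
print(short)          # None
```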

3.1 Matching HTML tags

If you have learned HTML, you must be familiar with "<h1>hello</h1>". An opening tag "<h1>" must be paired with a closing tag "</h1>". If not, parsing errors could make your web page unreadable.

So let's write a reg-ex to validate this syntax.

>>> re.match(r'<h1>.*</h1>', '<h1>Hello</h1>')
<re.Match object; span=(0, 14), match='<h1>Hello</h1>'>

The code above can only match the "<h1>" tag, but there are tons of other tags, such as "<h2>", "<h3>" and "<p>". Let's write a more generic one:

>>> re.match(r'<.+>.*</.+>', '<h1>Hello</h1>')
<re.Match object; span=(0, 14), match='<h1>Hello</h1>'>
>>> re.match(r'<.+>.*</.+>', '<h2>Hello</h2>')
<re.Match object; span=(0, 14), match='<h2>Hello</h2>'>

The above code can match any tag, but it poses a challenge for us: not only does it match "<h1>Hello</h1>", it also matches "<h1>Hello</h2>".

>>> re.match(r'<.+>.*</.+>', '<h1>Hello</h2>')
<re.Match object; span=(0, 14), match='<h1>Hello</h2>'>

Fortunately, we can make the reg-ex require the same tag name in both places:

>>> re.match(r'<(.+)>.*</\1>', '<h1>Hello</h1>')
<re.Match object; span=(0, 14), match='<h1>Hello</h1>'>
>>> re.match(r'<(.+)>.*</\1>', '<h1>Hello</h2>')

We use "()" to group the opening tag name, and in the closing tag we use "\1" to indicate that whatever group 1 matched must appear again here; the "1" is the index of the group.
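
Because ".+" and ".*" are greedy, the pattern above can swallow more than one tag when several appear on the same line. A somewhat tighter sketch, assuming tag names consist of word characters and the content contains no nested tags, restricts each part with a character class. Regular expressions are still not a full HTML parser, so treat this as a demonstration of backreferences only.

```python
import re

# \w+ for the tag name, [^<]* for content that contains no other tag
TAG_PAIR = re.compile(r"<(\w+)>([^<]*)</\1>")

paired = TAG_PAIR.match("<h1>Hello</h1>")
mismatched = TAG_PAIR.match("<h1>Hello</h2>")
print(paired.groups())   # ('h1', 'Hello')
print(mismatched)        # None

# findall() returns one (tag, content) tuple per matched pair
pairs = TAG_PAIR.findall("<h1>A</h1><p>B</p>")
print(pairs)             # [('h1', 'A'), ('p', 'B')]
```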

3.2 Naming groups

You can also name a group in a reg-ex instead of relying only on its index. When you add lots of groups, retrieving content by index number makes you wonder which is which.

Instead, you can use the Python-specific naming syntax to name each group:

>>> res = re.match(r'C:\\documents\\user\\(?P<file_name>.+\.py)', file_path)
>>> res.group('file_name')
'abc.py'

We use the syntax "(?P<name>...)" to name the group.

We can also use a named group when backreferencing a pattern, with the syntax "(?P=name)".

>>> re.match(r'<(?P<tag>.+)>.*</(?P=tag)>', '<h1>Hello</h1>')
<re.Match object; span=(0, 14), match='<h1>Hello</h1>'>
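
Named groups also make it easy to collect everything at once with the Match object's groupdict() method. A small sketch reusing the phone number format from section 2.2; the group names area and number are made up for this example.

```python
import re

PHONE = re.compile(r"(?P<area>\d{3})-?(?P<number>\d{7})")

res = PHONE.match("124-5698463")
print(res.group("area"))    # 124
print(res.group("number"))  # 5698463

# groupdict() maps every group name to the text it matched
fields = res.groupdict()
print(fields)               # {'area': '124', 'number': '5698463'}
```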

4. Python re specific methods 

re.search(): searches the entire string for a match; only the first match is returned.

>>> re.search(r'the', 'good the the file')
<re.Match object; span=(5, 8), match='the'>

If you use match(), it only matches at the beginning of the string; in this case, there is no match.

>>> re.match(r'the', 'good the the file')
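
The difference between match() and search() also explains the word-boundary example from section 2.5, where the pattern found nothing in 'hello, little boy!'. A brief sketch comparing the two functions on that string (the second sample string is made up):

```python
import re

text = "hello, little boy!"

at_start = re.match(r"\bboy\b", text)    # only tries position 0
anywhere = re.search(r"\bboy\b", text)   # scans forward for the first match

print(at_start)          # None
print(anywhere.span())   # (14, 17)

# "boy" inside "boyfriend" is not followed by a word boundary,
# so even search() rejects it
inside_word = re.search(r"\bboy\b", "my boyfriend")
print(inside_word)       # None
```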

re.findall(): finds all the matches and returns them as a list.

>>> re.findall(r'the', 'good the the file')
['the', 'the']

re.sub(): finds all the matches and replaces each of them with a new value.

>>> re.sub(r'the','***', 'good the the file')
'good *** *** file'

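re.split(): splits the string wherever the reg-ex matches. Because the pattern here is wrapped in a capturing group, the separators are included in the result.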

>>> re.split(r'(\?| |-)', 'hello?have a nice-day')
['hello', '?', 'have', ' ', 'a', ' ', 'nice', '-', 'day']
>>> 
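
A few variations on the functions above may be useful in practice. The sketch below uses only standard re functions; the sample strings are the ones from this section.

```python
import re

text = "good the the file"

# finditer() yields Match objects, so positions are available too
spans = [m.span() for m in re.finditer(r"the", text)]
print(spans)        # [(5, 8), (9, 12)]

# count limits how many replacements sub() makes
first_only = re.sub(r"the", "***", text, count=1)
print(first_only)   # good *** the file

# without a capturing group, split() drops the separators
words = re.split(r"[? -]", "hello?have a nice-day")
print(words)        # ['hello', 'have', 'a', 'nice', 'day']

# a compiled pattern can be reused for several calls
the = re.compile(r"the")
print(the.findall(text))  # ['the', 'the']
```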






