Capturing repeating subpatterns in Python regex



While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am doing is a little more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only contains .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?

Asked By: yasar

Answer #1:

The standard re module doesn’t support repeated captures, but the third-party regex module does:

>>> import regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']

In your case I’d go with splitting the repeated subpatterns later. It leads to simple and readable code; see, for example, the code in @Li-aung Yip’s answer.
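
As a rough illustration of the "split later" approach using only the standard re module, the sketch below reuses the pattern from this answer. The variable names (address, host, rest) are made up for the example. It first shows that the repeated group keeps only its last iteration, then recovers the parts from the enclosing group.

```python
import re

address = "yasar@webmail.something.edu.tr"

# In the standard re module, a repeated capturing group keeps only
# the text matched by its last iteration.
m = re.match(r"([.\w]+)@((\w+)(\.\w+)+)", address)
print(m.group(4))  # .tr

# Group 2 spans the whole domain, so the individual parts can be
# recovered by splitting it after the match.
host, *rest = m.group(2).split(".")
print(host)  # webmail
print(rest)  # ['something', 'edu', 'tr']
```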

Answered By: jfs

Answer #2:

This will work:

>>> import re
>>> regexp = r"[\w.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama@galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)

But it’s limited to a maximum of six subgroups. A better way to do this would be:

>>> m = re.match(r"[\w.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']

Note that these regexps are fine as long as the email addresses are simple, but there are many kinds of valid addresses they will fail on. See this question for a detailed treatment of email address regexes.
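
The match-then-split idea could be wrapped in a small helper along these lines. The function name domain_labels is hypothetical, and the pattern is a slightly stricter variant of the one above (an assumption of this sketch, not part of the answer): it requires word characters between the dots so that split() never returns empty strings. It is still only suitable for simple addresses.

```python
import re

def domain_labels(address):
    """Return the dot-separated parts after '@', or None if the
    address does not fit the simple pattern used here."""
    # fullmatch requires the whole string to fit the pattern.
    m = re.fullmatch(r"[\w.]+@(\w+(?:\.\w+)*)", address)
    if m is None:
        return None
    return m.group(1).split(".")

print(domain_labels("william.adama@galactica.caprica.fleet.mil"))
# ['galactica', 'caprica', 'fleet', 'mil']
print(domain_labels("not an address"))
# None
```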

Answered By: Li-aung Yip

Answer #3:

You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
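
A minimal sketch of how that might look in practice, assuming the rest of the pattern resembles the one in the question (the surrounding pattern and the variable names are illustrative). The outer group captures the whole repeated run as one string, which can then be broken up in a second step.

```python
import re

address = "yasar@webmail.something.edu.tr"

# The inner group is non-capturing, so the outer group captures
# everything the repetition matched as a single string.
m = re.match(r"[.\w]+@\w+((?:\.\w+)+)", address)
suffix = m.group(1)
print(suffix)  # .something.edu.tr

# A second pass over that string recovers the individual pieces.
parts = re.findall(r"\.\w+", suffix)
print(parts)  # ['.something', '.edu', '.tr']
```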

Answered By: Taymon

Answer #4:

This is what you are looking for:

>>> import re
>>> s="yasar@webmail.something.edu.tr"
>>> r=re.compile(r"\.\w+")
>>> m=r.findall(s)
>>> m
['.something', '.edu', '.tr']
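
One thing to watch for with findall on the whole address: it also picks up dotted pieces before the @ sign. The sketch below uses the address from Answer #2 to show this, and one possible way around it, namely searching only the part after the @. The use of str.partition here is a choice made for this example, not something from the answer.

```python
import re

address = "william.adama@galactica.caprica.fleet.mil"

# Searching the whole address also matches the dotted part of the
# local name before the '@'.
print(re.findall(r"\.\w+", address))
# ['.adama', '.caprica', '.fleet', '.mil']

# Restricting the search to the text after '@' avoids that.
local, _, domain = address.partition("@")
print(re.findall(r"\.\w+", domain))
# ['.caprica', '.fleet', '.mil']
```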

Answered By: Tushar Vazirani

The answers are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
