While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+) (what I am doing is a little more complicated; this is just an example). I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only includes .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns later?
Answer #1:
The re module doesn't support repeated captures (the third-party regex module does):
>>> import regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']
In your case I'd go with splitting the repeated subpatterns later. It leads to simple, readable code; for an example, see the code in @Li-aung Yip's answer.
Answer #2:
This will work:
>>> import re
>>> regexp = r"[\w.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "william.adama@galactica.caprica.fleet.mil"
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)
But it’s limited to a maximum of six subgroups. A better way to do this would be:
>>> m = re.match(r"[\w.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']
Note that regexps are fine so long as the email addresses are simple, but there are all kinds of valid addresses that this will break on. See this question for a detailed treatment of email address regexes.
Answer #3:
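You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+). The inner group is non-capturing, so the outer group captures the whole repeated run as a single string.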
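As a rough illustration of this suggestion, the sketch below wraps the repetition in an outer capturing group and then breaks the captured run apart in a second step. The surrounding pattern and the variable names are assumptions made for the example.

```python
import re

# Outer group captures the whole run; the inner (?:...) group only repeats.
pattern = re.compile(r"[\w.]+@\w+((?:\.\w+)+)")
match = pattern.match("yasar@webmail.something.edu.tr")

# Everything after "yasar@webmail" as one string.
tail = match.group(1)

# Second pass: split the run back into its individual ".label" pieces.
parts = re.findall(r"\.\w+", tail)
print(tail)
print(parts)
```

This keeps a single regex for validation and leaves the per-label splitting to a simple follow-up call.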
Answer #4:
This is what you are looking for:
>>> import re
>>> s="yasar@webmail.something.edu.tr"
>>> r=re.compile(r"\.\w+")
>>> m=r.findall(s)
>>> m
['.something', '.edu', '.tr']
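One caveat with running findall over the whole address: it also picks up dotted pieces in the part before the @. A possible workaround, sketched here with a hypothetical helper named domain_suffixes, is to search only the text after the last @.

```python
import re

SUFFIX = re.compile(r"\.\w+")

def domain_suffixes(address):
    """Return the ".label" pieces found after the last "@" (hypothetical helper)."""
    # rpartition yields ("", "", address) when there is no "@",
    # so an address without "@" is searched as a whole.
    _, _, domain = address.rpartition("@")
    return SUFFIX.findall(domain)

address = "william.adama@galactica.caprica.fleet.mil"
naive = SUFFIX.findall(address)      # includes ".adama" from the local part
scoped = domain_suffixes(address)    # only pieces from the domain
print(naive)
print(scoped)
```

For the address in the answer above the two approaches agree, because yasar contains no dot.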