# re.findall behaves weird

The source string is:

``````# Python 3.4.3
s = r'abc123d, hello 3.1415926, this is my book'
``````

and here is my pattern:

``````pattern = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'
``````

`re.search` gives me the correct result:

``````m = re.search(pattern, s)
print(m)  # output: <_sre.SRE_Match object; span=(3, 6), match='123'>
``````

but `re.findall` just dumps out a list of empty strings:

``````L = re.findall(pattern, s)
print(L)  # output: ['', '', '']
``````

Why can't `re.findall` give me the expected list:

``````['123', '3.1415926']
``````

Use a non-capturing group and a single backslash before the dot:

``````import re
s = r'abc123d, hello 3.1415926, this is my book'
print(re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))
``````

You don't need to escape the backslash twice when you are using a raw string.

Output: `['123', '3.1415926']`

Also, the return value is a list of strings. If you want integers and floats instead, use `map`:

``````import ast
import re
s = r'abc123d, hello 3.1415926, this is my book'
# list() is needed on Python 3, where map returns an iterator
print(list(map(ast.literal_eval, re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))))
``````

Output: `[123, 3.1415926]`
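
As an alternative to `ast.literal_eval`, the conversion can be sketched with the built-in `int` and `float`. The helper name `to_number` is made up for this illustration, and it assumes every matched string is something `int` or `float` accepts:

```python
import re


def to_number(text):
    """Hypothetical helper: float if the text has a dot, int otherwise."""
    return float(text) if "." in text else int(text)


s = "abc123d, hello 3.1415926, this is my book"
pattern = r"-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+"

numbers = [to_number(t) for t in re.findall(pattern, s)]
print(numbers)  # [123, 3.1415926]
```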

There are two things to note here:

• `re.findall` returns captured texts if the regex pattern contains capturing groups in it
• the `r'\\.'` part in your pattern matches two consecutive chars: a literal `\` and any char other than a newline.

The `re.findall` documentation says: "If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match."
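
A small sketch of that rule, using a made-up input string and three illustrative patterns with zero, one, and two capturing groups:

```python
import re

text = "x=1.5, y=20"

# No groups: the whole matches are returned.
no_groups = re.findall(r"[0-9]+\.?[0-9]*", text)

# One group: only the text captured by that group is returned.
# A group that did not take part in the match shows up as ''.
one_group = re.findall(r"[0-9]+(\.[0-9]*)?", text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"([a-z])=([0-9]+)", text)

print(no_groups)   # ['1.5', '20']
print(one_group)   # ['.5', '']
print(two_groups)  # [('x', '1'), ('y', '20')]
```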

Note that to make `re.findall` return just the match values, you can usually:

• remove redundant capturing groups (e.g. `(a(b)c)` -> `abc`)
• convert all capturing groups into non-capturing (that is, replace `(` with `(?:`) unless there are backreferences that refer to the group values in the pattern (then see below)
• use `re.finditer` instead (`[x.group() for x in re.finditer(pattern, s)]`)
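
When a backreference forces a group to stay capturing, the `re.finditer` option still gives the whole matches. A minimal sketch with an illustrative pattern for doubled characters (the pattern and sample text are not from the question):

```python
import re

text = "book keeper"
doubled = r"(\w)\1"  # the backreference \1 needs group 1 to be capturing

# findall returns only what group 1 captured.
captured = re.findall(doubled, text)

# finditer yields match objects, so group() returns the whole match.
whole = [m.group() for m in re.finditer(doubled, text)]

print(captured)  # ['o', 'e']
print(whole)     # ['oo', 'ee']
```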

In your case, `findall` returned captured texts that were all empty because you have `\\` within an `r''` string literal, which tries to match a literal `\`.
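
The difference can be checked directly. This sketch runs the question's string against the doubled-backslash pattern and against the same pattern with a single backslash:

```python
import re

s = "abc123d, hello 3.1415926, this is my book"

# In a raw string, \\. is three characters: the regex "literal backslash, then any char".
double_escaped = r"-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+"
# A single backslash gives the regex "literal dot".
single_escaped = r"-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+"

raw_length = len(r"\\.")
with_double = re.findall(double_escaped, s)
with_single = re.findall(single_escaped, s)

print(raw_length)   # 3
print(with_double)  # ['', '', '']  (matches '123', '3', '1415926'; group never used)
print(with_single)  # ['', '.1415926']
```

Even with the escaping fixed, `findall` still returns the group contents rather than the numbers, which is why the capturing group also has to go.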

To match the numbers, you need to use

``````-?\d*\.?\d+
``````

The regex matches:

• `-?` – Optional minus sign
• `\d*` – Optional digits
• `\.?` – Optional decimal separator
• `\d+` – 1 or more digits.
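
A quick check of this pattern on a few made-up inputs, including a negative number and a number with no leading digit:

```python
import re

number = re.compile(r"-?\d*\.?\d+")

samples = "abc123d, hello 3.1415926, -0.5 and .75 and -42"
found = number.findall(samples)
print(found)  # ['123', '3.1415926', '-0.5', '.75', '-42']

# The pattern requires a digit at the end, so a trailing dot is left out.
trailing_dot = number.findall("3.")
print(trailing_dot)  # ['3']
```

Note the one difference from the pattern in the question: `[0-9]+(\.[0-9]*)?` would accept `3.` with its trailing dot, while this one stops at `3`.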

Here is IDEONE demo:

``````import re
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?\d*\.?\d+'
L = re.findall(pattern, s)
print(L)  # ['123', '3.1415926']
``````

To explain why `search` seemed to return what you wanted and `findall` didn't:

`search` returns a match object (`SRE_Match`) that holds some information, such as:

• `string`: the string that was passed to the search function.
• `re`: the `REGEX` object used in the search function.
• `groups()`: a tuple of the strings captured by the capturing groups inside the `REGEX`.
• `group(index)`: retrieves the string captured by a group, using `index > 0`.
• `group(0)`: returns the whole string matched by the `REGEX`.

`search` stops when it finds the first match, builds the `SRE_Match` object, and returns it. Check this code:

``````import re
s = r'abc123d'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.search(pattern, s)
print(m.string)  # 'abc123d'
print(m.group(0))  # the REGEX matched '123'
print(m.groups())  # the only group, (\.[0-9]*), did not take part in the match, so this prints (None,)
s = ', hello 3.1415926, this is my book'
m2 = re.search(pattern, s)
print(m2.string)    # ', hello 3.1415926, this is my book'
print(m2.group(0))  # the REGEX matched '3.1415926'
print(m2.groups())  # the group captured '.1415926', so this prints ('.1415926',)
``````

`findall` behaves differently because it doesn't stop at the first match; it keeps extracting until the end of the text. But if the `REGEX` contains at least one capturing group, `findall` doesn't return the matched strings but the strings captured by the capturing groups:

``````import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m)  # ['', '.1415926']
``````

The first element comes from the first match, `'123'`, where the capturing group did not take part in the match, so `findall` reports `''`. The second element comes from the second match, `'3.1415926'`, where the capturing group captured `'.1415926'`.

If you want `findall` to return the matched strings, make all capturing groups `()` in your `REGEX` non-capturing groups `(?:)`:

``````import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m)  # ['123', '3.1415926']
``````
The answers/resolutions are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.