Problem :
What is the difference between the search() and match() functions in the Python re module?
I’ve read the documentation (current documentation), but I never seem to remember it. I keep having to look it up and re-learn it. I’m hoping that someone will answer it clearly with examples so that (perhaps) it will stick in my head. Or at least I’ll have a better place to return with my question and it will take less time to re-learn it.
Solution :
re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern.
As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
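The distinction the quote draws between None and a zero-length match can be seen in a minimal sketch. The patterns and variable names here are chosen only for illustration:

```python
import re

# 'x*' may match zero characters, so match() succeeds with an empty match at index 0.
empty = re.match(r"x*", "abc")
# 'x+' needs at least one 'x' at index 0, so match() returns None.
missing = re.match(r"x+", "abc")

print(empty is not None, repr(empty.group()))  # True ''
print(missing)  # None
```

A match object is always truthy, even when it matched the empty string, so `if re.match(...)` treats a zero-length match as success.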
re.search searches the entire string, as the documentation says:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
So if you need to match at the beginning of the string, or to match the entire string (with the pattern ending in $), use match. It only tries one starting position, so it is typically faster. Otherwise use search.
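As a small sketch of the "entire string" case: match only anchors the start, so the end has to be anchored separately. The example below assumes Python 3.4 or later, where re.fullmatch is available; the sample text is made up for illustration.

```python
import re

text = "abcdef"

# match() anchors only the start: trailing characters are allowed.
prefix = re.match(r"abc", text)
# Adding '$' forces the match to reach the end of the string.
anchored = re.match(r"abc$", text)
# fullmatch() requires the whole string to match the pattern.
whole = re.fullmatch(r"abc", text)
whole_ok = re.fullmatch(r"abc\w*", text)

print(prefix.group())    # abc
print(anchored)          # None
print(whole)             # None
print(whole_ok.group())  # abcdef
```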
The documentation has a specific section for match vs. search that also covers multiline strings:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': '^' matches only at the start of the string, or in MULTILINE mode also immediately following a newline. The “match” operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
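The remark about the optional pos argument can be illustrated with a short sketch. It uses compiled patterns, since pos is a parameter of the compiled pattern's match method; the two-character string is just an illustrative input.

```python
import re

text = "ab"

plain = re.compile(r"b")
caret = re.compile(r"^b")

# match() anchors at pos, so 'b' matches when the attempt starts at index 1.
at_pos = plain.match(text, 1)
# '^' still refers to the real start of the string, not to pos, so this fails.
caret_at_pos = caret.match(text, 1)
# Slicing creates a new string, so '^' matches at its start.
caret_sliced = caret.match(text[1:])

print(at_pos.span())         # (1, 2)
print(caret_at_pos)          # None
print(caret_sliced.group())  # b
```

This is why passing pos is not completely equivalent to slicing the string first.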
Now, enough talk. Time to see some example code:
# example code:
import re

string_with_newlines = """something
someotherthing"""

print(re.match('some', string_with_newlines))  # matches
print(re.match('someother', string_with_newlines))  # won't match
print(re.match('^someother', string_with_newlines, re.MULTILINE))  # also won't match
print(re.search('someother', string_with_newlines))  # finds something
print(re.search('^someother', string_with_newlines, re.MULTILINE))  # also finds something

m = re.compile('thing$', re.MULTILINE)
print(m.match(string_with_newlines))  # no match
print(m.match(string_with_newlines, pos=4))  # matches
# The flag is already set on the compiled pattern; a second argument here would be treated as pos.
print(m.search(string_with_newlines))  # also matches
search ⇒ find something anywhere in the string and return a match object.
match ⇒ find something at the beginning of the string and return a match object.
"match is much faster than search, so instead of doing regex.search("word") you can do regex.match("(.*?)word(.*?)") and gain tons of performance if you are working with millions of samples."
This comment from @ivan_bilan under the accepted answer above got me wondering whether such a hack actually speeds anything up, so let’s find out how many tons of performance you will really gain.
I prepared the following test suite:
import random
import re
import string
import time

LENGTH = 10
LIST_SIZE = 1000000

def generate_word():
    word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
    word = ''.join(word)
    return word

wordlist = [generate_word() for _ in range(LIST_SIZE)]

start = time.time()
[re.search('python', word) for word in wordlist]
print('search:', time.time() - start)

start = time.time()
[re.match('(.*?)python(.*?)', word) for word in wordlist]
print('match:', time.time() - start)
I made 10 measurements (1M, 2M, …, 10M words) which gave me the following plot:
As you can see, searching for the pattern 'python' is faster than matching the pattern '(.*?)python(.*?)'.
Python is smart. Avoid trying to be smarter.
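For anyone who wants to repeat the comparison on their own machine, here is a scaled-down sketch using the standard timeit module. The fixed seed, the 10,000-word list, the precompiled patterns, and the appended "xxpythonxx" word are assumptions added here, not part of the test above, and the timings will vary by machine and Python version.

```python
import random
import re
import string
import timeit

random.seed(0)  # assumption: fixed seed so repeated runs use the same words

# 10,000 random 10-letter words (far fewer than the 1M above, to keep the run short)
words = ["".join(random.choices(string.ascii_lowercase, k=10)) for _ in range(10_000)]
words.append("xxpythonxx")  # assumption: one guaranteed hit so both patterns find something

search_re = re.compile(r"python")
match_re = re.compile(r"(.*?)python(.*?)")

def run_search():
    return [bool(search_re.search(w)) for w in words]

def run_match():
    return [bool(match_re.match(w)) for w in words]

# Both approaches must agree on which words contain the pattern.
assert run_search() == run_match()

# Take the best of three repeats to reduce noise from other processes.
search_time = min(timeit.repeat(run_search, number=5, repeat=3))
match_time = min(timeit.repeat(run_match, number=5, repeat=3))
print(f"search: {search_time:.4f}s")
print(f"match:  {match_time:.4f}s")
```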
re.search searches for the pattern throughout the string, whereas re.match does not search at all; it only attempts to match the pattern at the start of the string.
Refer to the example below to see how re.match and re.search behave:
import re
a = "123abc"
m = re.match("[a-z]+", a)   # None: the string does not start with a letter
s = re.search("[a-z]+", a)  # match object for "abc"
re.match will return None, but re.search will return a match object for abc.
The difference is, re.match() misleads anyone accustomed to Perl, grep, or sed regular expression matching, and re.search() does not. 🙂
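More soberly, as John D. Cook remarks, re.match() “behaves as if every pattern has ^ prepended.” In other words, re.match('pattern') roughly equals re.search('^pattern'). (Strictly, the anchor is closer to \A, because in MULTILINE mode ^ also matches after a newline while re.match still only tries the start of the string.) So it anchors a pattern’s left side. But it does not anchor a pattern’s right side: that still requires a terminating $.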
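One place where "prepend ^" and re.match visibly diverge is a pattern with a top-level alternation, because the anchor then binds only to the first branch. The following sketch uses a made-up input to show this, along with a grouped \A form that behaves like re.match:

```python
import re

text = "xb"

# match() applies to the whole pattern: neither 'a' nor 'b' is at index 0.
via_match = re.match(r"a|b", text)
# '^a|b' parses as '(^a)' or '(b)', so the unanchored 'b' branch is found at index 1.
naive = re.search(r"^a|b", text)
# Grouping the alternation and using \A anchors every branch at the start.
grouped = re.search(r"\A(?:a|b)", text)

print(via_match)      # None
print(naive.group())  # b
print(grouped)        # None
```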
Frankly, given the above, I think re.match() should be deprecated. I would be interested to know reasons it should be retained.
Much shorter:
- search scans through the whole string.
- match scans only the beginning of the string.
The following example shows it:
>>> import re
>>> a = "123abc"
>>> print(re.match("[a-z]+", a))
None
>>> re.search("[a-z]+", a).group()
'abc'
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
Quick answer
re.search('test', ' test') # returns a truthy match object (the match can start at any index)
re.match('test', ' test') # returns None (the match must start at index 0)
re.match('test', 'test') # returns a truthy match object (match at index 0)