What is the difference between the search() and match() functions in the Python re module?
I’ve read the documentation (current documentation), but I never seem to remember it. I keep having to look it up and re-learn it. I’m hoping that someone will answer it clearly with examples so that (perhaps) it will stick in my head. Or at least I’ll have a better place to return with my question and it will take less time to re-learn it.
re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern.
As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
re.search searches the entire string, as the documentation says:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
So if you need to match at the beginning of the string, or to match the entire string, use match. It is faster. Otherwise use search.
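A note on "match the entire string": match only anchors the start, so matching the whole string still needs an end anchor in the pattern. The sketch below is an illustration under that assumption, not part of the original answer; the sample string and date pattern are made up, and re.fullmatch is only available from Python 3.4 onward.

```python
import re

text = "2024-01-15 and more"       # hypothetical sample input
date = r"\d{4}-\d{2}-\d{2}"        # hypothetical pattern

# match anchors only the start, so the trailing text is ignored.
prefix = re.match(date, text)

# Adding $ forces the pattern to reach the end of the string.
whole_with_dollar = re.match(date + "$", text)

# fullmatch (Python 3.4+) requires the whole string to match.
whole = re.fullmatch(date, text)
exact = re.fullmatch(date, "2024-01-15")

print(prefix.group())       # 2024-01-15
print(whole_with_dollar)    # None
print(whole)                # None
print(exact.group())        # 2024-01-15
```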
The documentation has a specific section for match vs. search that also covers multiline strings:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': '^' matches only at the start of the string, or in MULTILINE mode also immediately following a newline. The “match” operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
Now, enough talk. Time to see some example code:
# example code:
import re

string_with_newlines = """something
someotherthing"""

print(re.match('some', string_with_newlines))  # matches
print(re.match('someother', string_with_newlines))  # won't match
print(re.match('^someother', string_with_newlines, re.MULTILINE))  # also won't match
print(re.search('someother', string_with_newlines))  # finds something
print(re.search('^someother', string_with_newlines, re.MULTILINE))  # also finds something

m = re.compile('thing$', re.MULTILINE)
print(m.match(string_with_newlines))  # no match
print(m.match(string_with_newlines, pos=4))  # matches
# The flag is already compiled into m; the second positional argument of
# a compiled pattern's search() is pos, not flags, so none is passed here.
print(m.search(string_with_newlines))  # also matches
search ⇒ find something anywhere in the string and return a match object.
match ⇒ find something at the beginning of the string and return a match object.
match is much faster than search, so instead of doing regex.search("word") you can do regex.match("(.*?)word(.*?)") and gain tons of performance if you are working with millions of samples.
This comment from @ivan_bilan under the accepted answer above got me wondering whether such a hack actually speeds anything up, so let’s find out how many tons of performance you will really gain.
I prepared the following test suite:
import random
import re
import string
import time

LENGTH = 10
LIST_SIZE = 1000000

def generate_word():
    word = [random.choice(string.ascii_lowercase) for _ in range(LENGTH)]
    word = ''.join(word)
    return word

wordlist = [generate_word() for _ in range(LIST_SIZE)]

start = time.time()
[re.search('python', word) for word in wordlist]
print('search:', time.time() - start)

start = time.time()
[re.match('(.*?)python(.*?)', word) for word in wordlist]
print('match:', time.time() - start)
I made 10 measurements (1M, 2M, …, 10M words) which gave me the following plot:
As you can see, searching for the pattern 'python' is faster than matching the pattern '(.*?)python(.*?)'.
Python is smart. Avoid trying to be smarter.
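For anyone who wants to repeat the comparison quickly, here is a smaller-scale sketch using timeit. The list size, the fixed seed, the three hand-picked words that contain 'python', and the precompiled patterns are my own assumptions rather than part of the test suite above. Absolute timings will vary by machine and Python version, so no numbers are shown.

```python
import random
import re
import string
import timeit

random.seed(0)  # fixed seed so repeated runs use the same words

def generate_word(length=10):
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

# Much smaller than the original 1M list, plus a few guaranteed hits.
wordlist = [generate_word() for _ in range(10000)]
wordlist += ['xxpythonxx', 'python', 'mypython']

search_re = re.compile('python')
match_re = re.compile('(.*?)python(.*?)')

def run_search():
    return [w for w in wordlist if search_re.search(w)]

def run_match():
    return [w for w in wordlist if match_re.match(w)]

# Both approaches select the same words; only the cost differs.
same_hits = run_search() == run_match()

search_time = timeit.timeit(run_search, number=5)
match_time = timeit.timeit(run_match, number=5)

print('same hits:', same_hits)
print('search:', search_time)
print('match: ', match_time)
```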
re.search searches for the pattern throughout the string, whereas re.match does not search for the pattern; it can only match it at the start of the string.
You can refer to the example below to understand how re.match and re.search work:
import re

a = "123abc"
t = re.match("[a-z]+", a)
t = re.search("[a-z]+", a)
re.match will return None, but re.search will return a match object for abc.
The difference is,
re.match() misleads anyone accustomed to Perl, grep, or sed regular expression matching, and
re.search() does not. 🙂
More soberly, as John D. Cook remarks, re.match() “behaves as if every pattern has ^ prepended.” In other words, re.match('pattern') equals re.search('^pattern'). So it anchors a pattern’s left side. But it doesn’t anchor a pattern’s right side: that still requires a terminating $.
Frankly given the above, I think
re.match() should be deprecated. I would be interested to know reasons it should be retained.
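One caveat to the "^ prepended" description, sketched below on a made-up two-line string: the equivalence holds without flags, but under re.MULTILINE the ^ anchor also matches right after a newline, while re.match still only tries position 0. The \A anchor, which matches only at the very start of the string, is the closer analogue in that mode.

```python
import re

text = "first\nsecond"  # hypothetical sample input

# Without flags, ^ matches only at the start of the string,
# so match and an anchored search agree.
plain_match = re.match("second", text)
plain_search = re.search("^second", text)

# With MULTILINE, ^ also matches just after a newline, so search succeeds
# while match still refuses to look past position 0.
multi_match = re.match("second", text, re.MULTILINE)
multi_search = re.search("^second", text, re.MULTILINE)

# \A matches only at the very start of the string, regardless of flags.
start_only = re.search(r"\Asecond", text, re.MULTILINE)

print(plain_match, plain_search)   # None None
print(multi_match)                 # None
print(multi_search.group())        # second
print(start_only)                  # None
```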
search scans through the whole string.
match scans only the beginning of the string.
The following example shows it:
a = "123abc"
re.match("[a-z]+", a)   # None
re.search("[a-z]+", a)  # match object for 'abc'
re.match attempts to match a pattern at the beginning of the string. re.search attempts to match the pattern throughout the string until it finds a match.
re.search('test', ' test')  # returns a truthy match object (because the search can start from any index)
re.match('test', ' test')   # returns None (because the match must start at index 0)
re.match('test', 'test')    # returns a truthy match object (match at index 0)