Find shortest matches between two strings

Posted on

Question :

Find shortest matches between two strings

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is sample from the inputfile:

start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end

The desired solution should print:

start wait for it...
    profit!
here end
start second match
win. end

I tried a simple regex but it returned everything from start spam. How should this be done?

Edit: Additional info on real-life computational complexity:

  • actual file size: 2GB
  • occurrences of ‘start’: ~ 12 M, evenly distributed
  • occurences of ‘end’: ~800, near the end of the file.

Answer #1:

This regex should match what you want:

(start((?!start).)*?end)

Use re.findall method and single-line modifier re.S to get all the occurences in a multi-line string:

re.findall('(start((?!start).)*?end)', text, re.S)

See a test here.

Answered By: Eero Aaltonen

Answer #2:

Do it with code – basic state machine:

open = False
tmp = []
for ln in fi:
    if 'start' in ln:
        if open:
            tmp = []
        else:
            open = True

    if open:
        tmp.append(ln)

    if 'end' in ln:
        open = False
        for x in tmp:
            print x
        tmp = []
Answered By: famousgarkin

Answer #3:

This is tricky to do because by default, the re module does not look at overlapping matches. Newer versions of Python have a new regex module that allows for overlapping matches.

https://pypi.python.org/pypi/regex

You’d want to use something like

regex.findall(pattern, string, overlapped=True)

If you’re stuck with Python 2.x or something else that doesn’t have regex, it’s still possible with some trickery. One brilliant person solved it here:

Python regex find all overlapping matches?

Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.

Answered By: gkusner

Answer #4:

You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in “end”.

Answered By: TheSoundDefense

Leave a Reply

Your email address will not be published.