I have a large log file, and I want to extract a multi-line string between two strings, start and end.
The following is a sample from the file:
start spam start rubbish start wait for it... profit! here end start garbage start second match win. end
The desired solution should print:
start wait for it... profit! here end start second match win. end
I tried a simple regex, but it returned everything from start spam. How should this be done?
Edit: Additional info on real-life computational complexity:
- actual file size: 2 GB
- occurrences of ‘start’: ~12 M, evenly distributed
- occurrences of ‘end’: ~800, near the end of the file.
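This regex should match what you want: start(?:(?!start).)*?end
Use the re.findall method with the single-line modifier re.S (so that . also matches newlines) to get all the occurrences in a multi-line string:
re.findall(r'start(?:(?!start).)*?end', text, re.S)
The inner group is non-capturing so that findall returns whole matches rather than tuples of groups.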
See a test here.
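As a minimal sketch of how this behaves, the snippet below runs the tempered pattern on the sample from the question. The line breaks in text are an assumption made for illustration, and the pattern is written with a non-capturing group so that findall returns plain strings.

```python
import re

# Sample from the question; the exact line breaks are assumed for illustration.
text = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "    profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end"
)

# "Tempered dot": consume one character at a time, but refuse to step
# onto a position where another "start" begins.
BLOCK = re.compile(r"start(?:(?!start).)*?end", re.S)

blocks = BLOCK.findall(text)
for block in blocks:
    print(block)
```

Note that this needs the whole text in memory as one string, which may be a concern for a 2 GB file.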
Do it with code – basic state machine:
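    in_block = False  # True while we are between a 'start' and its 'end'
    tmp = []          # lines collected for the current candidate block
    for ln in fi:     # fi is the already opened log file, read line by line
        if 'start' in ln:
            # A new 'start' discards whatever was collected so far.
            tmp = []
            in_block = True
        if in_block:
            tmp.append(ln)
        if 'end' in ln:
            # Block complete: print it and reset the state.
            in_block = False
            for x in tmp:
                print(x.rstrip('\n'))
            tmp = []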
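The same state machine can be sketched as a generator, which makes it easier to test and to reuse on a file object. The function name iter_blocks, its parameters, and the sample lines (with assumed line breaks) are illustrative and not part of the original answer.

```python
def iter_blocks(lines, start="start", end="end"):
    """Yield each block that runs from the last `start` line to the next `end` line."""
    buf = []
    inside = False
    for line in lines:
        if start in line:
            # A newer start marker supersedes anything collected so far.
            buf = []
            inside = True
        if inside:
            buf.append(line)
            if end in line:
                yield "".join(buf)
                buf = []
                inside = False


# Sample from the question; the line breaks are assumed for illustration.
sample_lines = [
    "start spam\n",
    "start rubbish\n",
    "start wait for it...\n",
    "    profit!\n",
    "here end\n",
    "start garbage\n",
    "start second match\n",
    "win. end\n",
]

found = list(iter_blocks(sample_lines))
for block in found:
    print(block, end="")
```

Because a file object iterates line by line, passing one to iter_blocks keeps at most one candidate block in memory at a time, which should suit the 2 GB case. Like the original, it uses plain substring checks, so a line containing a word such as "pending" would also count as an end marker.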
This is tricky to do because, by default, the re module does not look at overlapping matches. The third-party regex module (a separate package, not part of the standard library) allows for overlapping matches.
You’d want to use something like
regex.findall(pattern, string, overlapped=True)
If you’re stuck in an environment that doesn’t have regex, it’s still possible with some trickery. One brilliant person solved it here:
Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.
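The linked solution is not reproduced here. One widely used trick with the standard re module, which may or may not be the one meant above, is to wrap the pattern in a zero-width lookahead so that a candidate is reported at every "start". The sketch below then keeps the shortest candidate for each end position. The variable names are illustrative.

```python
import re

text = (
    "start spam start rubbish start wait for it... profit! here end "
    "start garbage start second match win. end"
)

# The lookahead consumes nothing, so the scan can report one
# (overlapping) candidate for every position where "start" begins.
OVERLAPPING = re.compile(r"(?=(start.*?end))", re.S)

shortest = {}
for m in OVERLAPPING.finditer(text):
    # Candidates that share an end offset are nested; later ones begin
    # further right and are therefore shorter, so the last one wins.
    shortest[m.end(1)] = m.group(1)

result = list(shortest.values())
for block in result:
    print(block)
```

Every candidate rescans the text up to the next "end", so with millions of "start" markers and only a few hundred "end" markers this is likely to be very slow on the real file.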
You could do
(?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in “end”.
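A short sketch of this two-step approach on the sample from the question, assuming the text is already loaded into a string named text:

```python
import re

text = (
    "start spam start rubbish start wait for it... profit! here end "
    "start garbage start second match win. end"
)

# Each match runs from a "start" up to the next "start" or "end";
# a trailing "end" is included in the match when it is there.
SEGMENT = re.compile(r"(?s)start.*?(?=end|start)(?:end)?")

segments = SEGMENT.findall(text)

# Segments cut short by another "start" do not end in "end": drop them.
blocks = [s for s in segments if s.endswith("end")]
for block in blocks:
    print(block)
```

Each segment stops at the next marker, so the text is scanned roughly once rather than once per "start".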