I have a file in which lines are separated by a delimiter, say '.'. I want to read this file line by line, where the lines are determined by the presence of '.' instead of a newline.
One way is:
    f = open('file', 'r')
    for line in f.read().strip().split('.'):
        pass  # ... do some work
    f.close()
But this is not memory-efficient if my file is very large. Instead of reading the whole file at once, I want to read it line by line.
open supports a 'newline' parameter, but it only accepts None, '', '\n', '\r', and '\r\n' as input, as mentioned in the documentation.
Is there any way to read a file line by line efficiently, but based on a pre-specified delimiter?
You could use a generator:
    def myreadlines(f, newline):
        buf = ""
        while True:
            # yield every complete line currently in the buffer
            while newline in buf:
                pos = buf.index(newline)
                yield buf[:pos]
                buf = buf[pos + len(newline):]
            chunk = f.read(4096)
            if not chunk:
                # end of file: whatever is left is the last line
                yield buf
                break
            buf += chunk

    with open('file') as f:
        for line in myreadlines(f, "."):
            print(line)
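To see how the buffer copes with a delimiter that straddles two reads, here is a small self-contained check of the same generator, run against an in-memory io.StringIO instead of a real file. The chunk_size parameter is an addition for illustration only (the answer above hard-codes 4096), and the '<>' delimiter and sample text are made up.

```python
import io


def myreadlines(f, newline, chunk_size=4096):
    """Yield the pieces of f separated by `newline`.

    chunk_size is an assumed extra parameter, not part of the original answer.
    """
    buf = ""
    while True:
        # Emit every complete record currently in the buffer.
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(chunk_size)
        if not chunk:
            # End of input: whatever is left is the last record.
            yield buf
            break
        buf += chunk


# A tiny chunk size forces the two-character delimiter to be split across reads.
sample = io.StringIO("alpha<>beta<>gamma")
records = list(myreadlines(sample, "<>", chunk_size=4))
print(records)  # ['alpha', 'beta', 'gamma']
```

As with str.split, input that ends with the delimiter produces a final empty string, so you may want to skip empty records in the loop.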
Here is a more efficient answer using bytearray, which I used for parsing a PDF file:
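    import io
    import re

    # the end-of-line chars, separated by a `|` (logical OR)
    EOL_REGEX = b'\r\n|\r|\n'

    # the end-of-file marker
    EOF = b'%%EOF'

    def readlines(fio):
        buf = bytearray(4096)
        while True:
            n = fio.readinto(buf)  # number of bytes actually read
            if not n:
                break  # end of file reached without an EOF marker
            chunk = buf[:n]  # ignore stale bytes left over from the previous read
            try:
                # everything before the EOF marker is yielded as one final piece
                yield chunk[: chunk.index(EOF)]
            except ValueError:
                pass
            else:
                break
            # each block is split on its own, so a line spanning two blocks
            # comes out in two pieces
            for line in re.split(EOL_REGEX, chunk):
                yield line

    with io.FileIO("test.pdf") as fio:
        for line in readlines(fio):
            ...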
The above example also handles a custom EOF. If you don’t want that, use this:
    import io
    import os
    import re

    # the end-of-line chars, separated by a `|` (logical OR)
    EOL_REGEX = b'\r\n|\r|\n'

    def readlines(fio, size):
        buf = bytearray(4096)
        while True:
            if fio.tell() >= size:
                break
            n = fio.readinto(buf)  # number of bytes actually read
            # split only the bytes just read; the last block is usually shorter
            # than the buffer, and the rest of it still holds old data
            for line in re.split(EOL_REGEX, buf[:n]):
                yield line

    size = os.path.getsize("test.pdf")
    with io.FileIO("test.pdf") as fio:
        for line in readlines(fio, size):
            ...
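One way to keep a record from being cut in two at a block boundary is to leave the unconsumed tail in the buffer until the next read. The following is only a minimal sketch of that idea for a single fixed byte delimiter (not a regex); iter_records, chunk_size, and the '%%' delimiter in the demo are names and values made up for illustration, and the stream is assumed to be opened in binary mode.

```python
import io


def iter_records(stream, delimiter, chunk_size=4096):
    """Yield the byte strings between occurrences of `delimiter` in a binary stream."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        start = 0
        while True:
            pos = buf.find(delimiter, start)
            if pos == -1:
                break
            yield bytes(buf[start:pos])
            start = pos + len(delimiter)
        # Drop what was emitted; an incomplete record (or half a delimiter)
        # stays in the buffer for the next read.
        del buf[:start]
    # Whatever remains after the last delimiter is the final record.
    yield bytes(buf)


# A chunk size of 3 makes the two-byte delimiter straddle read boundaries.
demo = io.BytesIO(b"ab%%cd%%ef")
result = list(iter_records(demo, b"%%", chunk_size=3))
print(result)  # [b'ab', b'cd', b'ef']
```

The same function should work on a file opened with open(path, 'rb') or io.FileIO, since it only relies on read(). Memory use stays bounded by the chunk size plus the longest single record.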
The easiest way would be to preprocess the file to generate newlines where you want.
Here’s an example using perl (assuming you want the string ‘abc’ to be the newline):
    perl -pe 's/abc/\n/g' text.txt > processed_text.txt
If you also want to ignore the original newlines, use the following instead:
    perl -ne 's/\n//; s/abc/\n/g; print' text.txt > processed_text.txt