Question:
I have a file in which lines are separated by a delimiter, say '.'. I want to read this file line by line, where lines are determined by the presence of '.' instead of a newline.
One way is:
f = open('file', 'r')
for line in f.read().strip().split('.'):
    pass  # ....do some work
f.close()
But this is not memory efficient if the file is very large. Instead of reading the whole file at once, I want to read it line by line.
open supports a parameter ‘newline’, but this parameter only takes None, '', '\n', '\r', and '\r\n' as input, as mentioned in the documentation.
Is there any way to read a file line by line efficiently, but based on a pre-specified delimiter?
Answer #1:
You could use a generator:
def myreadlines(f, newline):
    buf = ""
    while True:
        # yield every complete line currently in the buffer
        while newline in buf:
            pos = buf.index(newline)
            yield buf[:pos]
            buf = buf[pos + len(newline):]
        chunk = f.read(4096)
        if not chunk:
            # end of file: whatever is left is the last line
            yield buf
            break
        buf += chunk

with open('file') as f:
    for line in myreadlines(f, "."):
        print(line)
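A quick way to check the chunked approach is to run it on an in-memory stream with a deliberately tiny chunk size, so that delimiters land on chunk boundaries. The sketch below is a variant of the same idea, not the generator above verbatim: the names read_delimited and chunk_size are made up for illustration, and unlike the generator above it skips an empty final piece.

```python
import io


def read_delimited(f, delimiter, chunk_size=4096):
    """Yield the pieces of text stream `f` that are separated by `delimiter`.

    Hypothetical helper: reads `chunk_size` characters at a time, so only the
    current chunk plus one unfinished piece is held in memory.
    """
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        pieces = buf.split(delimiter)
        # The last piece may be incomplete (or end in half a delimiter),
        # so keep it in the buffer until more data arrives.
        buf = pieces.pop()
        for piece in pieces:
            yield piece
    if buf:
        # Leftover text after the final delimiter; an empty leftover is skipped.
        yield buf


# Single-character delimiter, chunks of 2 characters.
print(list(read_delimited(io.StringIO("a.b.c."), ".", chunk_size=2)))

# Multi-character delimiter that straddles chunk boundaries.
print(list(read_delimited(io.StringIO("1abc2abc3"), "abc", chunk_size=2)))
```

Because the unfinished piece stays in the buffer, a multi-character delimiter split across two reads is still found once the next chunk arrives.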
Answer #2:
Here is a more efficient answer, using FileIO and bytearray, that I used for parsing a PDF file:
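import io
import re

# the end-of-line chars, separated by a `|` (logical OR)
EOL_REGEX = b'\r\n|\r|\n'
# the end-of-file marker
EOF = b'%%EOF'

def readlines(fio):
    buf = bytearray(4096)
    tail = b''  # unfinished line carried over from the previous chunk
    while True:
        n = fio.readinto(buf) or 0  # number of bytes actually read
        data = tail + bytes(buf[:n])
        pos = data.find(EOF)
        if pos != -1:
            data = data[:pos]  # drop the marker and everything after it
        lines = re.split(EOL_REGEX, data)
        if pos == -1 and n:
            # more data may follow: hold back the unfinished last line,
            # plus a trailing '\r' that may be the first half of '\r\n'
            tail = lines.pop()
            if data.endswith(b'\r'):
                tail = lines.pop() + b'\r'
        for line in lines:
            yield line
        if pos != -1 or not n:
            break

with io.FileIO("test.pdf") as fio:
    for line in readlines(fio):
        ...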
The above example also handles a custom EOF. If you don’t want that, use this:
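import io
import os
import re

# the end-of-line chars, separated by a `|` (logical OR)
EOL_REGEX = b'\r\n|\r|\n'

def readlines(fio, size):
    buf = bytearray(4096)
    tail = b''  # unfinished line carried over from the previous chunk
    while fio.tell() < size:
        n = fio.readinto(buf) or 0  # number of bytes actually read
        if not n:
            break  # the file ended earlier than `size` suggested
        data = tail + bytes(buf[:n])
        lines = re.split(EOL_REGEX, data)
        tail = lines.pop()  # may continue in the next chunk
        if data.endswith(b'\r'):
            # a trailing '\r' may be the first half of '\r\n'
            tail = lines.pop() + b'\r'
        for line in lines:
            yield line
    for line in re.split(EOL_REGEX, tail):
        yield line

size = os.path.getsize("test.pdf")
with io.FileIO("test.pdf") as fio:
    for line in readlines(fio, size):
        ...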
Answer #3:
The easiest way would be to preprocess the file to generate newlines where you want.
Here’s an example using perl (assuming you want the string ‘abc’ to be the newline):
perl -pe 's/abc/\n/g' text.txt > processed_text.txt
If you also want to ignore the original newlines, use the following instead:
perl -ne 's/\n//; s/abc/\n/g; print' text.txt > processed_text.txt
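If Perl is not available, the same preprocessing can be sketched in Python without loading the whole file. This is a rough equivalent rather than a drop-in replacement: the function name delimiter_to_newline and its parameters drop_newlines and chunk_size are made up for illustration, the delimiter is assumed not to contain a newline itself, and with drop_newlines=True it also matches a delimiter that straddles an original line break, which the line-by-line Perl one-liner would not.

```python
import io


def delimiter_to_newline(src, dst, delimiter, drop_newlines=False, chunk_size=4096):
    """Copy text stream `src` to `dst`, replacing each `delimiter` with a newline.

    Hypothetical helper. With drop_newlines=True the original newlines are
    removed first, similar to the second Perl command.
    """
    buf = ""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if drop_newlines:
            chunk = chunk.replace("\n", "")
        pieces = (buf + chunk).split(delimiter)
        # The last piece may end in the first half of a delimiter,
        # so hold it back until the next chunk has been read.
        buf = pieces.pop()
        for piece in pieces:
            dst.write(piece + "\n")
    dst.write(buf)  # text after the last delimiter, written unchanged


# In-memory demonstration with a tiny chunk size.
out = io.StringIO()
delimiter_to_newline(io.StringIO("1abc2\n3abc4"), out, "abc", chunk_size=3)
print(repr(out.getvalue()))
```

With real files, src and dst would be two handles from open(), for example text.txt opened for reading and processed_text.txt opened for writing, after which the result can be read with an ordinary for loop over the file.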