How can I read large text files in Python, line by line, without loading it into memory?

Solving problems is about exposing yourself to as many situations as possible, like "How can I read large text files in Python, line by line, without loading them into memory?", and practicing these strategies over and over. With time, this becomes second nature and the natural way you approach any problem. Big or small, always start with a plan, and use the other strategies mentioned here until you are confident and ready to code the solution.
In this post, my aim is to give an overview of the topic "How can I read large text files in Python, line by line, without loading them into memory?", which you can follow at any time. The discussion below is easy to follow.

How can I read large text files in Python, line by line, without loading it into memory?

I need to read a large file, line by line. Let's say the file is more than 5 GB and I need to read each line, but obviously I do not want to use readlines() because it will create a very large list in memory.

How will the code below work for this case? Does xreadlines itself read the file one line at a time into memory? Is the generator expression even needed?

f = (line for line in open("log.txt").xreadlines())  # how much is loaded in memory?
f.next()  # note: xreadlines() and .next() are Python 2 only; in Python 3 you iterate the file object directly

Plus, what can I do to read the file in reverse order, just as the Linux tail command does?

I found:

http://code.google.com/p/pytailer/

and

python head, tail and backward read by lines of a text file

Both worked very well!
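
If you would rather not pull in an extra dependency, here is a minimal sketch of the same idea those tools use: seek to the end of the file and read fixed-size chunks backwards, splitting on newlines as you go. The names reverse_lines, path and chunk_size are illustrative, not taken from either project, and the sketch glosses over a few edge cases around blank lines at the very start of the file.

import os

def reverse_lines(path, chunk_size=8192):
    """Yield the lines of a text file from last to first, keeping at most
    one chunk plus one partial line in memory at any time."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = b""
        first_chunk = True
        while position > 0:
            read_size = min(chunk_size, position)
            position -= read_size
            f.seek(position)
            chunk = f.read(read_size)
            if first_chunk:
                first_chunk = False
                if chunk.endswith(b"\n"):
                    chunk = chunk[:-1]   # ignore the file's trailing newline
            buffer = chunk + buffer
            lines = buffer.split(b"\n")
            buffer = lines[0]            # may be an incomplete line; keep it
            for line in reversed(lines[1:]):
                yield line.decode()
        if buffer:
            yield buffer.decode()

Used as for line in reverse_lines("log.txt"), it streams the file from the end without ever holding more than one chunk in memory.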

Answer #1:

I provided this answer because Keith’s, while succinct, doesn’t close the file explicitly:

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)
Answered By: John La Rooy

Answer #2:

All you need to do is use the file object as an iterator.

for line in open("log.txt"):
    do_something_with(line)

Even better is to use a context manager, available in recent Python versions.

with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)

This will automatically close the file as well.
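
As a concrete, purely hypothetical example of what do_something_with might be, counting the lines that contain a given marker works on a multi-gigabyte log while holding only one line in memory at a time; the "ERROR" marker below is just an illustration.

error_count = 0
with open("log.txt") as fileobject:
    for line in fileobject:
        # only the current line is held in memory at any point
        if "ERROR" in line:
            error_count += 1
print(error_count)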

Answered By: Keith

Answer #3:

You are better off using an iterator instead. Relevant: http://docs.python.org/library/fileinput.html

From the docs:

import fileinput
for line in fileinput.input("filename"):
    process(line)

This will avoid copying the whole file into memory at once.
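
As a small aside (not from the answer above), fileinput.input() also accepts a sequence of filenames and falls back to standard input when none are given, which is handy for chaining several rotated logs; the filenames below are hypothetical.

import fileinput

# Stream two (hypothetical) log files one line at a time, in order.
for line in fileinput.input(files=("app.log", "app.log.1")):
    process(line)   # same placeholder as in the answer above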

Answered By: Mikola

Answer #4:

An old school approach:

fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
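
On Python 3.8 and later, the same read-one-line-at-a-time loop can be written with an assignment expression; this is only a sketch, reusing file_name from above and adding a context manager so the file is closed automatically.

with open(file_name, 'rt') as fh:
    while line := fh.readline():
        # do stuff with line; readline() returns "" at end of file,
        # which ends the loop
        ...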
Answered By: PTBNL

Answer #5:

Here’s what you can do if you don’t have newlines in the file:

with open('large_text.txt') as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c)
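
If the data does have record separators, just not newlines, the same chunked read can be wrapped in a generator that yields one record at a time; read_records, delimiter and the ';' separator below are hypothetical, a sketch rather than part of the original answer.

def read_records(path, delimiter=';', chunk_size=1024):
    """Yield delimiter-separated records from a file with no newlines,
    keeping at most one chunk plus one partial record in memory."""
    with open(path) as f:
        buffer = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            records = buffer.split(delimiter)
            buffer = records.pop()   # last piece may be cut off mid-record
            for record in records:
                yield record
        if buffer:
            yield buffer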
Answered By: Ariel Cabib

Answer #6:

Please try this:

with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print(line)
Answered By: jyoti das

Answer #7:

I couldn’t believe that it could be as easy as @john-la-rooy’s answer made it seem. So, I recreated the cp command using line-by-line reading and writing. It’s CRAZY FAST.

#!/usr/bin/env python3.6
import sys
with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)
Answered By: Bruno Bronosky

Answer #8:

The Blaze project has come a long way over the last 6 years, and the related dask library now has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.

import dask.dataframe as dd
df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows
# iterate rows
for idx, row in df.iterrows():
    ...
# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()
# slice by column
df[df.my_field=='XYZ'].compute()
Answered By: jpp
