Problem solving is largely about exposing yourself to as many situations as possible, such as "How can I read large text files in Python, line by line, without loading them into memory?", and practicing the same strategies over and over. With time it becomes second nature and the natural way you approach any problem. Big or small, always start with a plan and use the other strategies mentioned here until you are confident and ready to code the solution.
In this post, my aim is to give an overview of the question "How can I read large text files in Python, line by line, without loading them into memory?", which you can come back to at any time. The discussion below is easy to follow.
I need to read a large file, line by line. Let's say the file is more than 5GB and I need to read each line, but obviously I do not want to use readlines() because it will create a very large list in memory.
How will the code below work for this case? Is xreadlines itself reading one line at a time into memory? Is the generator expression needed?
f = (line for line in open("log.txt").xreadlines())  # how much is loaded in memory?
f.next()
Plus, what can I do to read this in reverse order, just as the Linux tail command does?
Both worked very well!
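The reverse-order part of the question is not addressed directly by the answers below, so here is a minimal sketch of one approach (my own, with an arbitrary helper name reverse_lines and an arbitrary block size): seek to the end of the file and read fixed-size blocks backwards, yielding complete lines as they accumulate, so only one block plus a partial line is ever in memory.
import os

def reverse_lines(path, block_size=4096, encoding="utf-8"):
    # Yield the lines of a file from last to first, reading one block at a time.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = b""
        first_block = True
        while position > 0:
            read_size = min(block_size, position)
            position -= read_size
            f.seek(position)
            block = f.read(read_size)
            if first_block:
                first_block = False
                if block.endswith(b"\n"):
                    block = block[:-1]  # ignore the file's trailing newline
            buffer = block + buffer
            lines = buffer.split(b"\n")
            buffer = lines[0]  # may be an incomplete line; keep it for the next block
            for line in reversed(lines[1:]):
                yield line.decode(encoding)
        if buffer:
            yield buffer.decode(encoding)

for line in reverse_lines("log.txt"):
    print(line)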
I provided this answer because Keith’s, while succinct, doesn’t close the file explicitly
with open("log.txt") as infile: for line in infile: do_something_with(line)
All you need to do is use the file object as an iterator.
for line in open("log.txt"):
    do_something_with(line)
Even better is to use a context manager, available in recent versions of Python.
with open("log.txt") as fileobject: for line in fileobject: do_something_with(line)
This will automatically close the file as well.
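As a concrete illustration (the filename and the search string here are made up), do_something_with can be anything that inspects a single line at a time, for example counting error lines, so only one line is ever held in memory:
error_count = 0
with open("log.txt") as fileobject:
    for line in fileobject:
        # Only the current line is in memory at any point.
        if "ERROR" in line:
            error_count += 1
print(error_count)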
You are better off using an iterator instead. Relevant: http://docs.python.org/library/fileinput.html
From the docs:
import fileinput

for line in fileinput.input("filename"):
    process(line)
This will avoid copying the whole file into memory at once.
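fileinput can also chain several files (or read standard input when no name is given), which is handy when a large log has been split into rotated pieces; a small sketch with made-up filenames:
import fileinput

# Lines come from log.1 first, then log.2, one line at a time.
for line in fileinput.input(files=("log.1", "log.2")):
    print(fileinput.filename(), fileinput.filelineno(), line.rstrip("\n"))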
An old school approach:
fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
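On Python 3.8 and later, the same read-until-empty loop can be written with an assignment expression, keeping a single readline() call:
with open(file_name, 'rt') as fh:
    # readline() returns '' only at end of file, which is falsy and ends the loop.
    while line := fh.readline():
        do_something_with(line)  # same placeholder as in the answers above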
Here’s what you do if you don’t have newlines in the file:
with open('large_text.txt') as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c)
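The same fixed-size-chunk loop can also be written with the two-argument form of iter(), which keeps calling f.read(1024) until it returns the empty-string sentinel; a small variant of the snippet above:
with open('large_text.txt') as f:
    # iter(callable, sentinel) stops when f.read(1024) returns ''.
    for chunk in iter(lambda: f.read(1024), ''):
        print(chunk)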
Please try this:
with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print(line)
I couldn’t believe that it could be as easy as @john-la-rooy’s answer made it seem. So, I recreated the cp command using line-by-line reading and writing. It’s CRAZY FAST.
#!/usr/bin/env python3.6
import sys

# Destination is the second argument and source the first, as with cp SRC DST.
with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)
The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.
dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.
import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field == 'XYZ'].compute()