Solving problems is about exposing yourself to as many situations as possible, like concatenating text files in Python, and practicing these strategies over and over. With time, it becomes second nature and a natural way to approach any problem. Big or small, always start with a plan, and use the other strategies mentioned here until you are confident and ready to code the solution.
In this post, my aim is to share an overview of the topic of concatenating text files in Python, which you can follow at any time. Take it slow and follow the discussion.
I have a list of 20 file names, like
['file1.txt', 'file2.txt', ...]. I want to write a Python script to concatenate these files into a new file. I could open each file by
f = open(...), read line by line by calling
f.readline(), and write each line into that new file. It doesn’t seem very “elegant” to me, especially the part where I have to read/write line by line.
Is there a more “elegant” way to do this in Python?
This should do it.
For large files:
filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)
For small files:
filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
… and another interesting one that I thought of:
import itertools

filenames = ['file1.txt', 'file2.txt', ...]
with open('path/to/output/file', 'w') as outfile:
    # on Python 2 this was itertools.imap; on Python 3 the built-in map replaces it
    for line in itertools.chain.from_iterable(map(open, filenames)):
        outfile.write(line)
Sadly, this last method leaves a few open file descriptors, which the GC should take care of anyway. I just thought it was interesting.
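If the leaked descriptors bother you, here is a minimal sketch of one way to keep the same chained iteration while closing each file promptly (the chained_lines helper is a hypothetical name, not from the original answer):

def chained_lines(filenames):
    # yields every line of every file, closing each file
    # as soon as it is exhausted
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                yield line

with open('path/to/output/file', 'w') as outfile:
    for line in chained_lines(filenames):
        outfile.write(line)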
Use shutil.copyfileobj. It automatically reads the input files chunk by chunk for you, which is more efficient than reading each input file in whole, and it will work even if some of the input files are too large to fit into memory:
import shutil

with open('output_file.txt', 'wb') as wfd:
    for f in ['seg1.txt', 'seg2.txt', 'seg3.txt']:
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
That’s exactly what fileinput is for:
import fileinput

with open(outfilename, 'w') as fout, fileinput.input(filenames) as fin:
    for line in fin:
        fout.write(line)
For this use case, it’s really not much simpler than just iterating over the files manually, but in other cases, having a single iterator that iterates over all of the files as if they were a single file is very handy. (Also, the fact that
fileinput closes each file as soon as it’s done means there’s no need to
close each one, but that’s just a one-line savings, not that big of a deal.)
There are some other nifty features in
fileinput, like the ability to do in-place modifications of files just by filtering each line.
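For instance, here is a minimal sketch of that in-place mode (the file name data.txt and the uppercasing are just illustrative assumptions): with inplace=True, standard output is redirected into the file being read, so whatever you print replaces each line.

import fileinput

# uppercase every line of data.txt, modifying the file in place
with fileinput.input('data.txt', inplace=True) as fin:
    for line in fin:
        print(line.rstrip('\n').upper())  # the printed text replaces the line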
As noted in the comments, and discussed in another post,
fileinput for Python 2.7 will not work as indicated (in 2.7, fileinput.input() cannot be used as a context manager). Here is a slight modification to make the code Python 2.7 compliant:
with open('outfilename', 'w') as fout:
    fin = fileinput.input(filenames)
    for line in fin:
        fout.write(line)
    fin.close()
I don’t know about elegance, but this works:
import glob
import os

for f in glob.glob("file*.txt"):
    os.system("cat " + f + " >> OutFile.txt")
What’s wrong with UNIX commands? (given you’re not working on Windows):
ls | xargs cat | tee output.txt does the job (you can call it from Python with subprocess if you want).
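If you do want to drive that pipeline from Python, a rough sketch using subprocess might look like this (assuming a UNIX-like shell; subprocess.run needs Python 3.5+):

import subprocess

# run the same shell pipeline in the current directory;
# shell=True hands the command string to the shell
subprocess.run("ls | xargs cat | tee output.txt", shell=True, check=True)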
outfile.write(infile.read())                # time: 2.1085190773010254s
shutil.copyfileobj(fd, wfd, 1024*1024*10)   # time: 0.60599684715271s
A simple benchmark shows that shutil performs better.
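The timings above come from the original answer; a minimal harness along these lines could be used to repeat the comparison (the file names, output paths, and the 10 MB buffer size are assumptions):

import shutil
import time

def time_readwrite(filenames, outpath):
    # whole-file read()/write() approach
    start = time.perf_counter()
    with open(outpath, 'wb') as outfile:
        for fname in filenames:
            with open(fname, 'rb') as infile:
                outfile.write(infile.read())
    return time.perf_counter() - start

def time_copyfileobj(filenames, outpath):
    # shutil.copyfileobj with a 10 MB buffer
    start = time.perf_counter()
    with open(outpath, 'wb') as wfd:
        for fname in filenames:
            with open(fname, 'rb') as fd:
                shutil.copyfileobj(fd, wfd, 1024 * 1024 * 10)
    return time.perf_counter() - start

print(time_readwrite(['seg1.txt', 'seg2.txt'], 'out_a.txt'))
print(time_copyfileobj(['seg1.txt', 'seg2.txt'], 'out_b.txt'))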
An alternative to @inspectorG4dget’s answer (the best answer to date, 29-03-2016). I tested with 3 files of 436 MB.
@inspectorG4dget’s solution: 162 seconds
The following solution: 125 seconds
from subprocess import Popen

filenames = ['file1.txt', 'file2.txt', 'file3.txt']

fbatch = open('batch.bat', 'w')
cmd = "type "  # renamed from str to avoid shadowing the built-in
for f in filenames:
    cmd += f + " "
fbatch.write(cmd + " > file4results.txt")
fbatch.close()

p = Popen("batch.bat", cwd=r"Drive:\path\to\folder")
stdout, stderr = p.communicate()
The idea is to create a batch file and execute it, taking advantage of “good old technology”. It’s only semi-Python, but it works faster. Windows only.
If the files are not gigantic:
with open('newfile.txt', 'wb') as newf:
    for filename in list_of_files:
        with open(filename, 'rb') as hf:
            newf.write(hf.read())
            # newf.write(b'\n\n\n')  if you want to introduce
            # some blank lines between the contents of the copied files
If the files are too big to be entirely read and held in RAM, the algorithm must be a little different: read each file to be copied in a loop, in chunks of fixed length, using
read(10000) for example.
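A minimal sketch of that chunked variant, reusing list_of_files from above (the 10000-byte chunk size is the one suggested):

with open('newfile.txt', 'wb') as newf:
    for filename in list_of_files:
        with open(filename, 'rb') as hf:
            while True:
                chunk = hf.read(10000)  # read a fixed-length chunk
                if not chunk:           # empty bytes means end of file
                    break
                newf.write(chunk)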