I have written a program that can be summarized as follows:
    import multiprocessing

    def loadHugeData():
        # load it
        return data

    def processHugeData(data, res_queue):
        for item in data:
            # process it
            res_queue.put(result)
        res_queue.put("END")

    def writeOutput(outFile, res_queue):
        with open(outFile, 'w') as f:
            res = res_queue.get()
            while res != 'END':
                f.write(res)
                res = res_queue.get()

    res_queue = multiprocessing.Queue()

    if __name__ == '__main__':
        data = loadHugeData()
        p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
        p.start()
        processHugeData(data, res_queue)
        p.join()
The real code (especially writeOutput()) is a lot more complicated. writeOutput() only uses the values that it takes as its arguments (meaning it does not reference the loaded data).
Basically it loads a huge dataset into memory and processes it. Writing of the output is delegated to a sub-process (it writes into multiple files actually and this takes a lot of time).
So each time a data item is processed, the result is sent through res_queue to the sub-process, which in turn writes it into files as needed.
The sub-process does not need to access, read or modify the data loaded by loadHugeData() in any way. It only needs to use what the main process sends it through res_queue. And this leads me to my problem and question.
It seems to me that the sub-process gets its own copy of the huge dataset (when checking memory usage with top). Is this true? And if so, how can I avoid it (essentially using double the memory)?
I am using Python 2.6 and the program is running on Linux.
On Linux, the multiprocessing module is effectively based on the fork system call, which creates a copy of the current process. Since you are loading the huge data before you fork (that is, before you start the multiprocessing.Process), the child process inherits a copy of the data.
However, if the operating system you are running on implements COW (copy-on-write), as Linux does, there will only actually be one copy of the data in physical memory unless you modify the data in either the parent or the child process. Both processes share the same physical memory pages, albeit in different virtual address spaces, and even after a modification, additional memory is only allocated for the pages that were changed.
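The inheritance described above can be observed with a small sketch. It is written for Python 3 on a Unix system where the "fork" start method is available (an assumption; the question uses Python 2.6, where fork is the only method on Linux). The names child, demo and the tiny list standing in for the huge dataset are made up for illustration. The sketch only shows the logical behaviour: the child sees data it was never passed, and its changes stay private. It does not measure how many physical pages are shared.

```python
import multiprocessing

# Module-level placeholder; filled in by demo() before the fork.
data = None


def child(result_queue):
    # 'data' was never passed as an argument, yet it is visible here
    # because the child started as a copy of the parent process.
    result_queue.put(len(data))
    # This change only affects the child's private copy.
    data.append("changed in child")
    result_queue.put(len(data))


def demo():
    global data
    # Assumption: a Unix platform where the "fork" start method exists.
    ctx = multiprocessing.get_context("fork")
    data = list(range(5))  # stand-in for the huge dataset
    result_queue = ctx.Queue()
    p = ctx.Process(target=child, args=(result_queue,))
    p.start()  # the fork happens here, after the data was loaded
    seen_by_child = result_queue.get()
    after_child_change = result_queue.get()
    p.join()
    # The parent's list is untouched by the child's append.
    return seen_by_child, after_child_change, len(data)


if __name__ == "__main__":
    print(demo())
```

The child reports the inherited length and then its own modified length, while the length in the parent stays the same.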
You can avoid this situation by creating and starting the multiprocessing.Process before you load your huge data. The memory you allocate afterwards, when you load the data in the parent, will then not be reflected in the child process.
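A minimal sketch of that reordering, assuming Python 3 on a Unix system with the "fork" start method: load_huge_data, write_output, run and the upper-casing "processing" step are hypothetical stand-ins for the real code in the question, not the original implementation. The only point is the order: the writer is started first, and the data is loaded afterwards.

```python
import multiprocessing
import os
import tempfile

SENTINEL = "END"


def load_huge_data():
    # Hypothetical stand-in for the real loader.
    return ["item %d" % i for i in range(1000)]


def write_output(out_file, res_queue):
    # Runs in the child; it only ever sees what arrives on the queue.
    with open(out_file, "w") as f:
        # iter() with a sentinel keeps calling get() until "END" arrives.
        for res in iter(res_queue.get, SENTINEL):
            f.write(res + "\n")


def run(out_file):
    # Assumption: a Unix platform where the "fork" start method exists.
    ctx = multiprocessing.get_context("fork")
    res_queue = ctx.Queue()
    writer = ctx.Process(target=write_output, args=(out_file, res_queue))
    writer.start()  # fork first, while the parent is still small

    data = load_huge_data()  # allocated in the parent only
    for item in data:
        res_queue.put(item.upper())  # placeholder for the real processing
    res_queue.put(SENTINEL)

    # Flush the queue's feeder thread, then wait for the writer to finish.
    res_queue.close()
    res_queue.join_thread()
    writer.join()
    return len(data)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "out.txt")
    print(run(path), "items written to", path)
```

Because the child is forked before load_huge_data() runs, the dataset never exists in the child's address space, so it cannot show up in the child's memory usage at all.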