I have a function which processes a DataFrame, largely to bucket the data and create a binary matrix of features from a particular column using pd.get_dummies(df[col]).
To avoid processing all of my data with this function at once (which runs out of memory and causes IPython to crash), I have broken the large DataFrame into chunks using:
    chunks = (len(df) / 10000) + 1
    df_list = np.array_split(df, chunks)
pd.get_dummies(df[col]) will automatically create new columns based on the contents of df[col], and these are likely to differ for each chunk in df_list.
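For example, here is a minimal sketch (with made-up values) showing how the dummy columns can differ between two chunks:

    import pandas as pd

    # Two hypothetical chunks whose 'col' values do not fully overlap.
    chunk_a = pd.DataFrame({'col': ['red', 'blue']})
    chunk_b = pd.DataFrame({'col': ['blue', 'green']})

    print(pd.get_dummies(chunk_a['col']).columns.tolist())  # ['blue', 'red']
    print(pd.get_dummies(chunk_b['col']).columns.tolist())  # ['blue', 'green']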
After processing, I am concatenating the DataFrames back together using:
    for i, df_chunk in enumerate(df_list):
        print "chunk", i
        [x, y] = preprocess_data(df_chunk)
        super_x = pd.concat([super_x, x], axis=0)
        super_y = pd.concat([super_y, y], axis=0)
        print datetime.datetime.utcnow()
The processing time of the first chunk is perfectly acceptable; however, it grows with each chunk! This is not down to preprocess_data(df_chunk), as there is no reason for its runtime to increase. Is this increase in time occurring as a result of the call to pd.concat()?
Please see log below:
    chunks 6
    chunk 0 2016-04-08 00:22:17.728849
    chunk 1 2016-04-08 00:22:42.387693
    chunk 2 2016-04-08 00:23:43.124381
    chunk 3 2016-04-08 00:25:30.249369
    chunk 4 2016-04-08 00:28:11.922305
    chunk 5 2016-04-08 00:32:00.357365
Is there a workaround to speed this up? I have 2900 chunks to process, so any help is appreciated!
Open to any other suggestions in Python!
Never call pd.concat inside a for-loop. It leads to quadratic copying.
pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames have to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):
    super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
|-----------|---------------------|-----------|------------------|
|         0 |                   0 |         1 |                1 |
|         1 |                   1 |         1 |                2 |
|         2 |                   2 |         1 |                3 |
|       ... |                     |           |                  |
|       N-1 |                 N-1 |         1 |                N |
Since 1 + 2 + 3 + ... + N = N(N+1)/2, O(N**2) copies are required to complete the loop.
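You can see this directly with a small (hypothetical) micro-benchmark; doubling the number of rows should roughly quadruple the runtime:

    import time
    import pandas as pd

    row = pd.DataFrame({'a': [1.0]})
    for n in [1000, 2000, 4000]:
        start = time.time()
        acc = pd.DataFrame()
        for _ in range(n):
            acc = pd.concat([acc, row], axis=0)  # copies all of acc each time
        print("%d rows: %.2f seconds" % (n, time.time() - start))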
Now consider:

    super_x = []
    for i, df_chunk in enumerate(df_list):
        [x, y] = preprocess_data(df_chunk)
        super_x.append(x)
    super_x = pd.concat(super_x, axis=0)
Appending to a list is an O(1) operation and does not require copying. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires only O(N) copies.
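Rewriting the micro-benchmark above to append to a list and concatenate once shows the roughly linear scaling:

    import time
    import pandas as pd

    row = pd.DataFrame({'a': [1.0]})
    for n in [1000, 2000, 4000]:
        start = time.time()
        pieces = []
        for _ in range(n):
            pieces.append(row)           # O(1), no copying
        acc = pd.concat(pieces, axis=0)  # a single O(N) copy at the end
        print("%d rows: %.2f seconds" % (n, time.time() - start))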
Every time you concatenate, you are returning a copy of the data.
You want to keep a list of your chunks, and then concatenate everything as the final step.
    df_x = []
    df_y = []
    for i, df_chunk in enumerate(df_list):
        print "chunk", i
        [x, y] = preprocess_data(df_chunk)
        df_x.append(x)
        df_y.append(y)

    super_x = pd.concat(df_x, axis=0)
    del df_x  # Free up memory.
    super_y = pd.concat(df_y, axis=0)
    del df_y  # Free up memory.
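One caveat worth checking against your data: since the get_dummies columns can differ from chunk to chunk, the final pd.concat takes the union of all columns and fills the missing entries with NaN. If you need a strictly binary matrix, you can zero-fill those gaps after the final concatenation:

    super_x = pd.concat(df_x, axis=0).fillna(0)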