Question :
I have a pandas DataFrame, df_test
. It contains a column ‘size’ which represents size in bytes. I’ve calculated KB, MB, and GB using the following code:
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')
df_test
dir size size_kb size_mb size_gb
0 /Users/uname1 994933 971.6 KB 0.9 MB 0.0 GB
1 /Users/uname2 109338711 106,776.1 KB 104.3 MB 0.1 GB
[2 rows x 5 columns]
I’ve run this over 120,000 rows and time it takes about 2.97 seconds per column * 3 = ~9 seconds according to %timeit.
Is there anyway I can make this faster? For example, can I instead of returning one column at a time from apply and running it 3 times, can I return all three columns in one pass to insert back into the original dataframe?
The other questions I’ve found all want to take multiple values and return a single value. I want to take a single value and return multiple columns.
Answer #1:
This is an old question, but for completeness, you can return a Series from the applied function that contains the new data, preventing the need to iterate three times. Passing axis=1
to the apply function applies the function sizes
to each row of the dataframe, returning a series to add to a new dataframe. This series, s, contains the new values, as well as the original data.
def sizes(s):
s['size_kb'] = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
s['size_mb'] = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
s['size_gb'] = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return s
df_test = df_test.append(rows_list)
df_test = df_test.apply(sizes, axis=1)
Answer #2:
Use apply and zip will 3 times fast than Series way.
def sizes(s):
return locale.format("%.1f", s / 1024.0, grouping=True) + ' KB',
locale.format("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB',
locale.format("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'
df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))
Test result are:
Separate df.apply():
100 loops, best of 3: 1.43 ms per loop
Return Series:
100 loops, best of 3: 2.61 ms per loop
Return tuple:
1000 loops, best of 3: 819 µs per loop
Answer #3:
Some of the current replies work fine, but I want to offer another, maybe more “pandifyed” option. This works for me with the current pandas 0.23 (not sure if it will work in previous versions):
import pandas as pd
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
def sizes(s):
a = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")
Notice that the trick is on the result_type
parameter of apply
, that will expand its result into a DataFrame
that can be directly assign to new/old columns.
Answer #4:
Just another readable way. This code will add three new columns and its values, returning series without use parameters in the apply function.
def sizes(s):
val_kb = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
val_mb = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
val_gb = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return pd.Series([val_kb,val_mb,val_gb],index=['size_kb','size_mb','size_gb'])
df[['size_kb','size_mb','size_gb']] = df.apply(lambda x: sizes(x) , axis=1)
A general example from: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
#foo bar
#0 1 2
#1 1 2
#2 1 2
Answer #5:
The performance between the top answers is significantly varied, and Jesse & famaral42 have already discussed this, but it is worth sharing a fair comparison between the top answers, and elaborating on a subtle but important detail of Jesse’s answer: the argument passed in to the function, also affects performance.
(Python 3.7.4, Pandas 1.0.3)
import pandas as pd
import locale
import timeit
def create_new_df_test():
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
return df_test
def sizes_pass_series_return_series(series):
series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return series
def sizes_pass_series_return_tuple(series):
a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
def sizes_pass_value_return_tuple(value):
a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
Here are the results:
# 1 - Accepted (Nels11 Answer) - (pass series, return series):
9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Notice how returning tuples is the fastest method, but what is passed in as an argument, also affects the performance. The difference in the code is subtle but the performance improvement is significant.
Test #4 (passing in a single value) is twice as fast as test #3 (passing in a series), even though the operation performed is ostensibly identical.
But there’s more…
# 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.
Here is the code for running the tests:
# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
Answer #6:
Really cool answers! Thanks Jesse and jaumebonet! Just some observation in regards to:
zip(* ...
... result_type="expand")
Although expand is kind of more elegant (pandifyed), zip is at least **2x faster. On this simple example bellow, I got 4x faster.
import pandas as pd
dat = [ [i, 10*i] for i in range(1000)]
df = pd.DataFrame(dat, columns = ["a","b"])
def add_and_sub(row):
add = row["a"] + row["b"]
sub = row["a"] - row["b"]
return add, sub
df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))
Answer #7:
I believe the 1.1 version breaks the behavior suggested in the top answer here.
import pandas as pd
def test_func(row):
row['c'] = str(row['a']) + str(row['b'])
row['d'] = row['a'] + 1
return row
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)
The above code ran on pandas 1.1.0 returns:
a b c d
0 1 i 1i 2
1 1 i 1i 2
2 1 i 1i 2
While in pandas 1.0.5 it returned:
a b c d
0 1 i 1i 2
1 2 j 2j 3
2 3 k 3k 4
Which I think is what you’d expect.
Not sure how the release notes explain this behavior, however as explained here avoiding mutation of the original rows by copying them resurrects the old behavior. i.e.:
def test_func(row):
row = row.copy() # <---- Avoid mutating the original reference
row['c'] = str(row['a']) + str(row['b'])
row['d'] = row['a'] + 1
return row
Answer #8:
Generally, to return multiple values, this is what I do
def gimmeMultiple(group):
x1 = 1
x2 = 2
return array([[1, 2]])
def gimmeMultipleDf(group):
x1 = 1
x2 = 2
return pd.DataFrame(array([[1,2]]), columns=['x1', 'x2'])
df['size'].astype(int).apply(gimmeMultiple)
df['size'].astype(int).apply(gimmeMultipleDf)
Returning a dataframe definitively has its perks, but sometimes not required. You can look at what the apply()
returns and play a bit with the functions 😉