Question :
I have a dataframe with a timeindex and 3 columns containing the coordinates of a 3D vector:
x y z
ts
2014-05-15 10:38 0.120117 0.987305 0.116211
2014-05-15 10:39 0.117188 0.984375 0.122070
2014-05-15 10:40 0.119141 0.987305 0.119141
2014-05-15 10:41 0.116211 0.984375 0.120117
2014-05-15 10:42 0.119141 0.983398 0.118164
I would like to apply a transformation to each row that also returns a vector
def myfunc(a, b, c):
do something
return e, f, g
but if I do:
df.apply(myfunc, axis=1)
I end up with a Pandas series whose elements are tuples. This is beacause apply will take the result of myfunc without unpacking it. How can I change myfunc so that I obtain a new df with 3 columns?
Edit:
All solutions below work. The Series solution does allow for column names, the List solution seem to execute faster.
def myfunc1(args):
e=args[0] + 2*args[1]
f=args[1]*args[2] +1
g=args[2] + args[0] * args[1]
return pd.Series([e,f,g], index=['a', 'b', 'c'])
def myfunc2(args):
e=args[0] + 2*args[1]
f=args[1]*args[2] +1
g=args[2] + args[0] * args[1]
return [e,f,g]
%timeit df.apply(myfunc1 ,axis=1)
100 loops, best of 3: 4.51 ms per loop
%timeit df.apply(myfunc2 ,axis=1)
100 loops, best of 3: 2.75 ms per loop
Answer #1:
Just return a list instead of tuple.
In [81]: df
Out[81]:
x y z
ts
2014-05-15 10:38:00 0.120117 0.987305 0.116211
2014-05-15 10:39:00 0.117188 0.984375 0.122070
2014-05-15 10:40:00 0.119141 0.987305 0.119141
2014-05-15 10:41:00 0.116211 0.984375 0.120117
2014-05-15 10:42:00 0.119141 0.983398 0.118164
[5 rows x 3 columns]
In [82]: def myfunc(args):
....: e=args[0] + 2*args[1]
....: f=args[1]*args[2] +1
....: g=args[2] + args[0] * args[1]
....: return [e,f,g]
....:
In [83]: df.apply(myfunc ,axis=1)
Out[83]:
x y z
ts
2014-05-15 10:38:00 2.094727 1.114736 0.234803
2014-05-15 10:39:00 2.085938 1.120163 0.237427
2014-05-15 10:40:00 2.093751 1.117629 0.236770
2014-05-15 10:41:00 2.084961 1.118240 0.234512
2014-05-15 10:42:00 2.085937 1.116202 0.235327
Answer #2:
Return Series
and it will put them in a DataFrame.
def myfunc(a, b, c):
do something
return pd.Series([e, f, g])
This has the bonus that you can give labels to each of the resulting columns. If you return a DataFrame it just inserts multiple rows for the group.
Answer #3:
Based on the excellent answer by @U2EF1, I’ve created a handy function that applies a specified function that returns tuples to a dataframe field, and expands the result back to the dataframe.
def apply_and_concat(dataframe, field, func, column_names):
return pd.concat((
dataframe,
dataframe[field].apply(
lambda cell: pd.Series(func(cell), index=column_names))), axis=1)
Usage:
df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['A'])
print df
A
a 1
b 2
c 3
def func(x):
return x*x, x*x*x
print apply_and_concat(df, 'A', func, ['x^2', 'x^3'])
A x^2 x^3
a 1 1 1
b 2 4 8
c 3 9 27
Hope it helps someone.
Answer #4:
Found a possible solution, by changing myfunc to return an np.array like this:
import numpy as np
def myfunc(a, b, c):
do something
return np.array((e, f, g))
any better solution?
Answer #5:
I’ve tried returning a tuple (I was using functions like scipy.stats.pearsonr
which return that kind of structures) but It returned a 1D Series instead of a Dataframe which was I expected. If I created a Series manually the performance was worse, so I fixed It using the result_type
as explained in the official API documentation:
Returning a Series inside the function is similar to passing
result_type=’expand’. The resulting column names will be the Series
index.
So you could edit your code this way:
def myfunc(a, b, c):
# do something
return (e, f, g)
df.apply(myfunc, axis=1, result_type='expand')
Answer #6:
Pandas 1.0.5 has DataFrame.apply with parameter result_type that can help here.
from the docs:
These only act when axis=1 (columns):
‘expand’ : list-like results will be turned into columns.
‘reduce’ : returns a Series if possible rather than expanding list-like results. This
is the opposite of ‘expand’.
‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the
original index and columns will be retained.
Answer #7:
Some of the other people’s answers contain mistakes, so I’ve summarized them below. The perfect answer is below.
Prepare the dataset. The version of pandas uses 1.1.5
.
import numpy as np
import pandas as pd
import timeit
# check pandas version
print(pd.__version__)
# 1.1.5
# prepare DataFrame
df = pd.DataFrame({
'x': [0.120117, 0.117188, 0.119141, 0.116211, 0.119141],
'y': [0.987305, 0.984375, 0.987305, 0.984375, 0.983398],
'z': [0.116211, 0.122070, 0.119141, 0.120117, 0.118164]},
index=[
'2014-05-15 10:38',
'2014-05-15 10:39',
'2014-05-15 10:40',
'2014-05-15 10:41',
'2014-05-15 10:42'],
columns=['x', 'y', 'z'])
df.index.name = 'ts'
# x y z
# ts
# 2014-05-15 10:38 0.120117 0.987305 0.116211
# 2014-05-15 10:39 0.117188 0.984375 0.122070
# 2014-05-15 10:40 0.119141 0.987305 0.119141
# 2014-05-15 10:41 0.116211 0.984375 0.120117
# 2014-05-15 10:42 0.119141 0.983398 0.118164
Solution 01.
Returns pd.Series
in the apply function.
def myfunc1(args):
e = args[0] + 2*args[1]
f = args[1]*args[2] + 1
g = args[2] + args[0] * args[1]
return pd.Series([e, f, g])
df[['e', 'f', 'g']] = df.apply(myfunc1, axis=1)
# x y z e f g
# ts
# 2014-05-15 10:38 0.120117 0.987305 0.116211 2.094727 1.114736 0.234803
# 2014-05-15 10:39 0.117188 0.984375 0.122070 2.085938 1.120163 0.237427
# 2014-05-15 10:40 0.119141 0.987305 0.119141 2.093751 1.117629 0.236770
# 2014-05-15 10:41 0.116211 0.984375 0.120117 2.084961 1.118240 0.234512
# 2014-05-15 10:42 0.119141 0.983398 0.118164 2.085937 1.116202 0.235327
t1 = timeit.timeit(
'df.apply(myfunc1, axis=1)',
globals=dict(df=df, myfunc1=myfunc1), number=10000)
print(round(t1, 3), 'seconds')
# 14.571 seconds
Solution 02.
Use result_type ='expand'
when applying.
def myfunc2(args):
e = args[0] + 2*args[1]
f = args[1]*args[2] + 1
g = args[2] + args[0] * args[1]
return [e, f, g]
df[['e', 'f', 'g']] = df.apply(myfunc2, axis=1, result_type='expand')
# x y z e f g
# ts
# 2014-05-15 10:38 0.120117 0.987305 0.116211 2.094727 1.114736 0.234803
# 2014-05-15 10:39 0.117188 0.984375 0.122070 2.085938 1.120163 0.237427
# 2014-05-15 10:40 0.119141 0.987305 0.119141 2.093751 1.117629 0.236770
# 2014-05-15 10:41 0.116211 0.984375 0.120117 2.084961 1.118240 0.234512
# 2014-05-15 10:42 0.119141 0.983398 0.118164 2.085937 1.116202 0.235327
t2 = timeit.timeit(
"df.apply(myfunc2, axis=1, result_type='expand')",
globals=dict(df=df, myfunc2=myfunc2), number=10000)
print(round(t2, 3), 'seconds')
# 9.907 seconds
Solution 03.
If you want to make it faster, use np.vectorize
. Note that args cannot be a single argument when using np.vectorize
.
def myfunc3(args0, args1, args2):
e = args0 + 2*args1
f = args1*args2 + 1
g = args2 + args0 * args1
return [e, f, g]
df[['e', 'f', 'g']] = pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)
# x y z e f g
# ts
# 2014-05-15 10:38 0.120117 0.987305 0.116211 2.094727 1.114736 0.234803
# 2014-05-15 10:39 0.117188 0.984375 0.122070 2.085938 1.120163 0.237427
# 2014-05-15 10:40 0.119141 0.987305 0.119141 2.093751 1.117629 0.236770
# 2014-05-15 10:41 0.116211 0.984375 0.120117 2.084961 1.118240 0.234512
# 2014-05-15 10:42 0.119141 0.983398 0.118164 2.085937 1.116202 0.235327
t3 = timeit.timeit(
"pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)",
globals=dict(pd=pd, np=np, df=df, myfunc3=myfunc3), number=10000)
print(round(t3, 3), 'seconds')
# 1.598 seconds