Question :
I have the following dataframes:
> df1
id begin conditional confidence discoveryTechnique
0 278 56 false 0.0 1
1 421 18 false 0.0 1
> df2
concept
0 A
1 B
How do I merge on the indices to get:
id begin conditional confidence discoveryTechnique concept
0 278 56 false 0.0 1 A
1 421 18 false 0.0 1 B
I ask because it is my understanding that merge()
i.e. df1.merge(df2)
uses columns to do the matching. In fact, doing this I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4618, in merge
copy=copy, indicator=indicator)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 58, in merge
copy=copy, indicator=indicator)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 491, in __init__
self._validate_specification()
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 812, in _validate_specification
raise MergeError('No common columns to perform merge on')
pandas.tools.merge.MergeError: No common columns to perform merge on
Is it bad practice to merge on index? Is it impossible? If so, how can I shift the index into a new column called “index”?
Answer #1:
Use merge
, which is inner join by default:
pd.merge(df1, df2, left_index=True, right_index=True)
Or join
, which is left join by default:
df1.join(df2)
Or concat
, which is outer join by default:
pd.concat([df1, df2], axis=1)
Samples:
df1 = pd.DataFrame({'a':range(6),
'b':[5,3,6,9,2,4]}, index=list('abcdef'))
print (df1)
a b
a 0 5
b 1 3
c 2 6
d 3 9
e 4 2
f 5 4
df2 = pd.DataFrame({'c':range(4),
'd':[10,20,30, 40]}, index=list('abhi'))
print (df2)
c d
a 0 10
b 1 20
h 2 30
i 3 40
#default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
a b c d
a 0 5 0 10
b 1 3 1 20
#default left join
df4 = df1.join(df2)
print (df4)
a b c d
a 0 5 0.0 10.0
b 1 3 1.0 20.0
c 2 6 NaN NaN
d 3 9 NaN NaN
e 4 2 NaN NaN
f 5 4 NaN NaN
#default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
a b c d
a 0.0 5.0 0.0 10.0
b 1.0 3.0 1.0 20.0
c 2.0 6.0 NaN NaN
d 3.0 9.0 NaN NaN
e 4.0 2.0 NaN NaN
f 5.0 4.0 NaN NaN
h NaN NaN 2.0 30.0
i NaN NaN 3.0 40.0
Answer #2:
you can use concat([df1, df2, …], axis=1) in order to concatenate two or more DFs aligned by indexes:
pd.concat([df1, df2, df3, ...], axis=1)
or merge for concatenating by custom fields / indexes:
# join by _common_ columns: `col1`, `col3`
pd.merge(df1, df2, on=['col1','col3'])
# join by: `df1.col1 == df2.index`
pd.merge(df1, df2, left_on='col1' right_index=True)
or join for joining by index:
df1.join(df2)
Answer #3:
By default:
join
is a column-wise left join
pd.merge
is a column-wise inner join
pd.concat
is a row-wise outer join
pd.concat
:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2]
)
Dimensions of DataFrame should match along axis
Join
and pd.merge
:
can take DataFrame arguments
Answer #4:
A silly bug that got me: the joins failed because index dtypes
differed. This was not obvious as both tables were pivot tables of the same original table. After reset_index
, the indices looked identical in Jupyter. It only came to light when saving to Excel…
Fixed with: df1[['key']] = df1[['key']].apply(pd.to_numeric)
Hopefully this saves somebody an hour!
Answer #5:
If u want to join two dataframes in pandas you can simply use available attributes like merge
or concatenate
.
For example if I have two dataframes df1
and df2
I can join them by:
newdataframe=merge(df1,df2,left_index=True,right_index=True)
Answer #6:
This answer has been resolved for a while and all the available
options are already out there. However in this answer I’ll attempt to
shed a bit more light on these options to help you understand when to
use what.
This post will go through the following topics:
- Merging with index under different conditions
- options for index-based joins:
merge
,join
,concat
- merging on indexes
- merging on index of one, column of other
- options for index-based joins:
- effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use
case.
DataFrame.merge
withleft_index
andright_index
(orleft_on
andright_on
using named indexes)DataFrame.join
(joins on index)pd.concat
(joins on index)
PROS | CONS | |
---|---|---|
merge |
• supports inner/left/right/full |
• can only join two frames at a time |
join |
• supports inner/left (default)/right/full |
• only supports index-index joins |
concat |
• specializes in joining multiple DataFrames at a time |
• only supports inner/full (default) joins |
Index to index joins
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
Other types of joins (left, right, outer) follow similar syntax (and can be controlled using how=...
).
Notable Alternatives
-
DataFrame.join
defaults to a left outer join on the index.left.join(right, how='inner',)
If you happen to get
ValueError: columns overlap but no suffix specified
, you will need to specifylsuffix
andrsuffix=
arguments to resolve this. Since the column names are same, a differentiating suffix is required. -
pd.concat
joins on the index and can join two or more DataFrames at once. It does a full outer join by default.pd.concat([left, right], axis=1, sort=False)
For more information on
concat
, see this post.
Index to Column joins
To perform an inner join using index of left, column of right, you will use DataFrame.merge
a combination of left_index=True
and right_on=...
.
left.merge(right, left_index=True, right_on='key')
Other joins follow a similar structure. Note that only merge
can perform index to column joins. You can join on multiple levels/columns, provided the number of index levels on the left equals the number of columns on the right.
join
and concat
are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index
.
This post is an abridged version of my work in Pandas Merging 101. Please follow this link for more examples and other topics on merging.