Solving problem is about exposing yourself to as many situations as possible like pandas three-way joining multiple dataframes on columns and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about pandas three-way joining multiple dataframes on columns, which can be followed any time. Take easy to follow this discuss.
I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.
How can I “join” together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person’s string name?
The join()
function in pandas specifies that I need a multiindex, but I’m confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
Answer #1:
Assumed imports:
import pandas as pd
John Galt’s answer is basically a reduce
operation. If I have more than a handful of dataframes, I’d put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, dfN]
Assuming they have some common column, like name
in your example, I’d do the following:
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
That way, your code should work with whatever number of dataframes you want to merge.
Edit August 1, 2016: For those using Python 3: reduce
has been moved into functools
. So to use this function, you’ll first need to import that module:
from functools import reduce
Answer #2:
You could try this if you have 3 dataframes
# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32'])
pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')
alternatively, as mentioned by cwharland
df1.merge(df2,on='name').merge(df3,on='name')
Answer #3:
This is an ideal situation for the join
method
The join
method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames)]
dfs[0].join(dfs[1:])
With @zero’s data, you could do this:
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32'])
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])
attr11 attr12 attr21 attr22 attr31 attr32
name
a 5 9 5 19 15 49
b 4 61 14 16 4 36
c 24 9 4 9 14 9
Answer #4:
This can also be done as follows for a list of dataframes df_list
:
df = df_list[0]
for df_ in df_list[1:]:
df = df.merge(df_, on='join_col_name')
or if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
df = df.merge(df_, on='join_col_name')
Answer #5:
In python
3.6.3 with pandas
0.22.0 you can also use concat
as long as you set as index the columns you want to use for the joining
pd.concat(
(iDF.set_index('name') for iDF in [df1, df2, df3]),
axis=1, join='inner'
).reset_index()
where df1
, df2
, and df3
are defined as in John Galt’s answer
import pandas as pd
df1 = pd.DataFrame(np.array([
['a', 5, 9],
['b', 4, 61],
['c', 24, 9]]),
columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
['a', 5, 19],
['b', 14, 16],
['c', 4, 9]]),
columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
['a', 15, 49],
['b', 4, 36],
['c', 14, 9]]),
columns=['name', 'attr31', 'attr32']
)
Answer #6:
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:
This is the function to merge a dict of data frames
def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
keys = dfDict.keys()
for i in range(len(keys)):
key = keys[i]
df0 = dfDict[key]
cols = list(df0.columns)
valueCols = list(filter(lambda x: x not in (onCols), cols))
df0 = df0[onCols + valueCols]
df0.columns = onCols + [(s + '_' + key) for s in valueCols]
if (i == 0):
outDf = df0
else:
outDf = pd.merge(outDf, df0, how=how, on=onCols)
if (naFill != None):
outDf = outDf.fillna(naFill)
return(outDf)
OK, lets generates data and test this:
def GenDf(size):
df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True),
'col1':np.random.uniform(low=0.0, high=100.0, size=size),
'col2':np.random.uniform(low=0.0, high=100.0, size=size)
})
df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
return(df)
size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
Answer #7:
Simple Solution:
If the column names are similar:
df1.merge(df2,on='col_name').merge(df3,on='col_name')
If the column names are different:
df1.merge(df2,left_on='col_name1', right_on='col_name2').merge(df3,left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1':'col_name'})
Answer #8:
One does not need a multiindex to perform join operations.
One just need to set correctly the index column on which to perform the join operations (which command df.set_index('Name')
for example)
The join
operation is by default performed on index.
In your case, you just have to specify that the Name
column corresponds to your index.
Below is an example
A tutorial may be useful.
# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 1) Select the index from column 'Name'
df1 = df1.set_index('Name')
# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')