Pandas: how to merge two dataframes on a column by keeping the information of the first one?

Posted on

Solving problem is about exposing yourself to as many situations as possible like Pandas: how to merge two dataframes on a column by keeping the information of the first one? and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about Pandas: how to merge two dataframes on a column by keeping the information of the first one?, which can be followed any time. Take easy to follow this discuss.

Pandas: how to merge two dataframes on a column by keeping the information of the first one?

I have two dataframes df1 and df2. df1 contains the information of the age of people, while df2 contains the information of the sex of people. Not all the people are in df1 nor in df2

df1
     Name   Age
0     Tom    34
1     Sara   18
2     Eva    44
3     Jack   27
4     Laura  30
df2
     Name      Sex
0     Tom       M
1     Paul      M
2     Eva       F
3     Jack      M
4     Michelle  F

I want to have the information of the sex of the people in df1 and setting NaN if I do not have this information in df2. I tried to do df1 = pd.merge(df1, df2, on = 'Name', how = 'outer') but I keep the information of some people in df2 that I don’t want.

df1
     Name   Age     Sex
0     Tom    34      M
1     Sara   18     NaN
2     Eva    44      F
3     Jack   27      M
4     Laura  30     NaN
Asked By: emax

||

Answer #1:

Sample:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
#print (df1)
df3 = df1.copy()
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})
#print (df2)

Use map by Series created by set_index:

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print (df1)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

Alternative solution with merge with left join:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
print (df)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

If need map by multiple columns (e.g. Year and Code) need merge with left join:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Year':[2000,2003,2003,2004,2007],
                    'Code':[1,2,3,4,4],
                    'Age': [34, 18, 44, 27, 30]})
print (df1)
    Name  Year  Code  Age
0    Tom  2000     1   34
1   Sara  2003     2   18
2    Eva  2003     3   44
3   Jack  2004     4   27
4  Laura  2007     4   30
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year':[2001,2003,2003,2004,2007],
                    'Code':[1,2,3,5,3],
                    'Val':[21,34,23,44,67]})
print (df2)
       Name Sex  Year  Code  Val
0       Tom   M  2001     1   21
1      Paul   M  2003     2   34
2       Eva   F  2003     3   23
3      Jack   M  2004     5   44
4  Michelle   F  2007     3   67
#merge by all columns
df = df1.merge(df2, on=['Year','Code'], how='left')
print (df)
  Name_x  Year  Code  Age Name_y  Sex   Val
0    Tom  2000     1   34    NaN  NaN   NaN
1   Sara  2003     2   18   Paul    M  34.0
2    Eva  2003     3   44    Eva    F  23.0
3   Jack  2004     4   27    NaN  NaN   NaN
4  Laura  2007     4   30    NaN  NaN   NaN
#specified columns - columns for join (Year, Code) need always + appended columns (Val)
df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
print (df)
    Name  Year  Code  Age   Val
0    Tom  2000     1   34   NaN
1   Sara  2003     2   18  34.0
2    Eva  2003     3   44  23.0
3   Jack  2004     4   27   NaN
4  Laura  2007     4   30   NaN

If get error with map it means duplicates by columns of join, here Name:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
print (df1)
    Name  Age
0    Tom   34
1   Sara   18
2    Eva   44
3   Jack   27
4  Laura   30
df3, df4 = df1.copy(), df1.copy()
df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1,2,3,4,5]})
print (df2)
       Name  Val
0       Tom    1 <-duplicated name Tom
1       Tom    2 <-duplicated name Tom
2       Eva    3
3      Jack    4
4  Michelle    5
s = df2.set_index('Name')['Val']
df1['New'] = df1['Name'].map(s)
print (df1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Solutions are removed duplicates by DataFrame.drop_duplicates, or use map by dict for last dupe match:

#default keep first value
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print (s)
Name
Tom         1
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64
df1['New'] = df1['Name'].map(s)
print (df1)
    Name  Age  New
0    Tom   34  1.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
#add parameter for keep last value 
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print (s)
Name
Tom         2
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64
df3['New'] = df3['Name'].map(s)
print (df3)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
#map by dictionary
d = dict(zip(df2['Name'], df2['Val']))
print (d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}
df4['New'] = df4['Name'].map(d)
print (df4)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN
Answered By: jezrael

Answer #2:

You can also use the join method:

df1.set_index("Name").join(df2.set_index("Name"), how="left")

edit: added set_index("Name")

Answered By: Xiaoyu Lu
The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .

Leave a Reply

Your email address will not be published. Required fields are marked *