Question:
I'm looking for the fastest and most idiomatic analog of the SQL MINUS (a.k.a. EXCEPT) operator.
Here is what I mean – given two pandas DataFrames as follows:
In [77]: d1
Out[77]:
   a  b  c
0  0  0  1
1  0  1  2
2  1  0  3
3  1  1  4
4  0  0  5
5  1  1  6
6  2  2  7
In [78]: d2
Out[78]:
   a  b   c
0  1  1  10
1  0  0  11
2  1  1  12
How can I find the result of d1 MINUS d2, taking into account only columns "a" and "b", in order to get the following result:
In [62]: res
Out[62]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
MVCE:
import pandas as pd

d1 = pd.DataFrame({
    'a': [0, 0, 1, 1, 0, 1, 2],
    'b': [0, 1, 0, 1, 0, 1, 2],
    'c': [1, 2, 3, 4, 5, 6, 7]
})
d2 = pd.DataFrame({
    'a': [1, 0, 1],
    'b': [1, 0, 1],
    'c': [10, 11, 12]
})
What have I tried:
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
In [68]: res
Out[68]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
It gives me the correct result, but I have a feeling that there must be a more idiomatic and cleaner way to achieve this.
PS The DataFrame.isin() method won't help in this case, as it will produce a wrong result set.
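For illustration, a minimal sketch of how a plain isin() call goes wrong here (passing the d2 columns as a dict is my assumption about how it would be used):

# isin() checks membership per column independently, not per (a, b) row pair,
# so e.g. (a=0, b=1) is flagged even though that pair never occurs in d2.
mask = d1[['a', 'b']].isin(d2[['a', 'b']].to_dict('list')).all(axis=1)
d1[~mask]   # keeps only row 6; rows 1 and 2 are wrongly dropped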
Answer #1:
We can use pandas.concat with drop_duplicates here, passing keep=False so that every duplicated ("a", "b") pair is dropped entirely:
pd.concat([d1, d2]).drop_duplicates(['a', 'b'], keep=False)
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
Edit after comment by OP:
If you want to make sure that rows unique to d2 (pairs that never occur in d1) aren't taken into account, we can duplicate that DataFrame first:
pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
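To illustrate why the duplication matters, suppose d2 also contained a pair that never occurs in d1 (the row (3, 3, 13) below is made up purely for illustration):

d2_extra = pd.concat([d2, pd.DataFrame({'a': [3], 'b': [3], 'c': [13]})], ignore_index=True)

# Without doubling, the d2-only pair (3, 3) is unique in the concatenation,
# so keep=False does not drop it and it leaks into the result.
pd.concat([d1, d2_extra]).drop_duplicates(['a', 'b'], keep=False)

# Doubling d2 guarantees every d2 pair occurs at least twice, so it gets removed.
pd.concat([d1, pd.concat([d2_extra] * 2)]).drop_duplicates(['a', 'b'], keep=False)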
Answer #2:
Execution time comparison for larger data sets:
In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)
In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)
In [102]: df1.shape
Out[102]: (700000, 3)
In [103]: df2.shape
Out[103]: (300000, 3)
pd.concat().drop_duplicates() approach:
In [10]: %%timeit
...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
...:
...:
2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
multi-index NOT IS IN approach:
In [11]: %%timeit
...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
...:
...:
484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
multi-index difference approach:
In [12]: %%timeit
...: tmp1 = df1.reset_index().set_index(["a", "b"])
...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
...: res = df1.loc[tmp1.loc[idx, "index"]]
...:
...:
1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
merge(how="outer") approach – gives me a MemoryError:
In [106]: %%timeit
...: res = (df1.reset_index()
...: .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
...: .query('_merge == "left_only"')
...: .set_index('index')
...: .rename_axis(None)
...: .reindex(df1.columns, axis=1))
...:
...:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
compare concatenated strings approach:
In [13]: %%timeit
...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
...:
...:
2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answer #3:
I am thinking a little bit like Excel here – concatenate the key columns into a single string and filter on it:
d1[~d1[['a','b']].astype(str).sum(axis=1).isin(d2[['a','b']].astype(str).sum(axis=1))]
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
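One caveat with this trick: summing the stringified columns concatenates them with no separator, so distinct pairs such as (1, 12) and (11, 2) would both map to the key "112". A small variation with an explicit delimiter (the "|" separator is my own addition) avoids such collisions:

key1 = d1['a'].astype(str) + '|' + d1['b'].astype(str)
key2 = d2['a'].astype(str) + '|' + d2['b'].astype(str)
d1[~key1.isin(key2)]   # same result, but safe against key collisions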
Answer #4:
One possible solution with merge and indicator=True:
df = (d1.reset_index()
        .merge(d2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
        .query('_merge == "left_only"')
        .set_index('index')
        .rename_axis(None)
        .reindex(d1.columns, axis=1))
print (df)
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
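To see what indicator=True contributes, it helps to peek at the intermediate merge result before the query() step. The _merge column marks every row as "both", "left_only" or "right_only", and the "left_only" rows are exactly the MINUS result (this peek is only for illustration):

m = (d1.reset_index()
       .merge(d2, on=['a','b'], indicator=True, how='outer', suffixes=('','_')))
print(m[['index', 'a', 'b', '_merge']])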
Solution with isin:
df = d1[~d1.set_index(["a", "b"]).index.isin(d2.set_index(["a","b"]).index)]
print (df)
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
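Unlike the plain column-wise isin() mentioned in the question's PS, this index-based variant compares whole (a, b) tuples, which is why it returns the correct rows. An equivalent sketch using pd.MultiIndex.from_frame (assuming pandas 0.24 or newer, where from_frame is available):

left = pd.MultiIndex.from_frame(d1[['a', 'b']])
right = pd.MultiIndex.from_frame(d2[['a', 'b']])
d1[~left.isin(right)]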
Answer #5:
I had a similar question and tried your idea:

In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]

It works on the test data. However, when I applied the same approach to two SQLite databases with identical structure (the same tables and the same columns), I ran into errors indicating that the two DataFrames don't have the same shape.
If you're willing to give me a hand, I can share more details. Thanks a lot!