cartesian product in pandas

Posted on

Solving problem is about exposing yourself to as many situations as possible like cartesian product in pandas and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about cartesian product in pandas, which can be followed any time. Take easy to follow this discuss.

cartesian product in pandas

I have two pandas dataframes:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})

What is the best practice to get their cartesian product (of course without writing it explicitly like me)?

#df1, df2 cartesian product
df_cartesian = DataFrame({'col1':[1,2,1,2],'col2':[3,4,3,4],'col3':[5,5,6,6]})
Asked By: Idok

||

Answer #1:

If you have a key that is repeated for each row, then you can produce a cartesian product using merge (like you would in SQL).

from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})
merge(df1, df2,on='key')[['col1', 'col2', 'col3']]

Output:

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

See here for the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra

Answered By: Matti John

Answer #2:

Use pd.MultiIndex.from_product as an index in an otherwise empty dataframe, then reset its index, and you’re done.

a = [1, 2, 3]
b = ["a", "b", "c"]
index = pd.MultiIndex.from_product([a, b], names = ["a", "b"])
pd.DataFrame(index = index).reset_index()

out:

   a  b
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  2  c
6  3  a
7  3  b
8  3  c
Answered By: Gijs

Answer #3:

This won’t win a code golf competition, and borrows from the previous answers – but clearly shows how the key is added, and how the join works. This creates 2 new data frames from lists, then adds the key to do the cartesian product on.

My use case was that I needed a list of all store IDs on for each week in my list. So, I created a list of all the weeks I wanted to have, then a list of all the store IDs I wanted to map them against.

The merge I chose left, but would be semantically the same as inner in this setup. You can see this in the documentation on merging, which states it does a Cartesian product if key combination appears more than once in both tables – which is what we set up.

days = pd.DataFrame({'date':list_of_days})
stores = pd.DataFrame({'store_id':list_of_stores})
stores['key'] = 0
days['key'] = 0
days_and_stores = days.merge(stores, how='left', on = 'key')
days_and_stores.drop('key',1, inplace=True)
Answered By: Rob Guderian

Answer #4:

Minimal code needed for this one. Create a common ‘key’ to cartesian merge the two:

df1['key'] = 0
df2['key'] = 0
df_cartesian = df1.merge(df2, how='outer')
Answered By: A.Kot

Answer #5:

With method chaining:

product = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop("key", axis=1)
)
Answered By: pomber

Answer #6:

As an alternative, one can rely on the cartesian product provided by itertools: itertools.product, which avoids creating a temporary key or modifying the index:

import numpy as np
import pandas as pd
import itertools
def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())
    df = pd.DataFrame(left.append(right) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)

Quick test:

In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])
In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])
In [48]: cartesian(a,b)
Out[48]:
           a         b         c         d         e         f
0   0.436480  0.068491  0.260292  0.991311  0.064167  0.715142
1   0.436480  0.068491  0.260292  0.101777  0.840464  0.760616
2   0.436480  0.068491  0.260292  0.655391  0.289537  0.391893
3   0.436480  0.068491  0.260292  0.383729  0.061811  0.773627
4   0.436480  0.068491  0.260292  0.575711  0.995151  0.804567
5   0.469578  0.052932  0.633394  0.991311  0.064167  0.715142
6   0.469578  0.052932  0.633394  0.101777  0.840464  0.760616
7   0.469578  0.052932  0.633394  0.655391  0.289537  0.391893
8   0.469578  0.052932  0.633394  0.383729  0.061811  0.773627
9   0.469578  0.052932  0.633394  0.575711  0.995151  0.804567
10  0.466813  0.224062  0.218994  0.991311  0.064167  0.715142
11  0.466813  0.224062  0.218994  0.101777  0.840464  0.760616
12  0.466813  0.224062  0.218994  0.655391  0.289537  0.391893
13  0.466813  0.224062  0.218994  0.383729  0.061811  0.773627
14  0.466813  0.224062  0.218994  0.575711  0.995151  0.804567
15  0.831365  0.273890  0.130410  0.991311  0.064167  0.715142
16  0.831365  0.273890  0.130410  0.101777  0.840464  0.760616
17  0.831365  0.273890  0.130410  0.655391  0.289537  0.391893
18  0.831365  0.273890  0.130410  0.383729  0.061811  0.773627
19  0.831365  0.273890  0.130410  0.575711  0.995151  0.804567
20  0.447640  0.848283  0.627224  0.991311  0.064167  0.715142
21  0.447640  0.848283  0.627224  0.101777  0.840464  0.760616
22  0.447640  0.848283  0.627224  0.655391  0.289537  0.391893
23  0.447640  0.848283  0.627224  0.383729  0.061811  0.773627
24  0.447640  0.848283  0.627224  0.575711  0.995151  0.804567
Answered By: Svend

Answer #7:

Presenting to you

pandas >= 1.2

left.merge(right, how='cross')

import pandas as pd
pd.__version__
# '1.2.0'
left = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
right = pd.DataFrame({'col3': [5, 6]})
left.merge(right, how='cross')
   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

Indexes are ignored in the result.

Implementation wise, this uses the join on common key column method as described in the accepted answer. The upsides of using the API is that it saves you a lot of typing and handles some corner cases pretty well. I’d almost always recommend this syntax as my first preference for cartesian product in pandas unless you’re looking for something more performant.

Answered By: cs95

Answer #8:

If you have no overlapping columns, don’t want to add one, and the indices of the data frames can be discarded, this may be easier:

df1.index[:] = df2.index[:] = 0
df_cartesian = df1.join(df2, how='outer')
df_cartesian.index[:] = range(len(df_cartesian))
Answered By: sergeyk

Leave a Reply

Your email address will not be published.