# Performant cartesian product (CROSS JOIN) with pandas

Posted on

Solving problem is about exposing yourself to as many situations as possible like Performant cartesian product (CROSS JOIN) with pandas and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about Performant cartesian product (CROSS JOIN) with pandas, which can be followed any time. Take easy to follow this discuss.

Performant cartesian product (CROSS JOIN) with pandas

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

``````left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})
left
col1  col2
0    A     1
1    B     2
2    C     3
right
col1  col2
0    X    20
1    Y    30
2    Z    50
``````

The cross product of these frames can be computed, and will look something like:

``````A       1      X      20
A       1      Y      30
A       1      Z      50
B       2      X      20
B       2      Y      30
B       2      Z      50
C       3      X      20
C       3      Y      30
C       3      Z      50
``````

What is the most performant method of computing this result?

Let’s start by establishing a benchmark. The easiest method for solving this is using a temporary “key” column:

``````# pandas <= 1.1.X
def cartesian_product_basic(left, right):
return (
left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))
cartesian_product_basic(left, right)
``````
``````# pandas >= 1.2 (est)
left.merge(right, how="cross")
``````
``````  col1_x  col2_x col1_y  col2_y
0      A       1      X      20
1      A       1      Y      30
2      A       1      Z      50
3      B       2      X      20
4      B       2      Y      30
5      B       2      Z      50
6      C       3      X      20
7      C       3      Y      30
8      C       3      Z      50
``````

How this works is that both DataFrames are assigned a temporary “key” column with the same value (say, 1). `merge` then performs a many-to-many JOIN on “key”.

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle’s first implementation.

``````def cartesian_product(*arrays):
la = len(arrays)
dtype = np.result_type(*arrays)
arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
for i, a in enumerate(np.ix_(*arrays)):
arr[...,i] = a
return arr.reshape(-1, la)
``````

### Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames’ numeric indices using the aforementioned `cartesian_product`, use this to reindex the DataFrames, and

``````def cartesian_product_generalized(left, right):
la, lb = len(left), len(right)
idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
return pd.DataFrame(
np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))
cartesian_product_generalized(left, right)
0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left, right))
True
``````

And, along similar lines,

``````left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
col1  col2
s1    A     1
s2    B     2
s1    C     3
right2
col1  col2
x    X    20
y    Y    30
y    Z    50
np.array_equal(cartesian_product_generalized(left, right),
cartesian_product_basic(left2, right2))
True
``````

This solution can generalise to multiple DataFrames. For example,

``````def cartesian_product_multi(*dfs):
idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
return pd.DataFrame(
np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))
0  1  2   3  4  5
0  A  1  X  20  A  1
1  A  1  X  20  B  2
2  A  1  X  20  C  3
3  A  1  X  20  D  4
4  A  1  Y  30  A  1
``````

### Further Simplification

A simpler solution not involving @senderle’s `cartesian_product` is possible when dealing with just two DataFrames. Using `np.broadcast_arrays`, we can achieve almost the same level of performance.

``````def cartesian_product_simplified(left, right):
la, lb = len(left), len(right)
return pd.DataFrame(
np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
cartesian_product_basic(left2, right2))
True
``````

### Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

Do note that timings may vary based on your setup, data, and choice of `cartesian_product` helper function as applicable.

Performance Benchmarking Code
This is the timing script. All functions called here are defined above.

``````from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['cartesian_product_basic', 'cartesian_product_generalized',
'cartesian_product_multi', 'cartesian_product_simplified'],
columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
dtype=float
)
for f in res.index:
for c in res.columns:
# print(f,c)
left2 = pd.concat([left] * c, ignore_index=True)
right2 = pd.concat([right] * c, ignore_index=True)
stmt = '{}(left2, right2)'.format(f)
setp = 'from __main__ import left2, right2, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=5)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
``````

* you are here

Using `itertools` `product` and recreate the value in dataframe

``````import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,[]),l)))
0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50
``````

Here’s an approach with triple `concat`

``````m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
pd.concat([right]*len(left)).reset_index(drop=True) ], 1)
col1  col2 col1  col2
0     A     1    X    20
1     A     1    Y    30
2     A     1    Z    50
3     B     2    X    20
4     B     2    Y    30
5     B     2    Z    50
6     C     3    X    20
7     C     3    Y    30
8     C     3    Z    50
``````

The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .