Best way to join / merge by range in pandas

Question:

Best way to join / merge by range in pandas

I frequently use pandas for merges (joins) on a range condition.

For instance, given two dataframes:

A (A_id, A_value)

B (B_id, B_low, B_high, B_name)

which are both large and roughly the same size (say, 2M records each).

I would like to make an inner join between A and B such that A_value lies between B_low and B_high.

Using SQL syntax, that would be:

SELECT *
FROM A,B
WHERE A_value between B_low and B_high

and that would be really easy, short and efficient.

Meanwhile, the only loop-free way I have found in pandas is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out the unneeded rows. That sounds heavy and complex:

A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]

Another solution I had is to apply, for each value x of A, a search on B using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that sounds inefficient as well and might require index optimization.
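
For reference, a minimal sketch of that per-value lookup (assuming A and B have the columns described above; the match helper is purely illustrative, not an existing pandas function):

import pandas as pd

def match(x):
    # all rows of B whose [B_low, B_high] interval contains x
    return B[(x >= B.B_low) & (x <= B.B_high)]

pieces = []
for _, row in A.iterrows():
    hits = match(row.A_value)
    if not hits.empty:
        # attach the A columns to every matching B row
        pieces.append(hits.assign(A_id=row.A_id, A_value=row.A_value))

Result = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()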

Is there a more elegant and/or efficient way to perform this action?

Asked By: Dimgold

Answer #1:

Setup
Consider the dataframes A and B

import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
        A_id=range(10),
        A_value=range(5, 105, 10)
    ))
B = pd.DataFrame(dict(
        B_id=range(5),
        B_low=[0, 30, 30, 46, 84],
        B_high=[10, 40, 50, 54, 84]
    ))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84

numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30

To address the comments and give something akin to a left join, I appended the part of A that doesn’t match.

pd.concat(
    [
        pd.DataFrame(
            np.column_stack([A.values[i], B.values[j]]),
            columns=A.columns.append(B.columns)
        ),
        A[~np.isin(np.arange(len(A)), np.unique(i))]
    ],
    ignore_index=True, sort=False
)

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
Answered By: piRSquared

Answer #2:

I'm not sure it is more efficient, but you can use SQL directly with pandas (via the sqlite3 module, for instance), inspired by this question:

import sqlite3
import numpy as np
import pandas as pd

# load both dataframes into an in-memory SQLite database, then query them
conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 AND df1.col1 < 0.5"
tt = pd.read_sql_query(qry, conn)

You can adapt the query as needed in your application.
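
Applied to the original question, the same idea might look roughly like this (a sketch assuming A and B are the dataframes described in the question):

import sqlite3
import pandas as pd

# load A and B into an in-memory SQLite database and let SQL do the range join
conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)
qry = "SELECT * FROM A, B WHERE A.A_value BETWEEN B.B_low AND B.B_high"
Result = pd.read_sql_query(qry, conn)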

Answered By: Adonis

Answer #3:

I don’t know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That’s called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.
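
For example, the range join from the question might be expressed along these lines (a sketch assuming pandasql is installed and that A and B are the dataframes from the question):

from pandasql import sqldf

# express the range join in SQL and let pandasql run it over the
# in-scope dataframes A and B
qry = """
    SELECT *
    FROM A JOIN B
      ON A.A_value BETWEEN B.B_low AND B.B_high
"""
Result = sqldf(qry, locals())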

Answered By: baloo

Answer #4:

Let's take a simple example:

df=pd.DataFrame([2,3,4,5,6],columns=['A'])

returns

    A
0   2
1   3
2   4
3   5
4   6

Now let's define a second dataframe:

df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]

results in

    B_low   B_high
0   1       2
1   6       8
2   2       4
3   3       6
4   5       6

Here we go: we want the output to be index 3 with A value 5.

df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()

results in

    A
3   5.0
Answered By: suvy

Answer #5:

Consider that your A dataframe is

A = pd.DataFrame([[0,2],[1,3],[2,4],[3,5],[4,6]],columns=['A_id', 'A_value'])

and B dataframe is

B = pd.DataFrame([[0,1,2,'a'],[1,4,9,'b'],[2,2,5,'c'],[3,6,7,'d'],[4,8,9,'e']],columns=['B_id', 'B_low', 'B_high', 'B_name'])

Using the line below, you will get the desired output:

A = A[(A['A_value']>=B['B_low'])&(A['A_value']<=B['B_high'])]
Answered By: Akshay Kandul
