# How to join two dataframes for which column values are within a certain range?


Given two dataframes `df_1` and `df_2`, how can they be joined so that the datetime column `timestamp` of `df_1` falls between `start` and `end` in `df_2`?

``````
print(df_1)
  timestamp              A          B
0 2016-05-14 10:54:33    0.020228   0.026572
1 2016-05-14 10:54:34    0.057780   0.175499
2 2016-05-14 10:54:35    0.098808   0.620986
3 2016-05-14 10:54:36    0.158789   1.014819
4 2016-05-14 10:54:39    0.038129   2.384590
print(df_2)
  start                end                  event
0 2016-05-14 10:54:31  2016-05-14 10:54:33  E1
1 2016-05-14 10:54:34  2016-05-14 10:54:37  E2
2 2016-05-14 10:54:38  2016-05-14 10:54:42  E3
``````

Get the corresponding `event` where `df_1.timestamp` is between `df_2.start` and `df_2.end`:

``````
  timestamp              A          B          event
0 2016-05-14 10:54:33    0.020228   0.026572   E1
1 2016-05-14 10:54:34    0.057780   0.175499   E2
2 2016-05-14 10:54:35    0.098808   0.620986   E2
3 2016-05-14 10:54:36    0.158789   1.014819   E2
4 2016-05-14 10:54:39    0.038129   2.384590   E3
``````
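To try the answers below, the two frames need to exist with real datetime columns. A minimal sketch that rebuilds the sample data from the printed output above (the values are copied from the question; only the construction code is new):

```python
import pandas as pd

# Sample data copied from the question; the datetime columns are parsed
# with pd.to_datetime so that range comparisons work on real timestamps.
df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:34",
        "2016-05-14 10:54:35",
        "2016-05-14 10:54:36",
        "2016-05-14 10:54:39",
    ]),
    "A": [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    "B": [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})

df_2 = pd.DataFrame({
    "start": pd.to_datetime([
        "2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime([
        "2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})
```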

One simple solution is to create an `IntervalIndex` from `start` and `end` with `closed='both'`, then use `get_loc` to look up the event for each timestamp (this assumes all the datetime columns already have a datetime dtype).

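``````
import pandas as pd
# Index df_2 by the closed intervals [start, end]; get_loc then finds the interval containing x.
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])
``````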

Output:

```
            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3
```

First use `IntervalIndex` to create a reference index from the intervals of interest, then use `get_indexer` to find, for each timestamp, the position of the row in `df_2` whose interval contains it.

``````
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
# get_indexer returns integer positions: select those rows with iloc, then the 'event' column
event = df_2.iloc[idx.get_indexer(df_1['timestamp'])]['event']
event
# 0    E1
# 1    E2
# 1    E2
# 1    E2
# 2    E3
# Name: event, dtype: object
df_1['event'] = event.to_numpy()
df_1
#             timestamp         A         B event
# 0 2016-05-14 10:54:33  0.020228  0.026572    E1
# 1 2016-05-14 10:54:34  0.057780  0.175499    E2
# 2 2016-05-14 10:54:35  0.098808  0.620986    E2
# 3 2016-05-14 10:54:36  0.158789  1.014819    E2
# 4 2016-05-14 10:54:39  0.038129  2.384590    E3
``````
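One thing to watch with `get_indexer`: it returns `-1` for a value that no interval contains, and indexing by position with `-1` silently picks the last row. The sketch below masks those positions instead. The data is hypothetical (the `10:54:50` timestamp was added to fall outside every interval), and it assumes the intervals do not overlap, which `get_indexer` requires.

```python
import pandas as pd

# Hypothetical data: 10:54:50 is not covered by any interval in df_2.
df_1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:50"])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37"]),
    "event": ["E1", "E2"],
})

idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
pos = idx.get_indexer(df_1["timestamp"])  # -1 means "no interval contains this value"

# Positional lookup; a -1 would select the last event, so mask those rows afterwards.
events = df_2["event"].to_numpy()[pos]
df_1["event"] = pd.Series(events, index=df_1.index).where(pos >= 0)
```

Unmatched timestamps end up with a missing value in `event` rather than a wrong label.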

You can also use the `pandasql` module and express the range condition as a SQL join:

``````
import pandasql as ps

sqlcode = '''
select df_1.timestamp
,df_1.A
,df_1.B
,df_2.event
from df_1
inner join df_2
on df_1.timestamp between df_2.start and df_2.end
'''
newdf = ps.sqldf(sqlcode, locals())
``````
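If `pandasql` is not installed, the same `between` join can be sketched with the standard-library `sqlite3` module together with pandas' `to_sql` and `read_sql_query`. This is a swapped-in technique, not what the answer above uses. It assumes the timestamps are stored as text in one consistent `YYYY-MM-DD HH:MM:SS` format, so that string comparison matches chronological order; the data is a shortened version of the question's sample.

```python
import sqlite3

import pandas as pd

# Timestamps kept as ISO-formatted text (an assumption of this sketch) so that
# SQLite's string comparison in BETWEEN agrees with chronological order.
df_1 = pd.DataFrame({
    "timestamp": ["2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:39"],
    "A": [0.020228, 0.098808, 0.038129],
})
df_2 = pd.DataFrame({
    "start": ["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"],
    "end": ["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"],
    "event": ["E1", "E2", "E3"],
})

con = sqlite3.connect(":memory:")
try:
    # Copy both frames into temporary in-memory tables.
    df_1.to_sql("df_1", con, index=False)
    df_2.to_sql("df_2", con, index=False)
    query = '''
    select df_1.timestamp, df_1.A, df_2.event
    from df_1
    inner join df_2
      on df_1.timestamp between df_2.start and df_2."end"
    order by df_1.timestamp
    '''
    newdf = pd.read_sql_query(query, con)
finally:
    con.close()
```

Because this is an inner join, timestamps that match no interval are dropped from `newdf`, and a timestamp that falls inside several intervals appears once per match.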

Option 1

``````
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_2.index = idx
df_1['event'] = df_2.loc[df_1.timestamp, 'event'].values
``````

Option 2

``````
df_2['timestamp'] = df_2['end']
pd.merge_asof(df_1, df_2[['timestamp', 'event']], on='timestamp', direction='forward', allow_exact_matches=True)
#             timestamp         A         B event
# 0 2016-05-14 10:54:33  0.020228  0.026572    E1
# 1 2016-05-14 10:54:34  0.057780  0.175499    E2
# 2 2016-05-14 10:54:35  0.098808  0.620986    E2
# 3 2016-05-14 10:54:36  0.158789  1.014819    E2
# 4 2016-05-14 10:54:39  0.038129  2.384590    E3
``````
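Note that a forward `merge_asof` on `end` only finds the next interval end at or after each timestamp; it never looks at `start`. When there are gaps between intervals, a timestamp in a gap is attached to the following event. A possible refinement, sketched here on hypothetical data in which `E2` ends at `10:54:36` so that `10:54:37` falls in a gap, is to carry `start` through the merge and blank out rows that precede it:

```python
import pandas as pd

# Hypothetical data: E2 ends at 10:54:36 and E3 starts at 10:54:38,
# so 10:54:37 lies in a gap between the two intervals.
df_1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:37"])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime([
        "2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime([
        "2016-05-14 10:54:33", "2016-05-14 10:54:36", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# merge_asof requires both frames to be sorted on their join keys.
merged = pd.merge_asof(
    df_1.sort_values("timestamp"),
    df_2.sort_values("end"),
    left_on="timestamp",
    right_on="end",
    direction="forward",
)

# Keep the event only when the timestamp is not before the interval's start.
in_range = merged["timestamp"] >= merged["start"]
merged["event"] = merged["event"].where(in_range)
result = merged[["timestamp", "event"]]
```

Without the `in_range` mask, the `10:54:37` row would be labelled `E3`.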

This method assumes that the datetime columns hold `Timestamp` objects.

``````
import numpy as np

# df2:
#   start                end                  event
# 0 2016-05-14 10:54:31  2016-05-14 10:54:33  E1
# 1 2016-05-14 10:54:34  2016-05-14 10:54:37  E2
# 2 2016-05-14 10:54:38  2016-05-14 10:54:42  E3
event_num = len(df2.event)

def get_event(t):
    # The boolean mask marks the interval containing t; the dot product turns it into an index.
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]

df1["event"] = df1.timestamp.transform(get_event)
``````

Explanation of `get_event`

For each timestamp in `df1`, say `t0 = 2016-05-14 10:54:33`,

`(t0 >= df2.start) & (t0 <= df2.end)` contains exactly one `True` (see Example 1), as long as the intervals do not overlap and `t0` falls inside one of them. Taking the dot product with `np.arange(event_num)` then gives the index of the event that `t0` belongs to.

Examples:

Example 1

``````
    t0 >= df2.start    t0 <= df2.end     After &     np.arange(3)
0     True                True         ->  T              0        event_idx
1    False                True         ->  F              1     ->     0
2    False                True         ->  F              2
``````

As another example, take `t2 = 2016-05-14 10:54:35`:

``````
    t2 >= df2.start    t2 <= df2.end     After &     np.arange(3)
0     True                False        ->  F              0        event_idx
1     True                True         ->  T              1     ->     1
2    False                True         ->  F              2
``````

Finally, `transform` applies `get_event` to each timestamp, mapping it to its event.
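The same comparison can be done for all timestamps at once with NumPy broadcasting instead of calling `get_event` per row. The sketch below is one way to do it, not part of the original answer. It also records which rows matched at all, since a dot product of an all-`False` mask is `0` and would otherwise point at the first event. The data is hypothetical (`10:54:50` matches nothing), and the comparison matrix needs memory proportional to `len(df1) * len(df2)`.

```python
import pandas as pd

# Hypothetical data: 10:54:50 is outside every interval in df2.
df1 = pd.DataFrame({"timestamp": pd.to_datetime([
    "2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:50"])})
df2 = pd.DataFrame({
    "start": pd.to_datetime([
        "2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime([
        "2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

t = df1["timestamp"].to_numpy()[:, None]   # shape (n_timestamps, 1)
start = df2["start"].to_numpy()[None, :]   # shape (1, n_events)
end = df2["end"].to_numpy()[None, :]

# inside[i, j] is True when timestamp i lies in interval j.
inside = (t >= start) & (t <= end)
event_idx = inside.argmax(axis=1)          # position of the first True in each row
matched = inside.any(axis=1)               # False where no interval contains the timestamp

events = df2["event"].to_numpy()[event_idx]
df1["event"] = pd.Series(events, index=df1.index).where(matched)
```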

The answers/resolutions are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.