How to join two dataframes for which column values are within a certain range?

Posted on

Solving problem is about exposing yourself to as many situations as possible like How to join two dataframes for which column values are within a certain range? and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about How to join two dataframes for which column values are within a certain range?, which can be followed any time. Take easy to follow this discuss.

How to join two dataframes for which column values are within a certain range?

Given two dataframes df_1 and df_2, how to join them such that datetime column df_1 is in between start and end in dataframe df_2:

print df_1
  timestamp              A          B
0 2016-05-14 10:54:33    0.020228   0.026572
1 2016-05-14 10:54:34    0.057780   0.175499
2 2016-05-14 10:54:35    0.098808   0.620986
3 2016-05-14 10:54:36    0.158789   1.014819
4 2016-05-14 10:54:39    0.038129   2.384590
print df_2
  start                end                  event
0 2016-05-14 10:54:31  2016-05-14 10:54:33  E1
1 2016-05-14 10:54:34  2016-05-14 10:54:37  E2
2 2016-05-14 10:54:38  2016-05-14 10:54:42  E3

Get corresponding event where df1.timestamp is between df_2.start and df2.end

  timestamp              A          B          event
0 2016-05-14 10:54:33    0.020228   0.026572   E1
1 2016-05-14 10:54:34    0.057780   0.175499   E2
2 2016-05-14 10:54:35    0.098808   0.620986   E2
3 2016-05-14 10:54:36    0.158789   1.014819   E2
4 2016-05-14 10:54:39    0.038129   2.384590   E3

Answer #1:

One simple solution is create interval index from start and end setting closed = both then use get_loc to get the event i.e (Hope all the date times are in timestamps dtype )

df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])

Output :

            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3
Answered By: Bharath

Answer #2:

First use IntervalIndex to create a reference index based on the interval of interest, then use get_indexer to slice the dataframe which contains the discrete events of interest.

idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
event = df_2.iloc[idx.get_indexer(df_1.timestamp), 'event']
event
0    E1
1    E2
1    E2
1    E2
2    E3
Name: event, dtype: object
df_1['event'] = event.to_numpy()
df_1
            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3

Reference: A question on IntervalIndex.get_indexer.

Answered By: cs95

Answer #3:

You can use the module pandasql

import pandasql as ps
sqlcode = '''
select df_1.timestamp
,df_1.A
,df_1.B
,df_2.event
from df_1
inner join df_2
on d1.timestamp between df_2.start and df2.end
'''
newdf = ps.sqldf(sqlcode,locals())
Answered By: chris dorn

Answer #4:

Option 1

idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_2.index=idx
df_1['event']=df_2.loc[df_1.timestamp,'event'].values

Option 2

df_2['timestamp']=df_2['end']
pd.merge_asof(df_1,df_2[['timestamp','event']],on='timestamp',direction ='forward',allow_exact_matches =True)
Out[405]:
            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3
Answered By: BENY

Answer #5:

In this method, we assume TimeStamp objects are used.

df2  start                end                  event
   0 2016-05-14 10:54:31  2016-05-14 10:54:33  E1
   1 2016-05-14 10:54:34  2016-05-14 10:54:37  E2
   2 2016-05-14 10:54:38  2016-05-14 10:54:42  E3
event_num = len(df2.event)
def get_event(t):
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]
df1["event"] = df1.timestamp.transform(get_event)

Explanation of get_event

For each timestamp in df1, say t0 = 2016-05-14 10:54:33,

(t0 >= df2.start) & (t0 <= df2.end) will contain 1 true. (See example 1). Then, take a dot product with np.arange(event_num) to get the index of the event that a t0 belongs to.

Examples:

Example 1

    t0 >= df2.start    t0 <= df2.end     After &     np.arange(3)
0     True                True         ->  T              0        event_idx
1    False                True         ->  F              1     ->     0
2    False                True         ->  F              2

Take t2 = 2016-05-14 10:54:35 for another example

    t2 >= df2.start    t2 <= df2.end     After &     np.arange(3)
0     True                False        ->  F              0        event_idx
1     True                True         ->  T              1     ->     1
2    False                True         ->  F              2

We finally use transform to transform each timestamp into an event.

Answered By: Tai
The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .

Leave a Reply

Your email address will not be published.