How to get matching rows from two n-dimensional arrays in Python?


I have two numpy arrays with a different number of rows but the same number of columns. The structure of each row is: year, month, day, time, number_of_satellite, value_of_data. Each array holds a different kind of data.

How can I compare the two arrays and get the rows common to both, using the first 5 columns as the comparison key and keeping the corresponding value columns from each array? Example:

a = [('2015', '1', '1', 0.0, 'g06', 46.29)
     ('2015', '1', '1', 0.0, 'g12', 444.344)
     ('2015', '1', '1', 0.0, 'g14', -99.269)
     ('2015', '1', '1', 0.0, 'g20', 6.874)
     ('2015', '1', '1', 0.0, 'g24', 158.488)
     ('2015', '1', '1', 0.0, 'g25', -60.831)
     ('2015', '1', '1', 0.0, 'g31', -48.234)
     ('2015', '1', '1', 0.0, 'r07', -6.243)]

b = [('2015', '1', '1', 0.0, 'g06', '0.000')
     ('2015', '1', '1', 0.0, 'g12', '0.000')
     ('2015', '1', '1', 0.0, 'g14', '0.000')
     ('2015', '1', '1', 0.0, 'g24', '0.000')
     ('2015', '1', '1', 0.0, 'g25', '0.000')
     ('2015', '1', '1', 0.0, 'g29', '0.000')
     ('2015', '1', '1', 0.0, 'g31', '0.000')]

The result I want to get:

c = [('2015', '1', '1', 0.0, 'g06', 46.29, '0.000')
     ('2015', '1', '1', 0.0, 'g12', 444.344, '0.000')
     ('2015', '1', '1', 0.0, 'g14', -99.269, '0.000')
     ('2015', '1', '1', 0.0, 'g24', 158.488, '0.000')
     ('2015', '1', '1', 0.0, 'g25', -60.831, '0.000')
     ('2015', '1', '1', 0.0, 'g31', -48.234, '0.000')]

I can do this with a loop, but that is not an efficient solution when the arrays have 50000+ rows.
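For reference, a minimal sketch of the kind of loop-based approach I mean (purely illustrative, assuming a and b are the lists of tuples shown above); it keys each row of b on its first five columns and walks a once:

# Illustrative pure-Python baseline: dict keyed on the first five columns of b.
lookup = {row[:5]: row[5] for row in b}          # key -> value_of_data from b
# Keep rows of a whose key also occurs in b, appending b's value column.
c = [row + (lookup[row[:5]],) for row in a if row[:5] in lookup]

This reproduces the c shown above, but it still runs at Python speed, row by row.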

In a backwater of the numpy code base there is an easy solution: recfunctions.join_by.

import numpy as np
from numpy.lib import recfunctions

a = [('2015', '1', '1', 0.0, 'g06', 46.29),
     ('2015', '1', '1', 0.0, 'g12', 444.344),
     ('2015', '1', '1', 0.0, 'g14', -99.269),
     ('2015', '1', '1', 0.0, 'g20', 6.874),
     ('2015', '1', '1', 0.0, 'g24', 158.488),
     ('2015', '1', '1', 0.0, 'g25', -60.831),
     ('2015', '1', '1', 0.0, 'g31', -48.234),
     ('2015', '1', '1', 0.0, 'r07', -6.243)]
b = [('2015', '1', '1', 0.0, 'g06', '0.000'),
     ('2015', '1', '1', 0.0, 'g12', '0.000'),
     ('2015', '1', '1', 0.0, 'g14', '0.000'),
     ('2015', '1', '1', 0.0, 'g24', '0.000'),
     ('2015', '1', '1', 0.0, 'g25', '0.000'),
     ('2015', '1', '1', 0.0, 'g29', '0.000'),
     ('2015', '1', '1', 0.0, 'g31', '0.000')]

# structured dtype: the first five fields form the join key, 'f' is the value column
dt = [('a', 'S4'), ('b', 'S1'), ('c', 'S1'), ('d', float), ('e', 'S3'), ('f', float)]
aa = np.array(a, dt)
ab = np.array(b, dt)

flds = list('abcde')          # join on the first five fields
mrgd = recfunctions.join_by(flds, aa, ab, usemask=False)
print(mrgd)
print(mrgd.dtype)

producing

[('2015', '1', '1', 0.0, 'g06', 46.29, 0.0)
 ('2015', '1', '1', 0.0, 'g12', 444.344, 0.0)
 ('2015', '1', '1', 0.0, 'g14', -99.269, 0.0)
 ('2015', '1', '1', 0.0, 'g24', 158.488, 0.0)
 ('2015', '1', '1', 0.0, 'g25', -60.831, 0.0)
 ('2015', '1', '1', 0.0, 'g31', -48.234, 0.0)]
[('a', 'S4'), ('b', 'S1'), ('c', 'S1'), ('d', '<f8'), ('e', 'S3'), ('f1', '<f8'), ('f2', '<f8')]

In the current numpy organization, recfunctions has to be imported separately; see https://stackoverflow.com/a/33680606/901925.

We'd have to examine the join_by code to see how it is implemented, and I don't know, without further timing, how its speed compares with the equivalent pandas merge.


With this small sample, recfunctions is faster than pandas, even when the time required to create the DataFrames is included.

In [302]: %%timeit
     ...: a = pd.DataFrame(a)
     ...: b = pd.DataFrame(b)
     ...: c = pd.merge(a, b, 'inner', left_on=[0, 1, 2, 3, 4], right_on=[0, 1, 2, 3, 4])
     ...:
100 loops, best of 3: 8.01 ms per loop

In [303]: %%timeit
     ...: aa = np.array(a, dt)
     ...: ab = np.array(b, dt)
     ...: ac = recfunctions.join_by(flds, aa, ab, usemask=False)
     ...:
100 loops, best of 3: 3.35 ms per loop
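For completeness, a sketch of the pandas version run outside %%timeit (assuming pandas is imported as pd and a, b are the lists of tuples defined above; the DataFrame column labels 0-5 come from building the frames directly from the tuples):

import pandas as pd

df_a = pd.DataFrame(a)          # columns are labeled 0..5 by default
df_b = pd.DataFrame(b)
# inner join on the first five columns; the overlapping value column (5)
# comes back twice with pandas' default _x/_y suffixes
c = pd.merge(df_a, df_b, how='inner', on=[0, 1, 2, 3, 4])
print(c)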

Both are slow compared with numpy set operations like in1d and intersect1d (which don't attempt the column merge):

In [308]: timeit np.intersect1d(aa[flds], ab[flds])
1000 loops, best of 3: 326 µs per loop
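If all you need is to pull the matching rows out of each array rather than build a merged record, a set-based route along these lines should do it (a sketch using the aa, ab, and flds defined above, and assuming in1d handles the structured key view the same way intersect1d does in the timing above):

# Boolean masks over the key fields only: True where a row's key occurs in the other array.
mask_a = np.in1d(aa[flds], ab[flds])
mask_b = np.in1d(ab[flds], aa[flds])
common_a = aa[mask_a]     # rows of aa whose key appears in ab
common_b = ab[mask_b]     # rows of ab whose key appears in aa

You'd still have to stitch the value columns together yourself, which is exactly the part join_by (or pd.merge) does for you.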
