I have two NumPy arrays with different numbers of rows but the same number of columns. Every array has the same structure: year, month, day, time, number_of_satellite, value_of_data. Each array holds a different kind of data.
How can I compare these two arrays and get the rows common to both, using the first five columns as the comparison key and keeping the corresponding values from both arrays as two columns? Example:
a=[('2015', '1', '1', 0.0, 'g06', 46.29)
   ('2015', '1', '1', 0.0, 'g12', 444.344)
   ('2015', '1', '1', 0.0, 'g14', -99.269)
   ('2015', '1', '1', 0.0, 'g20', 6.874)
   ('2015', '1', '1', 0.0, 'g24', 158.488)
   ('2015', '1', '1', 0.0, 'g25', -60.831)
   ('2015', '1', '1', 0.0, 'g31', -48.234)
   ('2015', '1', '1', 0.0, 'r07', -6.243)]
b=[('2015', '1', '1', 0.0, 'g06', '0.000')
   ('2015', '1', '1', 0.0, 'g12', '0.000')
   ('2015', '1', '1', 0.0, 'g14', '0.000')
   ('2015', '1', '1', 0.0, 'g24', '0.000')
   ('2015', '1', '1', 0.0, 'g25', '0.000')
   ('2015', '1', '1', 0.0, 'g29', '0.000')
   ('2015', '1', '1', 0.0, 'g31', '0.000')]
The result I want to get:
c=[('2015', '1', '1', 0.0, 'g06', 46.29, '0.000')
   ('2015', '1', '1', 0.0, 'g12', 444.344, '0.000')
   ('2015', '1', '1', 0.0, 'g14', -99.269, '0.000')
   ('2015', '1', '1', 0.0, 'g24', 158.488, '0.000')
   ('2015', '1', '1', 0.0, 'g25', -60.831, '0.000')
   ('2015', '1', '1', 0.0, 'g31', -48.234, '0.000')]
I can do this with a loop, but that is not an efficient solution when the arrays have 50000+ rows.
In a backwater of the numpy code there is an easy solution, recfunctions.join_by.
import numpy as np
from numpy.lib import recfunctions

a=[('2015', '1', '1', 0.0, 'g06', 46.29),
   ('2015', '1', '1', 0.0, 'g12', 444.344),
   ('2015', '1', '1', 0.0, 'g14', -99.269),
   ('2015', '1', '1', 0.0, 'g20', 6.874),
   ('2015', '1', '1', 0.0, 'g24', 158.488),
   ('2015', '1', '1', 0.0, 'g25', -60.831),
   ('2015', '1', '1', 0.0, 'g31', -48.234),
   ('2015', '1', '1', 0.0, 'r07', -6.243)]
b=[('2015', '1', '1', 0.0, 'g06', '0.000'),
   ('2015', '1', '1', 0.0, 'g12', '0.000'),
   ('2015', '1', '1', 0.0, 'g14', '0.000'),
   ('2015', '1', '1', 0.0, 'g24', '0.000'),
   ('2015', '1', '1', 0.0, 'g25', '0.000'),
   ('2015', '1', '1', 0.0, 'g29', '0.000'),
   ('2015', '1', '1', 0.0, 'g31', '0.000')]

# 'S' fields are byte strings; under Python 3 they print as b'2015'
dt=[('a', 'S4'), ('b', 'S1'), ('c', 'S1'), ('d',float), ('e', 'S3'), ('f',float)]
aa=np.array(a,dt)
ab=np.array(b,dt)   # the '0.000' strings become floats in field 'f'
flds=list('abcde')  # key fields: the first five columns

mrgd = recfunctions.join_by(flds, aa, ab, usemask=False)
print(mrgd)
print(mrgd.dtype)
producing
[('2015', '1', '1', 0.0, 'g06', 46.29, 0.0)
 ('2015', '1', '1', 0.0, 'g12', 444.344, 0.0)
 ('2015', '1', '1', 0.0, 'g14', -99.269, 0.0)
 ('2015', '1', '1', 0.0, 'g24', 158.488, 0.0)
 ('2015', '1', '1', 0.0, 'g25', -60.831, 0.0)
 ('2015', '1', '1', 0.0, 'g31', -48.234, 0.0)]
[('a', 'S4'), ('b', 'S1'), ('c', 'S1'), ('d', '<f8'), ('e', 'S3'), ('f1', '<f8'), ('f2', '<f8')]
In the current organization of numpy, recfunctions has to be imported separately. https://stackoverflow.com/a/33680606/901925
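As a rough sketch of the other join_by options: the jointype, r1postfix and r2postfix keyword arguments select the kind of join and control how a non-key field present in both arrays is renamed in the output. The field names and the three-row samples below are made up for illustration, and 'U' (str) dtypes are assumed so that the code reads naturally under Python 3.

```python
import numpy as np
from numpy.lib import recfunctions

# Hypothetical field names and sample rows, for illustration only.
dt = [('year', 'U4'), ('month', 'U2'), ('day', 'U2'),
      ('time', float), ('sat', 'U3'), ('value', float)]
r1 = np.array([('2015', '1', '1', 0.0, 'g06', 46.29),
               ('2015', '1', '1', 0.0, 'g12', 444.344),
               ('2015', '1', '1', 0.0, 'g20', 6.874)], dtype=dt)
r2 = np.array([('2015', '1', '1', 0.0, 'g06', 1.5),
               ('2015', '1', '1', 0.0, 'g12', 2.5),
               ('2015', '1', '1', 0.0, 'g29', 3.5)], dtype=dt)

# Join on the first five fields; keep only rows whose key is in both arrays.
keys = ['year', 'month', 'day', 'time', 'sat']
merged = recfunctions.join_by(keys, r1, r2, jointype='inner',
                              r1postfix='_a', r2postfix='_b',
                              usemask=False)

# The shared non-key field 'value' is split into 'value_a' and 'value_b'.
print(merged.dtype.names)
print(merged)
```

With usemask=False the result is a plain structured ndarray rather than a masked array, which is usually what you want for an inner join where no values are missing.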
We'd have to examine the code to see how it is implemented. So I don't know, without further timing, how its speed compares to the equivalent pandas.
With this small sample, recfunctions is faster than pandas, if the time required to create the dataframes is included.
In [302]: %%timeit
   .....: A = pd.DataFrame(a)
   .....: B = pd.DataFrame(b)
   .....: C = pd.merge(A, B, 'inner', left_on=[0,1,2,3,4], right_on=[0,1,2,3,4])
   .....:
100 loops, best of 3: 8.01 ms per loop

In [303]: %%timeit
   .....: aa=np.array(a,dt)
   .....: ab=np.array(b,dt)
   .....: ac=recfunctions.join_by(flds, aa, ab,usemask=False)
   .....:
100 loops, best of 3: 3.35 ms per loop
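For readability, the same pandas merge can be written with named columns instead of positional ones. This is only a sketch: the column names are assumptions taken from the structure described in the question, the suffixes are arbitrary, and only a few rows are used.

```python
import pandas as pd

# Assumed column names, following the structure given in the question.
cols = ['year', 'month', 'day', 'time', 'sat', 'value']
keys = cols[:5]  # the first five columns form the join key

a_rows = [('2015', '1', '1', 0.0, 'g06', 46.29),
          ('2015', '1', '1', 0.0, 'g12', 444.344),
          ('2015', '1', '1', 0.0, 'g20', 6.874)]
b_rows = [('2015', '1', '1', 0.0, 'g06', '0.000'),
          ('2015', '1', '1', 0.0, 'g12', '0.000'),
          ('2015', '1', '1', 0.0, 'g29', '0.000')]

A = pd.DataFrame(a_rows, columns=cols)
B = pd.DataFrame(b_rows, columns=cols)

# Inner merge keeps only the keys found in both frames; the two
# 'value' columns are disambiguated with the given suffixes.
C = pd.merge(A, B, how='inner', on=keys, suffixes=('_a', '_b'))
print(C)
```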
Both are slow compared to numpy set operations such as in1d (which don't attempt any merging):
In [308]: timeit np.intersect1d(aa[flds],ab[flds])
1000 loops, best of 3: 326 µs per loop
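If the key is unique within each array, the set operation can be turned into a merge by hand. The sketch below assumes a reasonably recent NumPy (for the return_indices argument of intersect1d and for recfunctions.repack_fields), and uses made-up field names and sample rows. It is not how join_by itself is implemented, and it has not been timed against the alternatives above.

```python
import numpy as np
from numpy.lib import recfunctions

# Hypothetical field names and sample rows, for illustration only.
dt = [('year', 'U4'), ('month', 'U2'), ('day', 'U2'),
      ('time', float), ('sat', 'U3'), ('value', float)]
aa = np.array([('2015', '1', '1', 0.0, 'g06', 46.29),
               ('2015', '1', '1', 0.0, 'g12', 444.344),
               ('2015', '1', '1', 0.0, 'g20', 6.874)], dtype=dt)
ab = np.array([('2015', '1', '1', 0.0, 'g12', 2.5),
               ('2015', '1', '1', 0.0, 'g29', 3.5),
               ('2015', '1', '1', 0.0, 'g06', 1.5)], dtype=dt)

keys = ['year', 'month', 'day', 'time', 'sat']

# Packed copies that hold only the key fields.
ka = recfunctions.repack_fields(aa[keys])
kb = recfunctions.repack_fields(ab[keys])

# Common keys, plus the index of each common key in aa and in ab.
# With duplicate keys only the first occurrence would be reported.
common, ia, ib = np.intersect1d(ka, kb, return_indices=True)

# Take the matching rows of aa and append the matching values of ab.
merged = recfunctions.append_fields(aa[ia], 'value_b', ab['value'][ib],
                                    usemask=False)
print(merged)
```

Unlike join_by, this keeps the original name of the first value column and silently ignores repeated keys, so it is only suitable when each key occurs once per array.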