r - Efficient way to filter one data frame by ranges in another


Let's say I have a data frame containing a bunch of data and a date/time column indicating when each data point was collected. I also have another data frame that lists time spans, where a "start" column indicates the date/time when each span starts and an "end" column indicates the date/time when each span ends. I want to keep only the rows of the first data frame whose date/time falls inside at least one of those spans.

I've created a dummy example below using simplified data:

main_data = data.frame(day = c(1:30))

spans_to_filter = 
    data.frame(span_number = c(1:6),
               start = c(2,7,1,15,12,23),
               end = c(5,10,4,18,15,26))

I toyed around with a few ways of solving the problem and ended up with the following solution:

require(dplyr)

filtered.main_data = 
    main_data %>%
    rowwise() %>%
    mutate(present = any(day >= spans_to_filter$start & day <= spans_to_filter$end)) %>%
    filter(present) %>%
    data.frame()

This works fine, but I noticed it can take a while to process when I have a lot of data (I assume because I'm performing a row-wise comparison). I'm still learning the ins and outs of R and was wondering whether there is a more efficient way of performing this operation, preferably using dplyr/tidyr?

Here's a function you can run in dplyr to find dates within a given range using the between function (from dplyr). For each value of day, mapply runs between on each of the pairs of start and end dates, and the function uses rowSums to return TRUE if day is within at least one of them. I'm not sure this is the most efficient approach, but it results in roughly a factor of 4 improvement in speed.

test.overlap = function(vals) {
  # mapply builds a logical matrix (one row per value, one column per span);
  # rowSums(...) > 0 is TRUE wherever at least one span contains the value
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$start, spans_to_filter$end)) > 0
}

main_data %>%
  filter(test.overlap(day))
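To make the intermediate step concrete, here is a minimal sketch (an editor's illustration, not part of the original answer) of the logical matrix that mapply builds before rowSums collapses it, using two hypothetical spans:

library(dplyr)

# hypothetical spans 2-5 and 7-10, tested against the values 1:10
m = mapply(function(a, b) between(1:10, a, b), c(2, 7), c(5, 10))
m                 # 10 x 2 logical matrix: one column per span
rowSums(m) > 0    # TRUE for 2-5 and 7-10, i.e. "inside at least one span"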

If you're working with dates (rather than date-times), it may be more efficient to create a vector of the specific dates covered by the spans and test for membership (this is probably not a good approach with date-times, though, since the number of distinct values would explode):

filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["start"]:a["end"]))

main_data %>%
  filter(day %in% filt.vals)
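The dummy example uses integers, so `:` works directly. As a hypothetical illustration (not from the original post; the data frames and column names below are made up), the same membership idea with real Date columns would expand each span with seq(..., by = "day"):

library(dplyr)

# hypothetical spans stored as Date columns
spans_dates = data.frame(start = as.Date(c("2023-01-02", "2023-01-07")),
                         end   = as.Date(c("2023-01-05", "2023-01-10")))

# expand each span into its individual days; do.call(c, ...) keeps the Date class
filt.dates = do.call(c, Map(function(s, e) seq(s, e, by = "day"),
                            spans_dates$start, spans_dates$end))

main_dates = data.frame(date = seq(as.Date("2023-01-01"), as.Date("2023-01-15"), by = "day"))

main_dates %>%
  filter(date %in% filt.dates)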

Now compare execution speeds. I shortened your code to include only the filtering operation:

library(microbenchmark)

microbenchmark(
  op = main_data %>%
    rowwise() %>%
    filter(any(day >= spans_to_filter$start & day <= spans_to_filter$end)),
  eipi10 = main_data %>%
    filter(test.overlap(day)),
  eipi10_2 = main_data %>%
    filter(day %in% filt.vals)
)

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       op 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b 
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a  

UPDATE: Below is a test with a larger data frame and a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out the speed gains are far greater in this case (about a factor of 200 for the mapply/between method, and far greater still for the second method).

main_data = data.frame(day = c(1:100000))

spans_to_filter = 
  data.frame(span_number = c(1:9),
             start = c(2,7,1,15,12,23,90,9000,50000),
             end = c(5,10,4,18,15,26,100,9100,50100))

microbenchmark(
  op = main_data %>%
    rowwise() %>%
    filter(any(day >= spans_to_filter$start & day <= spans_to_filter$end)),
  eipi10 = main_data %>%
    filter(test.overlap(day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["start"]:a["end"]))
    main_data %>%
      filter(day %in% filt.vals)},
  times = 10
)

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       op 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10    
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10    
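For completeness, and as an editor's sketch rather than part of the original answer: the same filter can also be expressed as a range (non-equi) join, which avoids both the row-wise scan and materializing every value covered by the spans. Assuming a recent dplyr (>= 1.1.0, which provides join_by), something like:

library(dplyr)

# keep rows of main_data whose day falls inside at least one span;
# distinct() collapses duplicates produced by overlapping spans
main_data %>%
  inner_join(spans_to_filter, by = join_by(between(day, start, end))) %>%
  distinct(day)

This keeps the dplyr pipeline style the question asked for; data.table's non-equi joins or foverlaps offer a similar interval-join approach if you are already using that package.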
