Let's say I have a data frame containing a bunch of data and a date/time column indicating when each data point was collected. I also have another data frame that lists time spans, where a "start" column indicates the date/time when each span starts and an "end" column indicates the date/time when each span ends.
I've created a dummy example below using simplified data:
main_data = data.frame(day = c(1:30))

spans_to_filter = data.frame(span_number = c(1:6),
                             start = c(2, 7, 1, 15, 12, 23),
                             end   = c(5, 10, 4, 18, 15, 26))
I toyed around with a few ways of solving the problem and ended up with the following solution:
require(dplyr)

filtered.main_data = main_data %>%
  rowwise() %>%
  mutate(present = any(day >= spans_to_filter$start & day <= spans_to_filter$end)) %>%
  filter(present) %>%
  data.frame()
This works fine, but I noticed it can take a while to process if I have a lot of data (I assume because I'm performing a row-wise comparison). I'm still learning the ins and outs of R and was wondering if there is a more efficient way of performing this operation, preferably using dplyr/tidyr?
Here's a function you can run in dplyr to find dates within a given range using the between() function (from dplyr). For each value of day, mapply() runs between() on each of the pairs of start and end dates, and the function uses rowSums() to return TRUE if day falls between at least one of them. I'm not sure if it's the most efficient approach, but it results in a factor-of-4 improvement in speed.
test.overlap = function(vals) {
  rowSums(mapply(function(a, b) between(vals, a, b),
                 spans_to_filter$start, spans_to_filter$end)) > 0
}

main_data %>%
  filter(test.overlap(day))
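To make the mechanics concrete, here's a small sketch (my illustration, not part of the original answer) of the intermediate logical matrix that mapply() builds before rowSums() collapses it, using the small example data above:

# between(vals, a, b) is vectorized over vals, so mapply() stacks the six
# logical vectors into a matrix: one row per day, one column per span
m = mapply(function(a, b) between(main_data$day, a, b),
           spans_to_filter$start, spans_to_filter$end)
dim(m)          # 30 6
m[1, ]          # day 1 falls only in span 3 (1:4):
                # FALSE FALSE TRUE FALSE FALSE FALSE
rowSums(m) > 0  # TRUE wherever day lies inside at least one span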
If you're working with dates (rather than date-times), it may be more efficient to create a vector of the specific dates and test for membership (this might be a better approach even with date-times):
filt.vals = as.vector(apply(spans_to_filter, 1, function(a) a["start"]:a["end"]))

main_data %>%
  filter(day %in% filt.vals)
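For actual Date columns, one way to build the membership vector is seq.Date(), which keeps the Date class when expanding each span. This is a minimal sketch of my own with made-up frame names, column names, and dates:

library(dplyr)

# Hypothetical frames with real Date columns (illustrative values only)
main_dates = data.frame(when = as.Date("2015-01-01") + 0:29)
date_spans = data.frame(start = as.Date(c("2015-01-02", "2015-01-15")),
                        end   = as.Date(c("2015-01-05", "2015-01-20")))

# Expand each span into its individual dates; do.call(c, ...) keeps the
# Date class where unlist() would drop it
filt.dates = do.call(c, Map(function(a, b) seq(a, b, by = "day"),
                            date_spans$start, date_spans$end))

main_dates %>%
  filter(when %in% filt.dates)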
Now let's compare execution speeds. I shortened the code to include only the filtering operation:
library(microbenchmark)

microbenchmark(
  op = main_data %>%
    rowwise() %>%
    filter(any(day >= spans_to_filter$start & day <= spans_to_filter$end)),
  eipi10 = main_data %>%
    filter(test.overlap(day)),
  eipi10_2 = main_data %>%
    filter(day %in% filt.vals)
)

Unit: microseconds
     expr      min       lq      mean    median       uq      max neval cld
       op 2496.019 2618.994 2875.0402 2701.8810 2954.774 4741.481   100   c
   eipi10  658.941  686.933  782.8840  714.4440  770.679 2474.941   100  b
 eipi10_2  579.338  601.355  655.1451  619.2595  672.535 1032.145   100 a
UPDATE: Below is a test with a much larger data frame, plus a few extra date ranges to match (thanks to @Frank for suggesting this in his now-deleted comment). It turns out that the speed gains are far greater in this case (about a factor of 200 for the mapply()/between() method, and far greater still for the second method).
main_data = data.frame(day = c(1:100000))

spans_to_filter = data.frame(span_number = c(1:9),
                             start = c(2, 7, 1, 15, 12, 23, 90, 9000, 50000),
                             end   = c(5, 10, 4, 18, 15, 26, 100, 9100, 50100))

microbenchmark(
  op = main_data %>%
    rowwise() %>%
    filter(any(day >= spans_to_filter$start & day <= spans_to_filter$end)),
  eipi10 = main_data %>%
    filter(test.overlap(day)),
  eipi10_2 = {
    filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["start"]:a["end"]))
    main_data %>%
      filter(day %in% filt.vals)
  },
  times = 10
)

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
       op 5130.903866 5137.847177 5201.989501 5216.840039 5246.961077 5276.856648    10   b
   eipi10   24.209111   25.434856   29.526571   26.455813   32.051920   48.277326    10  a
 eipi10_2    2.505509    2.618668    4.037414    2.892234    6.222845    8.266612    10  a
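One sanity check worth running before trusting the timings (my addition, not from the benchmark above) is to confirm that all three approaches select the same rows:

# All three methods should agree on which days survive the filter
res_op = main_data %>%
  rowwise() %>%
  filter(any(day >= spans_to_filter$start & day <= spans_to_filter$end))
res_between = main_data %>%
  filter(test.overlap(day))
filt.vals = unlist(apply(spans_to_filter, 1, function(a) a["start"]:a["end"]))
res_membership = main_data %>%
  filter(day %in% filt.vals)

identical(res_op$day, res_between$day)          # expect TRUE
identical(res_between$day, res_membership$day)  # expect TRUE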