I am converting plain R code to SparkR to make efficient use of Spark.
I have the column closedate below.
closedate
2011-01-08
2011-02-07
2012-04-07
2013-04-18
2011-02-07
2010-11-10
2010-12-09
2013-02-18
2010-12-09
2011-03-11
2011-04-10
2013-06-19
2011-04-10
2011-01-06
2011-02-06
2013-04-16
2011-02-06
2015-09-25
2015-09-25
2010-11-10
I want to count the number of times the date has increased or decreased. I have the R code below for that.
    datechange <- function(closedate, dir) {
      close_dt <- as.Date(closedate)
      num_closedt_out <- 0  # number of increases
      num_closedt_in <- 0   # number of decreases
      for (j in 1:length(close_dt)) {
        curr <- close_dt[j]
        # The first element has no predecessor, so compare it with itself.
        if (j > 1) prev <- close_dt[j - 1] else prev <- curr
        if (curr > prev) {
          num_closedt_out <- num_closedt_out + 1
        } else if (curr < prev) {
          num_closedt_in <- num_closedt_in + 1
        }
      }
      if (dir == "inc") ret <- num_closedt_out
      else if (dir == "dec") ret <- num_closedt_in
      ret
    }
I tried to use a SparkR df$col here. Since Spark executes code lazily, I didn't get the value of the length during execution, and I am getting a NaN error.
Here is the modified code that I tried.
    datedirchanges <- function(closedate, dir) {
      close_dt <- to_date(closedate)
      num_closedt_out <- 0
      num_closedt_in <- 0
      col_len <- SparkR::count(close_dt)
      for (j in 1:col_len) {
        curr <- close_dt[j]
        if (j > 1) prev <- close_dt[j - 1] else prev <- curr
        if (curr > prev) {
          num_closedt_out <- num_closedt_out + 1
        } else if (curr < prev) {
          num_closedt_in <- num_closedt_in + 1
        }
      }
      if (dir == "inc") ret <- num_closedt_out
      else if (dir == "dec") ret <- num_closedt_in
      ret
    }
How can I get the length of a column during execution of the code? Or is there a better way to do it?
You cannot, because a Column simply has no length. Unlike what you may expect in R, columns don't represent data but SQL expressions and specific data transformations. Also, the order of values in a Spark DataFrame is arbitrary, so you cannot simply look at the neighbouring rows.
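To illustrate the point about columns, here is a minimal sketch, assuming SparkR 2.x or later and a local session. The three-row data frame is made up for illustration only. It shows that `df$closedate` is an expression object, while a row count is a property of the DataFrame and has to be computed by Spark.

```r
library(SparkR)
sparkR.session(master = "local[1]")

# Hypothetical toy data, only for illustration.
local_df <- data.frame(
  closedate = as.Date(c("2011-01-08", "2011-02-07", "2012-04-07"))
)
df <- createDataFrame(local_df)

# df$closedate is an unevaluated expression, not a vector of values,
# so there is nothing to index or measure on the R side.
col <- df$closedate
class(col)

# The row count belongs to the DataFrame and triggers a Spark job.
n_rows <- count(df)
```

Even with the row count in hand, a row-by-row loop over a Column would still not work, because `col[j]` has no meaning for an expression.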
If the data can be partitioned as in your previous question, then you can use window functions in the same way as I've shown in the answer to your previous question. Otherwise there is no efficient way to handle this using SparkR alone.
Assuming there is a way to determine the order (required) and you can partition your data (desired to get reasonable performance), all you need is something like this:
    SELECT
        CAST(LAG(closedate, 1) OVER w > closedate AS INT) gt,
        CAST(LAG(closedate, 1) OVER w < closedate AS INT) lt,
        CAST(LAG(closedate, 1) OVER w = closedate AS INT) eq
    FROM df
    WINDOW w AS (
        PARTITION BY partition_col ORDER BY order_col
    )
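A rough sketch of how the query above could be run from SparkR and turned into counts, assuming SparkR 2.x or later. The toy data, the constant `partition_col`, and the row-index `order_col` are assumptions made for illustration; in real data they would have to come from your own columns. Note that `gt` flags rows where the previous date is greater than the current one, which is a decrease, and `lt` flags an increase.

```r
library(SparkR)
sparkR.session(master = "local[1]")

# Hypothetical toy data: the first six dates from the question, with an
# assumed single partition and an assumed explicit ordering column.
local_df <- data.frame(
  partition_col = 1L,
  order_col = 1:6,
  closedate = as.Date(c("2011-01-08", "2011-02-07", "2012-04-07",
                        "2013-04-18", "2011-02-07", "2010-11-10"))
)
df <- createDataFrame(local_df)
createOrReplaceTempView(df, "df")

# One 0/1 flag per row; the first row of each partition gets NULLs
# because LAG has no previous value there.
flags <- sql("
  SELECT
    CAST(LAG(closedate, 1) OVER w > closedate AS INT) AS gt,
    CAST(LAG(closedate, 1) OVER w < closedate AS INT) AS lt,
    CAST(LAG(closedate, 1) OVER w = closedate AS INT) AS eq
  FROM df
  WINDOW w AS (PARTITION BY partition_col ORDER BY order_col)
")

# Summing the flags gives the counts; NULLs are ignored by sum.
counts <- collect(agg(
  flags,
  decreased = sum(flags$gt),
  increased = sum(flags$lt),
  unchanged = sum(flags$eq)
))
```

This aggregates over the whole DataFrame. If counts per partition are needed, `partition_col` would have to be kept in the SELECT list and used in a `groupBy` before `agg`.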