How can I find the length of a column in SparkR?


I am converting plain R code to SparkR to make efficient use of Spark.

I have the column closedate below.

closedate
2011-01-08
2011-02-07
2012-04-07
2013-04-18
2011-02-07
2010-11-10
2010-12-09
2013-02-18
2010-12-09
2011-03-11
2011-04-10
2013-06-19
2011-04-10
2011-01-06
2011-02-06
2013-04-16
2011-02-06
2015-09-25
2015-09-25
2010-11-10

I want to count the number of times the date has increased or decreased. I have the R code below for that.

datechange <- function(closedate, dir){
  close_dt <- as.Date(closedate)
  num_closedt_out = 0
  num_closedt_in = 0
  for(j in 1:length(close_dt))
  {
    curr <- close_dt[j]
    if (j > 1)
      prev <- close_dt[j-1]
    else
      prev <- curr
    if (curr > prev){
      num_closedt_out = num_closedt_out + 1
    }
    else if (curr < prev){
      num_closedt_in = num_closedt_in + 1
    }
  }
  if (dir=="inc")
    ret <- num_closedt_out
  else if (dir=="dec")
    ret <- num_closedt_in
  ret
}
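For example, on the sample column above the function can be called like this (plain R; the vector name closedate_vec is just for illustration):

# Plain R usage of datechange() on the sample column above; closedate_vec
# is an illustrative name for that vector of date strings.
closedate_vec <- c("2011-01-08", "2011-02-07", "2012-04-07", "2013-04-18",
                   "2011-02-07", "2010-11-10", "2010-12-09", "2013-02-18",
                   "2010-12-09", "2011-03-11", "2011-04-10", "2013-06-19",
                   "2011-04-10", "2011-01-06", "2011-02-06", "2013-04-16",
                   "2011-02-06", "2015-09-25", "2015-09-25", "2010-11-10")

datechange(closedate_vec, "inc")   # number of times the date increased
datechange(closedate_vec, "dec")   # number of times the date decreased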

I tried to use a SparkR column (df$col) here. Since Spark lazily executes code, I didn't get the value of the length during execution and was getting a NaN error.

Here is the modified code I tried.

datedirchanges <- function(closedate, dir){
  close_dt <- to_date(closedate)
  num_closedt_out = 0
  num_closedt_in = 0
  col_len <- SparkR::count(close_dt)
  for(j in 1:col_len)
  {
    curr <- close_dt[j]
    if (j > 1)
      prev <- close_dt[j-1]
    else
      prev <- curr
    if (curr > prev){
      num_closedt_out = num_closedt_out + 1
    }
    else if (curr < prev){
      num_closedt_in = num_closedt_in + 1
    }
  }
  if (dir=="inc")
    ret <- num_closedt_out
  else if (dir=="dec")
    ret <- num_closedt_in
  ret
}

How can I get the length of a column during execution of the code? Or is there another, better way to do this?

You cannot, because a column simply has no length. Unlike what you may expect in R, columns don't represent data but SQL expressions and specific data transformations. Moreover, the order of values in a Spark DataFrame is arbitrary, so you cannot simply work around that.
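To see this, here is a minimal sketch (assuming SparkR 2.x with an active Spark session; df is just an example name):

# Sketch only: df$closedate is a Column, i.e. an unevaluated SQL expression,
# not an R vector of values you can index or take the length of.
df <- createDataFrame(data.frame(closedate = c("2011-01-08", "2011-02-07")))

class(df$closedate)   # "Column" -- an expression, not data
nrow(df)              # row count of the whole SparkDataFrame, if that is what you need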

If your data can be partitioned, as in your previous question, you can use window functions in the same way I've shown in the answer to your previous question. Otherwise there is no efficient way to handle this using SparkR alone.

Assuming there is a way to determine the order (required) and you can partition your data (desired for reasonable performance), you need something like this:

SELECT
  CAST(LAG(closedate, 1) OVER w > closedate AS INT) gt,
  CAST(LAG(closedate, 1) OVER w < closedate AS INT) lt,
  CAST(LAG(closedate, 1) OVER w = closedate AS INT) eq
FROM df
WINDOW w AS (
  PARTITION BY partition_col ORDER BY order_col
)
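In SparkR this could be wired up roughly as follows (a sketch assuming SparkR 2.x; partition_col and order_col are placeholders for whatever defines the grouping and the row order in your data):

# Register the SparkDataFrame so it can be referenced from SQL.
createOrReplaceTempView(df, "df")

flags <- sql("
  SELECT
    CAST(LAG(closedate, 1) OVER w > closedate AS INT) gt,
    CAST(LAG(closedate, 1) OVER w < closedate AS INT) lt,
    CAST(LAG(closedate, 1) OVER w = closedate AS INT) eq
  FROM df
  WINDOW w AS (PARTITION BY partition_col ORDER BY order_col)
")

# gt = 1 where the previous closedate was larger (a decrease),
# lt = 1 where it was smaller (an increase); summing the 0/1 flags
# replaces the explicit loop from the plain R function. LAG is NULL on
# the first row of each partition, and NULLs are ignored by SUM.
createOrReplaceTempView(flags, "flags")
head(sql("SELECT SUM(lt) AS num_increases, SUM(gt) AS num_decreases FROM flags"))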
