r - Selected rows in data.table not being removed the first time (must remove twice)


I'm getting strange behaviour from data.table in R. I want to keep a subset of rows, e.g., dt <- dt[max.seq == 1], which (I thought) has worked fine in the past. With this particular data set, I don't know whether the problem is my code or some data.table functionality I've misunderstood.

It seems that the command to remove the rows I don't want needs to be run twice to work properly.

Specifically, I'm trying to remove non-sequential firm-level time series by keeping the longest continuous sequence for each firm (or the most recent sequence if there are multiple maximal-length sequences).
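To make the goal concrete, here is a minimal sketch on a made-up single-firm table. The `toy` table and the `run` / `run.len` column names are hypothetical and are not part of my actual code; it only illustrates the result I'm after, not the steps I use further down.

```r
library(data.table)

# Hypothetical toy data: one firm with runs 1982-1983, 1990-1992 and 2001-2003.
toy <- data.table(
  gvkey = 1L,
  fyear = c(1982, 1983, 1990, 1991, 1992, 2001, 2002, 2003)
)

# Label each run of consecutive years within a firm:
# a new run starts wherever the year jumps by more than 1.
toy[, run := cumsum(c(TRUE, diff(fyear) > 1)), by = gvkey]

# Length of each run.
toy[, run.len := .N, by = .(gvkey, run)]

# Among a firm's longest runs, keep the one with the largest run id,
# i.e. the most recent one.
keep <- toy[, .SD[run == max(run[run.len == max(run.len)])], by = gvkey]
```

Here the 1990-1992 and 2001-2003 runs tie at length 3, so `keep` should contain only the 2001-2003 rows.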

========

Here's a subset of the data I'm using:

    library(data.table)
    dt <- data.table(
           gvkey =  c(7221, 7221, 7221, 7221, 7221, 7221, 7221, 7221, 7392, 7392, 7392, 7392, 7392,
                      7392, 7392, 7392, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344,
                      8344, 8344, 10589, 10589, 10589, 10589, 11759, 11759, 12675, 12675, 12675, 12675,
                      12675, 12675, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312,
                      1312, 1312, 13910, 13910, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286,
                      17286, 17286, 17286, 17286, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090,
                      2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090,
                      2090, 2090, 2090, 2090, 2090, 2090, 2212, 2212, 2212),
           fyear =  c(1982, 1983, 1984, 1985, 1990, 1991, 1992, 1993, 1975, 1976, 1977, 1978, 1983,
                      1984, 1985, 1986, 1982, 1983, 1984, 1985, 1986, 1987, 1990, 1991, 1992, 1993,
                      1994, 1995, 1978, 1979, 1983, 1984, 1984, 1988, 1985, 1986, 1987, 2001, 2002,
                      2003, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985,
                      1986, 1986, 1989, 1989, 1990, 1991, 1992, 1993, 1994, 2001, 2002, 2003, 2004,
                      2005, 2006, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966,
                      1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,
                      1980, 1981, 1982, 1983, 1982, 1983, 1984))

    setkey(dt, gvkey)

===========

I run the following commands to identify the rows belonging to each firm's (i.e., gvkey's) longest sequence (rows where seq.lengths equals max.seq), and then create a binary variable (one.segment) to keep only the most recent sequence where necessary.

    dt[, fyear.lag := shift(fyear, n = 1L, type = "lag"), by = gvkey]
    dt[, gap := fyear - fyear.lag]

    dt[, step.idx := 0]                                # initialize
    dt[gap >= 2, step.idx := 1]                        # 1's at each multi-year jump
    dt[, step.idx := cumsum(step.idx), by = gvkey]     # indexes each sequence within a firm
    dt[, seq.lengths := .N, by = .(gvkey, step.idx)]   # length of each sequence
    dt[, max.seq := max(seq.lengths), by = gvkey]      # each firm's longest sequence

    dt <- dt[max.seq == seq.lengths]                   # keep longest sequence(s)

Now, this is not the most efficient method, since I make a copy above when removing the non-longest time series, and again below when I keep only the most recent time series among equal-length maximum series, but I don't think that should affect the functionality issue I'm having.

    dt[, one.segment := 1 * (max.seq == .N), by = gvkey]  # 0 if multiple series remain

    dt[one.segment == 0,  # make the last max.seq elements 1, leave the rest 0
       one.segment := c(rep(0, (.N - max.seq[1])), rep(1, max.seq[1])), by = gvkey]

Edited to report full output

I start with

    nrow(dt)
    # [1] 98
    dt[one.segment == 0, .N]
    # [1] 14

Then I keep only the one.segment == 1 rows.

    dt.out <- dt[one.segment == 1] # finished! ... or am I?

I should have no one.segment == 0 cases left, but I do.

    nrow(dt.out)
    # [1] 76
    dt.out[one.segment == 0, .N]
    # [1] 13
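One check that might help narrow this down is to count the flagged rows two independent ways: once through data.table's `i` expression and once with a plain vector comparison. The sketch below uses a made-up `toy` table (hypothetical, not my data) where the two counts agree; if they disagreed on the real table, that would suggest the problem lies in how the subset is evaluated rather than in the column values.

```r
library(data.table)

# Hypothetical toy table with a 0/1 flag column named like mine.
toy <- data.table(id = 1:6, one.segment = c(1, 0, 0, 1, 1, 0))

# Count via data.table's i expression.
n_i <- toy[one.segment == 0, .N]

# Count via a plain vector comparison on the extracted column.
n_vec <- sum(toy$one.segment == 0)

# The two counts are expected to match.
same_count <- n_i == n_vec
```

The same two lines could be pointed at `dt.out` instead of `toy`.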

But if I run the row-removal command again, the problem is solved (both for this example and for my full data set with nrow(dt) > 35000).

    dt.out2 <- dt.out[one.segment == 1]
    nrow(dt.out2)
    # [1] 63
    dt.out[one.segment == 0, .N]
    # [1] 0

What am I missing?

Thanks!

** output **

    > dt.out
        gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
     1:  1312  1974        NA  NA        0          13      13           1
     2:  1312  1975      1974   1        0          13      13           1
     3:  1312  1976      1975   1        0          13      13           1
     4:  1312  1977      1976   1        0          13      13           1
     5:  1312  1978      1977   1        0          13      13           1
     6:  1312  1979      1978   1        0          13      13           1
     7:  1312  1980      1979   1        0          13      13           1
     8:  1312  1981      1980   1        0          13      13           1
     9:  1312  1982      1981   1        0          13      13           1
    10:  1312  1983      1982   1        0          13      13           1
    11:  1312  1984      1983   1        0          13      13           1
    12:  1312  1985      1984   1        0          13      13           1
    13:  1312  1986      1985   1        0          13      13           1
    14:  2090  1956        NA  NA        0          28      28           1
    15:  2090  1957      1956   1        0          28      28           1
    16:  2090  1958      1957   1        0          28      28           1
    17:  2090  1959      1958   1        0          28      28           1
    18:  2090  1960      1959   1        0          28      28           1
    19:  2090  1961      1960   1        0          28      28           1
    20:  2090  1962      1961   1        0          28      28           1
    21:  2090  1963      1962   1        0          28      28           1
    22:  2090  1964      1963   1        0          28      28           1
    23:  2090  1965      1964   1        0          28      28           1
    24:  2090  1966      1965   1        0          28      28           1
    25:  2090  1967      1966   1        0          28      28           1
    26:  2090  1968      1967   1        0          28      28           1
    27:  2090  1969      1968   1        0          28      28           1
    28:  2090  1970      1969   1        0          28      28           1
    29:  2090  1971      1970   1        0          28      28           1
    30:  2090  1972      1971   1        0          28      28           1
    31:  2090  1973      1972   1        0          28      28           1
    32:  2090  1974      1973   1        0          28      28           1
    33:  2090  1975      1974   1        0          28      28           1
    34:  2090  1976      1975   1        0          28      28           1
    35:  2090  1977      1976   1        0          28      28           1
    36:  2090  1978      1977   1        0          28      28           1
    37:  2090  1979      1978   1        0          28      28           1
    38:  2090  1980      1979   1        0          28      28           1
    39:  2090  1981      1980   1        0          28      28           1
    40:  2090  1982      1981   1        0          28      28           1
    41:  2090  1983      1982   1        0          28      28           1
    42:  2212  1982        NA  NA        0           3       3           1
    43:  2212  1983      1982   1        0           3       3           1
    44:  2212  1984      1983   1        0           3       3           1
    45:  8344  1990      1987   3        1           6       6           1
    46:  8344  1991      1990   1        1           6       6           1
    47:  8344  1992      1991   1        1           6       6           1
    48:  8344  1993      1992   1        1           6       6           1
    49:  8344  1994      1993   1        1           6       6           1
    50:  8344  1995      1994   1        1           6       6           1
    51: 10589  1978        NA  NA        0           2       2           0
    52: 10589  1979      1978   1        0           2       2           0
    53: 10589  1983      1979   4        1           2       2           1
    54: 10589  1984      1983   1        1           2       2           1
    55: 11759  1984        NA  NA        0           1       1           0
    56: 11759  1988      1984   4        1           1       1           1
    57: 12675  1985        NA  NA        0           3       3           0
    58: 12675  1986      1985   1        0           3       3           0
    59: 12675  1987      1986   1        0           3       3           0
    60: 12675  2001      1987  14        1           3       3           1
    61: 12675  2002      2001   1        1           3       3           1
    62: 12675  2003      2002   1        1           3       3           1
    63: 13910  1986        NA  NA        0           1       1           0
    64: 13910  1989      1986   3        1           1       1           1
    65: 17286  1989        NA  NA        0           6       6           0
    66: 17286  1990      1989   1        0           6       6           0
    67: 17286  1991      1990   1        0           6       6           0
    68: 17286  1992      1991   1        0           6       6           0
    69: 17286  1993      1992   1        0           6       6           0
    70: 17286  1994      1993   1        0           6       6           0
    71: 17286  2001      1994   7        1           6       6           1
    72: 17286  2002      2001   1        1           6       6           1
    73: 17286  2003      2002   1        1           6       6           1
    74: 17286  2004      2003   1        1           6       6           1
    75: 17286  2005      2004   1        1           6       6           1
    76: 17286  2006      2005   1        1           6       6           1
        gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment

** session info ***

    > sessionInfo()
    R version 3.2.3 (2015-12-10)
    Platform: x86_64-apple-darwin13.4.0 (64-bit)
    Running under: OS X 10.11.4 (El Capitan)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] data.table_1.9.6

    loaded via namespace (and not attached):
    [1] tools_3.2.3  chron_2.3-47

