I'm getting strange behaviour from data.table in R. I want to keep a subset of rows, e.g., dt <- dt[max.seq == 1], which (I thought) worked fine in the past. With this particular data set I don't know if it's my code or some data.table functionality that I've misunderstood.
It seems the command to remove the rows I don't want needs to be run twice to work properly.
Specifically, I'm trying to remove non-sequential firm-level time series by keeping the longest continuous sequence for each firm (or the most recent sequence if there are multiple maximal-length sequences).
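To make the goal concrete, here is a minimal sketch of the selection rule on made-up data. The table toy and the columns run.id and run.len are hypothetical names used only for illustration, and this is not the code from my actual workflow; it just shows what "longest run, most recent on ties" should return.

```r
library(data.table)

# Hypothetical toy data:
#   firm 1 has two runs of equal length (2000-2001 and 2005-2006)
#   firm 2 has a run of length 1 (1990) and a run of length 3 (1995-1997)
toy <- data.table(
  gvkey = c(1, 1, 1, 1, 2, 2, 2, 2),
  fyear = c(2000, 2001, 2005, 2006, 1990, 1995, 1996, 1997)
)
setkey(toy, gvkey, fyear)

# run.id increases by one whenever the gap to the previous year exceeds 1
toy[, run.id := cumsum(c(0L, diff(fyear) > 1)), by = gvkey]

# run.len is the number of years in each run
toy[, run.len := .N, by = .(gvkey, run.id)]

# per firm: among the longest runs, keep the one with the largest run.id
keep <- toy[, .SD[run.id == max(run.id[run.len == max(run.len)])], by = gvkey]
```

On this toy input, firm 1 should keep 2005-2006 (the later of two tied runs) and firm 2 should keep 1995-1997.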
========
Here's a subset of the data I'm using:
    library(data.table)
    dt <- data.table(
      gvkey = c(7221, 7221, 7221, 7221, 7221, 7221, 7221, 7221,
                7392, 7392, 7392, 7392, 7392, 7392, 7392, 7392,
                8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344, 8344,
                10589, 10589, 10589, 10589,
                11759, 11759,
                12675, 12675, 12675, 12675, 12675, 12675,
                1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312, 1312,
                13910, 13910,
                17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286, 17286,
                2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090,
                2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090, 2090,
                2212, 2212, 2212),
      fyear = c(1982, 1983, 1984, 1985, 1990, 1991, 1992, 1993,
                1975, 1976, 1977, 1978, 1983, 1984, 1985, 1986,
                1982, 1983, 1984, 1985, 1986, 1987, 1990, 1991, 1992, 1993, 1994, 1995,
                1978, 1979, 1983, 1984,
                1984, 1988,
                1985, 1986, 1987, 2001, 2002, 2003,
                1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986,
                1986, 1989,
                1989, 1990, 1991, 1992, 1993, 1994, 2001, 2002, 2003, 2004, 2005, 2006,
                1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969,
                1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983,
                1982, 1983, 1984))
    setkey(dt, gvkey)
===========
I run the following commands to flag the rows corresponding to each firm's (i.e., gvkey) longest sequence (max.seq), and then again (one.segment) to keep only the most recent sequence if necessary.
    dt[, fyear.lag := shift(fyear, n = 1L, type = "lag"), by = gvkey]
    dt[, gap := fyear - fyear.lag]
    dt[, step.idx := 0]                                   # initialize
    dt[gap >= 2, step.idx := 1]                           # 1's at each multi-year jump
    dt[, step.idx := cumsum(step.idx), by = gvkey]        # indexes each sequence by firm
    dt[, seq.lengths := .N, by = .(gvkey, step.idx)]      # length of each sequence
    dt[, max.seq := max(seq.lengths), by = gvkey]         # each firm's longest sequence
    dt <- dt[max.seq == seq.lengths]                      # keep longest sequence(s)
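As a sanity check on the gap and step.idx logic, here is a small sketch on a single made-up firm (the table toy is hypothetical). It shows that the NA gap in a firm's first row is not selected by gap >= 2, so the first sequence keeps index 0.

```r
library(data.table)

# Hypothetical single firm with three runs: 1982-1983, 1990-1991, 1995
toy <- data.table(gvkey = c(1, 1, 1, 1, 1),
                  fyear = c(1982, 1983, 1990, 1991, 1995))

# gap to the previous year within the firm; NA for the first row
toy[, gap := fyear - shift(fyear, n = 1L, type = "lag"), by = gvkey]

toy[, step.idx := 0]                              # initialize
toy[gap >= 2, step.idx := 1]                      # NA gap is treated as not matching
toy[, step.idx := cumsum(step.idx), by = gvkey]   # running count of jumps

# gap is NA, 1, 7, 1, 4, so step.idx ends up as 0, 0, 1, 1, 2
```
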
Now, this is not an efficient method, since I make a copy above when removing the non-longest time series, and again below when I keep only the most recent of the equal-length maximum series, but I don't think that should affect the functionality issue I'm having.
    dt[, one.segment := 1*(max.seq == .N), by = gvkey]    # 0 if there are multiple series that remain
    dt[one.segment == 0,                                  # make last max.seq elements 1, leave rest 0
       one.segment := c(rep(0, (.N - max.seq[1])), rep(1, max.seq[1])), by = gvkey]
Edited to report the full output
I start with
    nrow(dt)                    # [1] 98
    dt[one.segment == 0, .N]    # [1] 14
Then I keep only the one.segment == 1 rows.
    dt.out <- dt[one.segment == 1]    # finished! ... or am I?
I should have no one.segment == 0 cases left, but I do.
    nrow(dt.out)                    # [1] 76
    dt.out[one.segment == 0, .N]    # [1] 13
But if I run the row-removal command again, the problem is solved (both in this example and in my full data set, where nrow(dt) > 35000).
    dt.out2 <- dt.out[one.segment == 1]
    nrow(dt.out2)                   # [1] 63
    dt.out[one.segment == 0, .N]    # [1] 0
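One cross-check I can think of, sketched here on a made-up table (chk is hypothetical), is to compare the data.table subset against plain logical indexing computed outside of i. It rests on an unconfirmed assumption: that data.table's automatic secondary indexing might be involved, and that the datatable.auto.index option switches it off in this version.

```r
library(data.table)

# Hypothetical stand-in for the real table
chk <- data.table(id = 1:6, one.segment = c(1, 0, 1, 0, 1, 1))

# Assumption: disabling automatic secondary indexes forces a plain vector scan
old <- options(datatable.auto.index = FALSE)

# Subset as written in the question
via.dt <- chk[one.segment == 1]

# Same subset, with the row numbers computed by base R first
via.base <- chk[which(chk$one.segment == 1)]

options(old)   # restore the previous setting

# If these disagree on the real data, the optimised subset is suspect
same <- identical(via.dt$id, via.base$id)
```

If same is FALSE on the real data, or if the double-filter problem disappears with the option set, that would point at the subset machinery rather than at my flagging code.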
What am I missing?
Thanks!
** Output **
    > dt.out
        gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
     1:  1312  1974        NA  NA        0          13      13           1
     2:  1312  1975      1974   1        0          13      13           1
     3:  1312  1976      1975   1        0          13      13           1
     4:  1312  1977      1976   1        0          13      13           1
     5:  1312  1978      1977   1        0          13      13           1
     6:  1312  1979      1978   1        0          13      13           1
     7:  1312  1980      1979   1        0          13      13           1
     8:  1312  1981      1980   1        0          13      13           1
     9:  1312  1982      1981   1        0          13      13           1
    10:  1312  1983      1982   1        0          13      13           1
    11:  1312  1984      1983   1        0          13      13           1
    12:  1312  1985      1984   1        0          13      13           1
    13:  1312  1986      1985   1        0          13      13           1
    14:  2090  1956        NA  NA        0          28      28           1
    15:  2090  1957      1956   1        0          28      28           1
    16:  2090  1958      1957   1        0          28      28           1
    17:  2090  1959      1958   1        0          28      28           1
    18:  2090  1960      1959   1        0          28      28           1
    19:  2090  1961      1960   1        0          28      28           1
    20:  2090  1962      1961   1        0          28      28           1
    21:  2090  1963      1962   1        0          28      28           1
    22:  2090  1964      1963   1        0          28      28           1
    23:  2090  1965      1964   1        0          28      28           1
    24:  2090  1966      1965   1        0          28      28           1
    25:  2090  1967      1966   1        0          28      28           1
    26:  2090  1968      1967   1        0          28      28           1
    27:  2090  1969      1968   1        0          28      28           1
    28:  2090  1970      1969   1        0          28      28           1
    29:  2090  1971      1970   1        0          28      28           1
    30:  2090  1972      1971   1        0          28      28           1
    31:  2090  1973      1972   1        0          28      28           1
    32:  2090  1974      1973   1        0          28      28           1
    33:  2090  1975      1974   1        0          28      28           1
    34:  2090  1976      1975   1        0          28      28           1
    35:  2090  1977      1976   1        0          28      28           1
    36:  2090  1978      1977   1        0          28      28           1
    37:  2090  1979      1978   1        0          28      28           1
    38:  2090  1980      1979   1        0          28      28           1
    39:  2090  1981      1980   1        0          28      28           1
    40:  2090  1982      1981   1        0          28      28           1
    41:  2090  1983      1982   1        0          28      28           1
    42:  2212  1982        NA  NA        0           3       3           1
    43:  2212  1983      1982   1        0           3       3           1
    44:  2212  1984      1983   1        0           3       3           1
    45:  8344  1990      1987   3        1           6       6           1
    46:  8344  1991      1990   1        1           6       6           1
    47:  8344  1992      1991   1        1           6       6           1
    48:  8344  1993      1992   1        1           6       6           1
    49:  8344  1994      1993   1        1           6       6           1
    50:  8344  1995      1994   1        1           6       6           1
    51: 10589  1978        NA  NA        0           2       2           0
    52: 10589  1979      1978   1        0           2       2           0
    53: 10589  1983      1979   4        1           2       2           1
    54: 10589  1984      1983   1        1           2       2           1
    55: 11759  1984        NA  NA        0           1       1           0
    56: 11759  1988      1984   4        1           1       1           1
    57: 12675  1985        NA  NA        0           3       3           0
    58: 12675  1986      1985   1        0           3       3           0
    59: 12675  1987      1986   1        0           3       3           0
    60: 12675  2001      1987  14        1           3       3           1
    61: 12675  2002      2001   1        1           3       3           1
    62: 12675  2003      2002   1        1           3       3           1
    63: 13910  1986        NA  NA        0           1       1           0
    64: 13910  1989      1986   3        1           1       1           1
    65: 17286  1989        NA  NA        0           6       6           0
    66: 17286  1990      1989   1        0           6       6           0
    67: 17286  1991      1990   1        0           6       6           0
    68: 17286  1992      1991   1        0           6       6           0
    69: 17286  1993      1992   1        0           6       6           0
    70: 17286  1994      1993   1        0           6       6           0
    71: 17286  2001      1994   7        1           6       6           1
    72: 17286  2002      2001   1        1           6       6           1
    73: 17286  2003      2002   1        1           6       6           1
    74: 17286  2004      2003   1        1           6       6           1
    75: 17286  2005      2004   1        1           6       6           1
    76: 17286  2006      2005   1        1           6       6           1
        gvkey fyear fyear.lag gap step.idx seq.lengths max.seq one.segment
** Session info **
    > sessionInfo()
    R version 3.2.3 (2015-12-10)
    Platform: x86_64-apple-darwin13.4.0 (64-bit)
    Running under: OS X 10.11.4 (El Capitan)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] data.table_1.9.6

    loaded via a namespace (and not attached):
    [1] tools_3.2.3  chron_2.3-47