Here is a description of the problem:

I have a large number of small log files in a directory, with the following assumptions:

- all files follow the naming convention yyyy-mm-dd.log, for example: 2013-01-01.log, 2013-01-02.log
- there are 1,000,000 small files
- the combined size of the files is several terabytes
Now I have to prepend a line number to each line of each file, and the line numbers are cumulative, spreading across the files (the files are ordered by timestamp) in the folder. For example:

- in 2013-01-01.log, the line numbers are 1~2500
- in 2013-01-02.log, the line numbers are 2501~7802
- ...
- in 2016-03-26.log, the line numbers are 1590321~3280165

All files must be overwritten to include the line numbers.
The constraints are:

- the storage device is an SSD and can handle multiple IO requests simultaneously
- the CPU is powerful enough
- the total memory available is 100MB
- try to maximize the performance of the application
- implement and test in Java
After some thinking and searching, here is the best solution I've come up with. The code is a little long, so I'll just give a brief description of each step:

1. Count the number of lines of each file concurrently and save the mapping to a ConcurrentSkipListMap, with the file name as the key and the number of lines as the value, so the keys are ordered.
2. Compute the start line number of each file by traversing the ConcurrentSkipListMap. For example, if the start line number and line count of 2013-01-01.log are 1 and 1500 respectively, then the start line number of 2013-01-02.log is 1501.
3. Prepend the line number to each line of each file: read each file line by line using a BufferedReader, prepend the line number, and write to a corresponding tmp file using a BufferedWriter. Create a thread pool and process the files concurrently.
4. Rename the tmp files back to their original names concurrently using the thread pool.
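A minimal sketch of what these steps look like (class and method names are illustrative; error handling is omitted):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class CumulativeNumbering {

    // Step 1: count the lines of every file concurrently. The map keeps
    // its keys sorted, so iterating it visits the files in timestamp order.
    static ConcurrentSkipListMap<Path, Long> countLines(List<Path> files, ExecutorService pool)
            throws InterruptedException, ExecutionException {
        ConcurrentSkipListMap<Path, Long> counts = new ConcurrentSkipListMap<>();
        List<Future<?>> pending = new ArrayList<>();
        for (Path file : files) {
            pending.add(pool.submit(() -> {
                long n = 0;
                try (BufferedReader r = Files.newBufferedReader(file)) {
                    while (r.readLine() != null) n++;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                counts.put(file, n);
            }));
        }
        for (Future<?> f : pending) f.get(); // wait for all counts
        return counts;
    }

    // Step 2: a single ordered traversal yields each file's start number.
    static Map<Path, Long> startNumbers(ConcurrentSkipListMap<Path, Long> counts) {
        Map<Path, Long> starts = new LinkedHashMap<>();
        long next = 1;
        for (Map.Entry<Path, Long> e : counts.entrySet()) {
            starts.put(e.getKey(), next);
            next += e.getValue();
        }
        return starts;
    }

    // Steps 3 and 4 for one file: rewrite into a tmp file with line
    // numbers prepended, then rename the tmp file over the original.
    // One such task per file is submitted to the same thread pool.
    static void prepend(Path file, long start) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (BufferedReader r = Files.newBufferedReader(file);
             BufferedWriter w = Files.newBufferedWriter(tmp)) {
            String line;
            long n = start;
            while ((line = r.readLine()) != null) {
                w.write(n++ + " " + line);
                w.newLine();
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The rename of step 4 is folded into prepend() via Files.move, so each task handles one file end to end.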
I've tested the program on my MBP, and steps 1 and 3 are the bottlenecks, as expected. Is there a better solution, or some optimization of this solution? Thanks in advance!
Not sure if this question fits the Q&A model, but I'll try to give some hints towards an answer.
Fact 1) Given 1M files and a 100MB limit, there is no way to keep the information about all files in memory at the same time, except perhaps with a lot of bit fiddling like in the old days when we programmed in C.
Fact 2) I don't see a way around reading all files once to count the lines and then rewriting them all, which means reading them all again.
a) Is this a homework question? There may be a way to produce the file names of a folder lazily, one by one, in Java 7 or 8, but I am not aware of it. If there is, use it. If not, you might need to generate the file names instead of listing them. That would require that you can pass a start and end date as input; not sure if that is possible.
b) Given there is a lazy Iterator<File>, whether from the JDK listing the files or from a self-implemented file-name generator, take n of them and partition the work over n threads.
c) Each thread takes care of a slice of the files, reads them, and keeps the total number of lines of its slice.
d) From the totals of each slice, compute the starting number for each slice.
e) Distribute the iterators over the n threads again for the line numbering. Rename each tmp file right after it is written, so you don't have to wait for everything to finish and avoid iterating over the files yet again. A sketch of b) through e) follows below.
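A rough sketch of b) through d), assuming the file names are generated from a start and end date as suggested in a) (all names below are hypothetical):

```java
import java.io.*;
import java.nio.file.*;
import java.time.LocalDate;
import java.util.Iterator;

public class SlicedCounting {

    // a)/b) Generate file names lazily from a date range instead of
    // listing the directory; no list of 1M names is ever held in memory.
    static Iterator<Path> fileNames(Path dir, LocalDate from, LocalDate to) {
        return new Iterator<Path>() {
            LocalDate current = from;
            public boolean hasNext() { return !current.isAfter(to); }
            public Path next() {
                Path p = dir.resolve(current + ".log"); // yyyy-mm-dd.log
                current = current.plusDays(1);
                return p;
            }
        };
    }

    // c) One thread runs this over its own contiguous date slice; only a
    // single file name and one running total are in memory at any time.
    static long countSlice(Path dir, LocalDate from, LocalDate to) throws IOException {
        long total = 0;
        for (Iterator<Path> it = fileNames(dir, from, to); it.hasNext(); ) {
            Path file = it.next();
            if (!Files.exists(file)) continue; // dates with no log file
            try (BufferedReader r = Files.newBufferedReader(file)) {
                while (r.readLine() != null) total++;
            }
        }
        return total;
    }

    // d) From the per-slice totals (in slice order), derive each
    // slice's starting line number.
    static long[] sliceStarts(long[] sliceTotals) {
        long[] starts = new long[sliceTotals.length];
        long next = 1;
        for (int i = 0; i < sliceTotals.length; i++) {
            starts[i] = next;
            next += sliceTotals[i];
        }
        return starts;
    }
}
```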
At each point in time, the information kept in memory is rather small: one file name per thread, the line count over the whole slice, and the current line of the file being read. 100MB is more than enough for this, as long as n is not outrageously large.
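Continuing the sketch above, the second pass for e) would go in the same class (reusing fileNames and the same imports) and keeps exactly that little state per thread; each tmp file is renamed as soon as it is written:

```java
// e) Second pass over one slice: number the lines starting at `start`,
// write each file to a tmp file, and rename it immediately.
static void numberSlice(Path dir, LocalDate from, LocalDate to, long start) throws IOException {
    long n = start;
    for (Iterator<Path> it = fileNames(dir, from, to); it.hasNext(); ) {
        Path file = it.next();
        if (!Files.exists(file)) continue;
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (BufferedReader r = Files.newBufferedReader(file);
             BufferedWriter w = Files.newBufferedWriter(tmp)) {
            String line;
            while ((line = r.readLine()) != null) {
                w.write(n++ + " " + line);
                w.newLine();
            }
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING); // rename right away
    }
}
```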
Edit: Some say Files.find() is lazily populated, yet I could not find the code behind it (some DirectoryStream in Java 8) to see whether the laziness pertains to reading the full contents of one folder at a time, or whether indeed one file name is read at a time. Or whether it depends on the file system used.
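For experimenting with this, a small probe (the directory path is a placeholder). The javadoc only promises that Files.find() returns a "lazily populated" Stream; how many entries the OS reads ahead per call is not specified:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class LazyListingProbe {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("logs"); // placeholder directory

        // Files.find() returns a lazily populated Stream; entries are
        // pulled from the underlying DirectoryStream as it is consumed.
        try (Stream<Path> files = Files.find(dir, 1,
                (path, attrs) -> path.toString().endsWith(".log"))) {
            files.limit(3).forEach(System.out::println);
        }

        // DirectoryStream is the lower-level, iterator-style API; how many
        // entries are fetched per underlying system call is implementation-dependent.
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "*.log")) {
            for (Path p : ds) {
                System.out.println(p);
                break; // stop after the first entry
            }
        }
    }
}
```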