java - How to process a large number of small files in limited memory?


Here is a description of the problem:

I have a large number of small log files in a directory. Assume that:

  1. All files follow the naming convention yyyy-mm-dd.log, for example: 2013-01-01.log, 2013-01-02.log.
  2. There are 1,000,000 small files.
  3. The combined size of the files is several terabytes.

Now I have to prepend a line number to each line in each file, and the line numbers are cumulative, spreading across the files (the files are ordered by timestamp) in the folder. For example:

  • in 2013-01-01.log, the line numbers are 1~2500
  • in 2013-01-02.log, the line numbers are 2501~7802
  • ...
  • in 2016-03-26.log, the line numbers are 1590321~3280165

All files are overwritten to include the line numbers.

The constraints are:

  1. The storage device is an SSD and can handle multiple IO requests simultaneously.
  2. The CPU is powerful enough.
  3. The total memory I can use is 100 MB.
  4. Try to maximize the performance of the application.
  5. Implement and test it in Java.

After some thinking and searching, here is the best solution I've come up with. The code is a little long, so I'll just give a brief description of each step, followed by a condensed sketch:

  1. Count the number of lines of each file concurrently and save the mapping in a ConcurrentSkipListMap: the key is the file name, the value is the line count of the file, and the keys are ordered.

  2. Compute the start line number of each file by traversing the ConcurrentSkipListMap. For example, if the start line number and line count of 2013-01-01.log are 1 and 1500 respectively, then the start line number of 2013-01-02.log is 1501.

  3. Prepend the line number to each line of each file: read each file line by line with a BufferedReader, prepend the line number, and write to a corresponding tmp file with a BufferedWriter. Create a thread pool and process the files concurrently.

  4. Rename the tmp files to the original names, concurrently, using the thread pool.
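
A condensed sketch of the four steps (the class and method names are mine, UTF-8 and a single space after the number are assumed, and error handling is minimal):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;
    import java.util.concurrent.*;

    public class LineNumberer {

        public static void main(String[] args) throws Exception {
            Path dir = Paths.get(args[0]);
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());

            NavigableMap<String, Long> counts = countLines(dir, pool);   // step 1
            NavigableMap<String, Long> starts = startNumbers(counts);    // step 2

            List<Future<?>> tasks = new ArrayList<>();                   // steps 3 + 4
            for (Map.Entry<String, Long> e : starts.entrySet()) {
                tasks.add(pool.submit(() -> {
                    numberFile(dir.resolve(e.getKey()), e.getValue());
                    return null;
                }));
            }
            for (Future<?> t : tasks) t.get();   // wait, and surface any IO errors
            pool.shutdown();
        }

        // Step 1: count lines per file concurrently; the map keeps keys sorted.
        static NavigableMap<String, Long> countLines(Path dir, ExecutorService pool)
                throws Exception {
            ConcurrentSkipListMap<String, Long> counts = new ConcurrentSkipListMap<>();
            List<Future<?>> tasks = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.log")) {
                for (Path file : files) {
                    tasks.add(pool.submit(() -> {
                        long n = 0;
                        try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                            while (r.readLine() != null) n++;
                        }
                        counts.put(file.getFileName().toString(), n);
                        return null;
                    }));
                }
            }
            for (Future<?> t : tasks) t.get();
            return counts;
        }

        // Step 2: a single ordered traversal turns line counts into start numbers.
        static NavigableMap<String, Long> startNumbers(NavigableMap<String, Long> counts) {
            NavigableMap<String, Long> starts = new TreeMap<>();
            long next = 1;   // running prefix sum over the ordered file names
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                starts.put(e.getKey(), next);
                next += e.getValue();
            }
            return starts;
        }

        // Steps 3 + 4: rewrite one file with line numbers, then replace the original.
        static void numberFile(Path file, long firstLine) throws IOException {
            Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
            try (BufferedReader r = Files.newBufferedReader(file, StandardCharsets.UTF_8);
                 BufferedWriter w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                long n = firstLine;
                for (String line; (line = r.readLine()) != null; n++) {
                    w.write(n + " " + line);
                    w.newLine();
                }
            }
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        }
    }

Note that the counts map and the list of Futures each hold one entry per file, which for a million files is already a noticeable share of the 100 MB budget.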

I've tested the program on my MBP; step 1 and step 3 are the bottlenecks, as expected. Do you have a better solution, or an optimization of my solution? Thanks in advance!

Not sure if this question fits the Q&A model, but I'll try to give some hints towards an answer.

Fact 1) Given 1M files and a 100 MB limit, there is no way to keep information about all the files in memory at the same time; except, potentially, by doing a lot of bit fiddling as in the old days when we programmed in C.

Fact 2) I don't see a way around reading all the files once to count the lines and then rewriting them all, which means reading them a second time.

a) Is this a homework question? There may be a way to produce the file names of a folder lazily, one by one, in Java 7 or 8, but I am not aware of it. If there is, use it. If not, you might need to generate the file names instead of listing them, which requires that you can pass the start and end date as input. Not sure if that is possible.

b) Given that there is a lazy Iterator&lt;File&gt;, whether from the JDK listing the files or self-implemented to generate the file names, create N of them to partition the work over N threads (see the sketch after e)).

c) Each thread takes care of one slice of the files, reads them, and keeps the total number of lines of its slice.

d) From the totals of the slices, compute the starting line number of each slice.

e) Distribute the iterators over the N threads again for the line numbering. Rename each tmp file right after it is written, and don't wait for everything to finish, so you avoid iterating over the files yet again.
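
Points b) to d), sketched in Java (untested; it assumes the yyyy-mm-dd naming from the question, so the file names can be generated from a start and end date instead of listed):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.time.LocalDate;
    import java.util.concurrent.*;

    public class SliceCounts {

        public static void main(String[] args) throws Exception {
            Path dir = Paths.get(args[0]);
            LocalDate from = LocalDate.parse(args[1]);   // e.g. 2013-01-01
            LocalDate to = LocalDate.parse(args[2]);     // e.g. 2016-03-26
            int n = Runtime.getRuntime().availableProcessors();

            // b) Partition the date range into N contiguous slices, one per thread.
            long days = to.toEpochDay() - from.toEpochDay() + 1;
            LocalDate[] bounds = new LocalDate[n + 1];
            for (int i = 0; i <= n; i++) bounds[i] = from.plusDays(days * i / n);

            // c) Each thread walks its own slice lazily and counts lines; only one
            //    file name and one buffered line are live per thread at any time.
            ExecutorService pool = Executors.newFixedThreadPool(n);
            long[] sliceLines = new long[n];
            CountDownLatch done = new CountDownLatch(n);
            for (int s = 0; s < n; s++) {
                final int slice = s;
                pool.submit(() -> {
                    long lines = 0;
                    for (LocalDate d = bounds[slice]; d.isBefore(bounds[slice + 1]); d = d.plusDays(1)) {
                        Path f = dir.resolve(d + ".log");   // LocalDate.toString() is yyyy-mm-dd
                        if (!Files.exists(f)) continue;     // tolerate gaps in the dates
                        try (BufferedReader r = Files.newBufferedReader(f, StandardCharsets.UTF_8)) {
                            while (r.readLine() != null) lines++;
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }
                    sliceLines[slice] = lines;
                    done.countDown();
                });
            }
            done.await();
            pool.shutdown();

            // d) Prefix sums over the slice totals give each slice its first line number.
            long[] sliceStart = new long[n];
            long next = 1;
            for (int s = 0; s < n; s++) {
                sliceStart[s] = next;
                next += sliceLines[s];
                System.out.println("slice " + s + " starts at line " + sliceStart[s]);
            }
        }
    }

Pass e) then runs the same per-slice date loop a second time, writing each line prefixed with the running number to a tmp file and renaming it as soon as the writer is closed, just like step 3 of the question.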

At each point in time, the information kept in memory is rather small: one file name per thread, the line count of the whole slice, and the current line of the file being read. 100 MB is more than enough for that, as long as N is not outrageously large.

Edit: Some say that Files.find() is lazily populated, yet I could not find the code behind it (some DirectoryStream in Java 8) to see whether the laziness pertains to reading the full contents of one folder at a time, or whether indeed one file name is read at a time; or whether it depends on the file system used.
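
For reference, this is how Files.find() would be used here (a minimal sketch; whether the names really arrive one at a time is exactly the open question above):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Stream;

    public class FindLogs {
        public static void main(String[] args) throws IOException {
            Path dir = Paths.get(args[0]);
            // Files.find returns a lazy Stream<Path> backed by the file system's
            // DirectoryStream; close it (try-with-resources) to release the handle.
            try (Stream<Path> logs = Files.find(dir, 1,
                    (path, attrs) -> path.getFileName().toString().endsWith(".log"))) {
                logs.forEach(System.out::println);   // sorting here would buffer all names in memory
            }
        }
    }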

