python - openpyxl read_only use_iterators -


I have huge Excel files, and I'm stymied by a "modest" one (50 MBytes). I need to skip the first 2 rows, but I don't think that's the slowdown. Can anyone think of anything else?

wb = load_workbook(myfile, read_only=True)
ws = wb.active

ndepth = ws.max_row - 2
ntime = ws.max_column - 1

local_depth = np.zeros((ndepth,))
local_temp = np.zeros((ndepth, ntime))

iterlist = islice(ws.iter_rows(), 2, None)

start = time.time()

i = 0
for row in iterlist:
    local_depth[i] = row[0].value

    j = 0
    for col in row[1:]:
        local_temp[i, j] = col.value
        j += 1
    i += 1

print "done", time.time() - start
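One thing worth trying before switching libraries: newer openpyxl versions (2.6+) let `iter_rows` yield plain values instead of `Cell` objects via `values_only=True`, which avoids a lot of per-cell overhead. A minimal self-contained sketch of that variant (the tiny sample workbook and the path `sample_small.xlsx` are just placeholders so the snippet runs; substitute your own file):

```python
import numpy as np
from itertools import islice
from openpyxl import Workbook, load_workbook

# build a tiny sample workbook so the snippet is self-contained
# (hypothetical path; substitute your real file)
path = 'sample_small.xlsx'
wb = Workbook()
ws = wb.active
for r in range(6):                      # 2 header rows + 4 data rows
    ws.append([r * 10 + c for c in range(4)])
wb.save(path)

# read it back in read-only mode; values_only=True (openpyxl >= 2.6)
# yields tuples of plain values instead of Cell objects
wb = load_workbook(path, read_only=True)
ws = wb.active

ndepth = ws.max_row - 2
ntime = ws.max_column - 1
local_depth = np.zeros((ndepth,))
local_temp = np.zeros((ndepth, ntime))

# skip the first 2 rows, then assign each row slice in one go
for i, row in enumerate(islice(ws.iter_rows(values_only=True), 2, None)):
    local_depth[i] = row[0]
    local_temp[i, :] = row[1:]

print(local_depth.shape, local_temp.shape)
```

Assigning `row[1:]` to a whole NumPy row slice also removes the inner Python loop over columns.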

It's taking about 7 minutes to load the file on an M4700 Dell Precision. It has 8000 rows and 800 columns. Surely there must be something wrong? Are there other tweaks somewhere I should be making in my Python 2.7 set-up?

Thanks, John

I would try pandas for this task. It's simple and gives you lots of power.

Here is a small example:

import time
import numpy as np
import pandas as pd

# let's generate sample data (8000 rows, 800 columns)
data = np.random.randint(0, 100, (8000, 800))

# let's generate column names 'col001' through 'col800'
cols = ['col{0:03d}'.format(i) for i in range(1, 801)]

# generate a pandas DataFrame from the NumPy array
df = pd.DataFrame(data, columns=cols)

# write the generated df (DataFrame) to an Excel file
df.to_excel(r'd:/temp/sample.xlsx', index=False)
# done with the sample data

#####################################################################
#
# the interesting part starts here ...
#
#####################################################################

start = time.time()

# read the Excel file (skipping the first 2 rows)
df = pd.read_excel(r'd:/temp/sample.xlsx', skiprows=2)

print "done", time.time() - start

# print the shape of the resulting df
print(df.shape)

Output (from IPython; on my home notebook it took approx. 2 minutes):

In [24]: %paste
start = time.time()

# read the Excel file
df = pd.read_excel(r'd:/temp/sample.xlsx', skiprows=2)

print "done", time.time() - start
## -- End pasted text --
done 124.375999928

In [25]: df.shape
Out[25]: (7998, 800)

In [26]: # print the shape of the resulting df

In [27]: print(df.shape)
(7998, 800)

Now you have the data in memory as df and can process it however you find comfortable, using pandas' power.
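For instance, to recover arrays like `local_depth` and `local_temp` from the question, you can slice the DataFrame once instead of looping cell by cell. A sketch (the small stand-in DataFrame below replaces the `read_excel` result, and the column layout - depth first, then the time series - is assumed from the original code):

```python
import numpy as np
import pandas as pd

# stand-in for the df returned by pd.read_excel(...)
df = pd.DataFrame(np.arange(20).reshape(5, 4))

# first column as depth, the remaining columns as the matrix --
# vectorised, no per-cell Python loop needed
local_depth = df.iloc[:, 0].values
local_temp = df.iloc[:, 1:].values

print(local_depth.shape, local_temp.shape)
```

`.values` returns the underlying NumPy arrays, so the rest of John's NumPy-based processing can stay unchanged.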

PS: You need to have the following Python modules installed:

  • numpy
  • pandas
  • xlrd
  • openpyxl or xlsxwriter (or both)
