i have huge excel files, i'm stymied "modest" 1 (50 mbytes). need skip on first 2 rows, don't think that's slowdown. can think of else?
wb = load_workbook(myfile,read_only=true) ws = wb.active ndepth = ws.max_row-2 ntime = ws.max_column -1 local_depth = np.zeros((ndepth,)) local_temp = np.zeros((ndepth,ntime)) iterlist = islice(ws.iter_rows(),2,none) start = time.time() i=0 row in iterlist: local_depth[i] = row[0].value j=0 col in row[1:]: local_temp[i,j] = col.value j += 1 += 1 print "done", time.time()-start
it's taking me on 7 minutes load file on m4700 dell precision. 8000 rows , 800 columns. surely there must wrong? there other tweaks somewhere should making in python 2.7 set-up?
thank, john
i try pandas task. it's simple , gives lots of power.
here small example:
import time import numpy np import pandas pd # let's generate sample data (8000 rows, 800 columns) data = np.random.randint(0, 100, (8000, 800)) # let's generate column names 'col001' 'col800' cols = ['col{0:02d}'.format(i) in range(1,801)] # generating pandas data frame numpy array df = pd.dataframe(data, columns=cols) # write generated df (data frame) excel file df.to_excel(r'd:/temp/sample.xlsx', index=false) # done sample data ##################################################################### # # interesting part starts here ... # ##################################################################### start = time.time() # read excel file (skipping first 2 rows) df = pd.read_excel(r'd:/temp/sample.xlsx', skiprows=2) print "done", time.time()-start # print shape of out df print(df.shape)
output (from ipython, on home notebook took approx. 2 minutes):
in [24]: %paste start = time.time() # read excel file df = pd.read_excel(r'd:/temp/sample.xlsx', skiprows=2) print "done", time.time()-start ## -- end pasted text -- done 124.375999928 in [25]: in [25]: df.shape out[25]: (7998, 800) in [26]: # print shape of out df in [27]: print(df.shape) (7998, 800)
now have data in memory df , can process comfortable using panda's power
ps need install following python modules:
- numpy
- pandas
- xlrd
- openpyxl or xlsxwriter (or both)
Comments
Post a Comment