I am reading some data files stored as excel from online. My current process involves downloading the file to disk using the retrieve function defined below which uses the urllib2 library and then parses the excel document using the traverseWorkbook function. The traverse function uses the xlrd library for parsing the excel.
I would like to perform the same operation without requiring downloading the file to disk but will prefer to keep the file in memory and parse it memory.
Not sure how to even proceed, but I'm sure its possible.
def retrieveFile(url, filename):
try:
req = urllib2.urlopen(url)
CHUNK = 16 * 1024
with open(filename, 'wb') as fp:
while True:
chunk = req.read(CHUNK)
if not chunk: break
fp.write(chunk)
return True
except Exception, e:
return None
def traverseWorkbook(filename):
values = []
wb = open_workbook(filename)
for s in wb.sheets():
for row in range(s.nrows):
if row > 10:
rowData = processRow(s, row, type)
if rowData:
values.append(rowData)
You can read the entire file into memory using:
data = urllib2.urlopen(url).read()
Once the file is in memory, you can load it into xlrd
using the file_contents
argument of open_workbook
:
wb = xlrd.open_workbook(url, file_contents=data)
Pass the url in as the filename as the documentation states it might be used in messages; otherwise, it will be ignored.
Thus, your traverseWorbook
method can be rewritten as:
def traverseWorkbook(url):
values = []
data = urllib2.urlopen(url).read()
wb = xlrd.open_workbook(filename, file_contents=data)
for s in wb.sheets():
for row in range(s.nrows):
if row > 10:
rowData = processRow(s, row, type)
if rowData:
values.append(rowData)
return values