I have a function that accepts four types of files:
.csv
.xls
.xlsx
.json
The files can contain up to a couple thousand rows of data. I want to get the column headers for the file, and N
example rows of data (doesn't need to be the first N, but need to be a total of N). An example:
From animals.csv
:
To:
animals = {
"dogs": [1, 5, 8],
"cats": [2, 6, 9],
"birds": [3, 10, 14],
"frogs": [4, 8, 11]
}
What's the most efficient way of doing this?
Test files:
Here's my solution. I don't claim for it to be the "most efficient" though. You'll need to install xlrd (pip3 install xlrd
). The test files are available in the question description.
import collections
import pathlib
import csv
import json
import xlrd
file = "animals.csv"
f_sufx = pathlib.Path(file).suffix
if f_sufx == ".csv":
with open(file, 'r') as f:
reader = csv.DictReader(f)
d = collections.defaultdict(set)
for r in reader:
for k, v in r.items():
if len(d[k]) < 3 and v:
d[k].add(v)
elif f_sufx == ".json":
with open(file, 'r') as f:
d = collections.defaultdict(set)
for r in json.load(f):
for k, v in r.items():
if len(d[k]) < 3 and v:
d[k].add(v)
elif f_sufx in [".xls", ".xlsx"]:
d = collections.defaultdict(set)
sh = xlrd.open_workbook(file).sheet_by_index(0)
for row in range(2, sh.nrows):
for col in range(sh.ncols):
if len(d[sh.cell_value(1, col)]) < 3 and sh.cell_value(row, col):
d[sh.cell_value(1, col)].add(sh.cell_value(row, col))
print(d)