Search code examples
pythonjsoncsvxlsxxls

Read headers and N rows of data


I have a function that accepts four types of files:

  • .csv
  • .xls
  • .xlsx
  • .json

The files can contain up to a couple thousand rows of data. I want to get the column headers for the file, and N example rows of data (doesn't need to be the first N, but need to be a total of N). An example:

From animals.csv:

animals.csv

To:

animals = {
    "dogs": [1, 5, 8],
    "cats": [2, 6, 9],
    "birds": [3, 10, 14],
    "frogs": [4, 8, 11]
}

What's the most efficient way of doing this?

Test files:


Solution

  • Here's my solution. I don't claim for it to be the "most efficient" though. You'll need to install xlrd (pip3 install xlrd). The test files are available in the question description.


    import collections
    import pathlib
    import csv
    import json
    import xlrd
    
    file = "animals.csv"
    
    f_sufx = pathlib.Path(file).suffix
    if f_sufx == ".csv":
        with open(file, 'r') as f:
            reader = csv.DictReader(f)
            d = collections.defaultdict(set)
            for r in reader:
                for k, v in r.items():
                    if len(d[k]) < 3 and v:
                        d[k].add(v)
    elif f_sufx == ".json":
        with open(file, 'r') as f:
            d = collections.defaultdict(set)
            for r in json.load(f):
                for k, v in r.items():
                    if len(d[k]) < 3 and v:
                        d[k].add(v)
    elif f_sufx in [".xls", ".xlsx"]:
        d = collections.defaultdict(set)
        sh = xlrd.open_workbook(file).sheet_by_index(0)  
        for row in range(2, sh.nrows):
            for col in range(sh.ncols):
                if len(d[sh.cell_value(1, col)]) < 3 and sh.cell_value(row, col):
                    d[sh.cell_value(1, col)].add(sh.cell_value(row, col))
    
    print(d)