Search code examples
pythoncsvheader

Read csv with duplicate column headers in Python


I want to read a csv as a dictionary in Python, but now I encountered a problem, because the csv contains headers used more than once, like this:

id name labels labels
01 one mytask myproduct
02 two mylabel

The standard way to import a csv to python looks like this:

# import csv
import csv
# read csv file to a list of dictionaries
with open('data.csv', 'r') as file:
    csv_reader = csv.DictReader(file)
    data = [row for row in csv_reader]
print(data)

Sadly this code swallows up the first 'labels' value if there is more than one. This code outputs this:

[
    {'id': '01', 'name': 'one', 'labels': 'myproduct'},
    {'id': '02', 'name': 'two', 'labels': 'mylabel'},
]

Is there any way to read also the second value for 'labels' without getting to complicated? My preferred output would look like this:

[
    {'id': '01', 'name': 'one', 'labels': ['mytask', 'myproduct']},
    {'id': '02', 'name': 'two', 'labels': 'mylabel'},
]

Solution

  • You could define your own reader which correctly handles files where duplicate headers exist.

    When you initialize the reader, parse the first row to map all header names to the corresponding column indices. When you read the rows, iterate through this map, and assign the values to the correct keys of an output dictionary based on the header name-to-column index map you created earlier.

    import csv
    
    class DHDictReader():
        def __init__(self, iterable, dialect='excel', **kwargs):
            self._reader = csv.reader(iterable, dialect, **kwargs)
            
            # Get header row
            headers = next(self._reader)
    
            # Map header names to column indices
            self._row_cols = {}
            for i, h in enumerate(headers):
                if h in self._row_cols:
                    self._row_cols[h].append(i)
                else:
                    self._row_cols[h] = [i]
        
        # The iterator for this object is itself -- 
        # calling __next__ on this object yields the next record
        def __iter__(self):
            return self
    
        def __next__(self):
            # Get next row
            row = next(self._reader)
            out_row = {}
            for header, col_indices in self._row_cols.items():
                # Create list containing all non-empty values for this header
                out_row[header] = [row[i] for i in col_indices if i < len(row) and row[i]]
    
                # If this header contains only one value, change it from a list to the first element of the list
                if len(out_row[header]) == 1:
                    out_row[header] = out_row[header][0]
    
            return out_row
        
        @property
        def dialect(self):
            return self._reader.dialect
        
        @property
        def line_num(self):
            return self._reader.line_num
        
        @property
        def fieldnames(self):
            return self._row_cols.keys()
    

    The properties are just to expose attributes that other reader objects from the csv module also have

    Now, this class should be a drop-in replacement for a regular DictReader:

    file_contents = """id,name,labels,labels
    01,one,mytask,myproduct
    02,two,mylabel
    """
    
    with open('my_file.csv', 'w') as f:
        f.write(file_contents)
    
    with open('my_file.csv') as f:
        reader = DHDictReader(f)
        print(list(reader))
    

    will list out the records in the way you want:

    [
      {'id': '01', 'name': 'one', 'labels': ['mytask', 'myproduct']}, 
      {'id': '02', 'name': 'two', 'labels': 'mylabel'}
    ]