I want to read a csv as a dictionary in Python, but now I encountered a problem, because the csv contains headers used more than once, like this:
id | name | labels | labels |
---|---|---|---|
01 | one | mytask | myproduct |
02 | two | mylabel |
The standard way to import a csv to python looks like this:
# import csv
import csv
# read csv file to a list of dictionaries
with open('data.csv', 'r') as file:
csv_reader = csv.DictReader(file)
data = [row for row in csv_reader]
print(data)
Sadly this code swallows up the first 'labels' value if there is more than one. This code outputs this:
[
{'id': '01', 'name': 'one', 'labels': 'myproduct'},
{'id': '02', 'name': 'two', 'labels': 'mylabel'},
]
Is there any way to read also the second value for 'labels' without getting to complicated? My preferred output would look like this:
[
{'id': '01', 'name': 'one', 'labels': ['mytask', 'myproduct']},
{'id': '02', 'name': 'two', 'labels': 'mylabel'},
]
You could define your own reader which correctly handles files where duplicate headers exist.
When you initialize the reader, parse the first row to map all header names to the corresponding column indices. When you read the rows, iterate through this map, and assign the values to the correct keys of an output dictionary based on the header name-to-column index map you created earlier.
import csv
class DHDictReader():
def __init__(self, iterable, dialect='excel', **kwargs):
self._reader = csv.reader(iterable, dialect, **kwargs)
# Get header row
headers = next(self._reader)
# Map header names to column indices
self._row_cols = {}
for i, h in enumerate(headers):
if h in self._row_cols:
self._row_cols[h].append(i)
else:
self._row_cols[h] = [i]
# The iterator for this object is itself --
# calling __next__ on this object yields the next record
def __iter__(self):
return self
def __next__(self):
# Get next row
row = next(self._reader)
out_row = {}
for header, col_indices in self._row_cols.items():
# Create list containing all non-empty values for this header
out_row[header] = [row[i] for i in col_indices if i < len(row) and row[i]]
# If this header contains only one value, change it from a list to the first element of the list
if len(out_row[header]) == 1:
out_row[header] = out_row[header][0]
return out_row
@property
def dialect(self):
return self._reader.dialect
@property
def line_num(self):
return self._reader.line_num
@property
def fieldnames(self):
return self._row_cols.keys()
The properties are just to expose attributes that other reader objects from the csv
module also have
Now, this class should be a drop-in replacement for a regular DictReader
:
file_contents = """id,name,labels,labels
01,one,mytask,myproduct
02,two,mylabel
"""
with open('my_file.csv', 'w') as f:
f.write(file_contents)
with open('my_file.csv') as f:
reader = DHDictReader(f)
print(list(reader))
will list out the records in the way you want:
[
{'id': '01', 'name': 'one', 'labels': ['mytask', 'myproduct']},
{'id': '02', 'name': 'two', 'labels': 'mylabel'}
]