I've got a csv file in something of an entity-attribute-value format (i.e., my event_id
is non-unique and repeats k times for the k associated attributes):
event_id, attribute_id, value
1, 1, a
1, 2, b
1, 3, c
2, 1, a
2, 2, b
2, 3, c
2, 4, d
Are there any handy tricks to transform a variable number of attributes (i.e., rows) into columns? The key here is that the output ought to be an m x n table of structured data, where m = max(k). Filling in missing attributes with NULL would be optimal:
event_id, 1, 2, 3, 4
1, a, b, c, null
2, a, b, c, d
My plan was to (1) convert the csv to a JSON object that looks like this:
data = [{'value': 'a', 'id': '1', 'event_id': '1', 'attribute_id': '1'},
{'value': 'b', 'id': '2', 'event_id': '1', 'attribute_id': '2'},
{'value': 'a', 'id': '3', 'event_id': '2', 'attribute_id': '1'},
{'value': 'b', 'id': '4', 'event_id': '2', 'attribute_id': '2'},
{'value': 'c', 'id': '5', 'event_id': '2', 'attribute_id': '3'},
{'value': 'd', 'id': '6', 'event_id': '2', 'attribute_id': '4'}]
(2) extract unique event ids:
events = set()
for item in data:
    events.add(item['event_id'])
(3) create a list of lists, where each inner list is a list of the attributes for the corresponding parent event:
from itertools import groupby
attributes = [[k['value'] for k in j] for i, j in groupby(data, key=lambda x: x['event_id'])]  # relies on data already being ordered by event_id
(4) create a dictionary that brings events and attributes together:
event_dict = dict(zip(sorted(events), attributes))  # sorted, because a set has no guaranteed order
which looks like this:
{'1': ['a', 'b'], '2': ['a', 'b', 'c', 'd']}
I'm not sure how to get all inner lists to be the same length with NULL values populated where necessary. It seems like something that needs to be done in step (3). Also, creating n lists full of m NULL values had crossed my mind, then iterating through each list and populating the value using attribute_id as the list location; but that seems janky.
Your basic idea seems right, though I would implement it as follows:
import csv

events = {}  # we're going to keep track of the events we read in
with open('path/to/input', newline='') as infile:
    reader = csv.reader(infile, skipinitialspace=True)  # the sample has a space after each comma
    next(reader)  # skip the header row (assuming the file starts with one, as in the question)
    for event, _att, val in reader:
        events.setdefault(int(event), []).append(val)  # track all the values for this event

maxAtts = max(len(v) for v in events.values())  # the maximum number of attributes for any event
with open('path/to/output', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["event_id"] + list(range(1, maxAtts + 1)))  # write out the header row
    for k in sorted(events):  # let's look at the events in sorted order
        # write out the event id, all the values for that event, and pad with "null" for any attributes without values
        writer.writerow([k] + events[k] + ['null'] * (maxAtts - len(events[k])))
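One caveat: the code above appends values in the order they are read, so it assumes each event's attributes are numbered contiguously from 1. If an event could have, say, attributes 1 and 3 but no 2, the values would land in the wrong columns. A minimal sketch of a variant that places each value by its attribute_id instead follows; pivot_rows and the in-memory sample are hypothetical names and data of mine, not something from your file, and I'm assuming both ids are integers:

```python
import csv
import io

def pivot_rows(rows, missing="null"):
    """Pivot (event_id, attribute_id, value) rows into one row per event."""
    events = {}            # event_id -> {attribute_id: value}
    attribute_ids = set()  # every attribute_id seen in any event
    for event_id, attribute_id, value in rows:
        event_id, attribute_id = int(event_id), int(attribute_id)
        events.setdefault(event_id, {})[attribute_id] = value
        attribute_ids.add(attribute_id)
    columns = sorted(attribute_ids)
    header = ["event_id"] + columns
    # look each attribute up by id, falling back to the placeholder
    body = [[e] + [events[e].get(a, missing) for a in columns]
            for e in sorted(events)]
    return header, body

# Hypothetical sample: event 1 has no attribute 2 or 4
sample = """event_id, attribute_id, value
1, 1, a
1, 3, c
2, 1, a
2, 2, b
2, 3, c
2, 4, d
"""
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
next(reader)  # skip the header row
header, body = pivot_rows(reader)
```

The returned header and body can be passed to csv.writer in the same way as in the code above. Keying by attribute_id costs a dictionary per event, but the column position no longer depends on row order or on attributes being contiguous.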