I have a huge dump of records in a file.
Filename Col1 Col2 Col3 Col4
SE01_inf name [] NA []
SE01_loc NA loc NA []
SE01_id NA [] 123 []
SE01_1_inf name1 [] NA []
SE01_1_loc NA loc NA []
I want a consolidated output like below
Filename Col1 Col2 Col3 Col4
SE01 name loc 123 []
SE01_1 name1 loc NA []
I do not want to do it in excel as the data is huge and excel gets stuck the moment I write a function. Can I achieve this using python, I am not very clear on how to start.
How huge is the data? If memory isn't a problem and you have the data in a list this works for you example input:
input = [['SE01_inf', 'name', [], 'NA', []],\
['SE01_loc', 'NA', 'loc', 'NA', []],\
['SE01_id', 'NA', [], '123', []],\
['SE01_1_inf', 'name1', [], 'NA', []],\
['SE01_1_loc', 'NA', 'loc', 'NA', []]]
output = {}
for row in input:
id = row[0][:row[0].rfind('_')]
if id not in output:
output[id] = [id] + row[1:]
else:
output[id] = [new if old in ('NA', []) else old for new, old in zip(row, output[id])]
print output.values()
Produces the output:
[['SE01', 'name', 'loc', '123', []], ['SE01_1', 'name1', 'loc', 'NA', []]]