Search code examples
pythonconsolidation

grouping by rows in python


I have a huge dump of records in a file.

Filename     Col1   Col2   Col3  Col4
SE01_inf     name    []     NA    []
SE01_loc      NA    loc     NA    []
SE01_id       NA     []    123    []
SE01_1_inf   name1   []     NA    []
SE01_1_loc    NA     loc    NA    []

I want a consolidated output like below

Filename     Col1   Col2   Col3  Col4
SE01         name   loc    123    []
SE01_1       name1  loc     NA    []

I do not want to do it in excel as the data is huge and excel gets stuck the moment I write a function. Can I achieve this using python, I am not very clear on how to start.


Solution

  • How huge is the data? If memory isn't a problem and you have the data in a list this works for you example input:

    input = [['SE01_inf', 'name', [], 'NA', []],\
             ['SE01_loc', 'NA', 'loc', 'NA', []],\
             ['SE01_id', 'NA', [], '123', []],\
             ['SE01_1_inf', 'name1', [], 'NA', []],\
             ['SE01_1_loc', 'NA', 'loc', 'NA', []]]
    
    output = {}
    for row in input:
        id = row[0][:row[0].rfind('_')]
        if id not in output:
            output[id] = [id] + row[1:]
        else:
            output[id] = [new if old in ('NA', []) else old for new, old in zip(row, output[id])]
    
    print output.values()
    

    Produces the output:

    [['SE01', 'name', 'loc', '123', []], ['SE01_1', 'name1', 'loc', 'NA', []]]