Search code examples
pythonpandaslistlist-comprehensiondictionary-comprehension

Why is this dictionary comprehension so slow? Please suggest way to speed it up


Hi Please help me either: speed up this dictionary compression; offer a better way to do it or gain a higher understanding of why it is so slow internally (like for example is calculation slowing down as the dictionary grows in memory size). I'm sure there must be a quicker way without learning some C!

classes = {i : [1 if x in df['column'].str.split("|")[i] else 0 for x in df['column']] for i in df.index}

with the output: {1:[0,1,0...0],......, 4000:[0,1,1...0]}

from a df like this:

data_ = {'drugbank_id': ['DB06605', 'DB06606', 'DB06607', 'DB06608', 'DB06609'], 
         'drug-interactions': ['DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06605|DB06695|DB01254|DB01609|DB01586|DB0212', 
                               'DB06606|DB06607|DB06608|DB06609', 
                               'DB06606|DB06607', 
                               'DB06608']
                             }

pd.DataFrame(data = data_  , index=range(0,5) )

I am preforming it in a df with 4000 rows, the column df['column'] contains a string of Ids separated by |. The number of IDs in each row that needs splitting varies from 1 to 1000, however, this is done for all 4000 indexes. I tested it on the head of the df and it seemed quick enough, now the comprehension has been running for 24hrs. So maybe it is just the sheer size of the job, but feel like I could speed it up and at this point I want to stop it an re-engineer, however, I am scared that will set me back without much increase in speed, so before I do that wanted to get some thoughts, ideas and suggestions.

Beyond 4000x4000 size I suspect that using the Series and Index Objects is the another problem and that I would be better off using lists, but given the size of the task I am not sure how much speed that will gain and maybe I am better off using some other method such as pd.apply(df, f(write line by line to json)). I am not sure - any help and education appreciated, thanks.


Solution

  • Here is one approach:

    import pandas as pd
    
    # create data frame
    df = pd.DataFrame({'idx': [1, 2, 3, 4], 'col': ['1|2', '1|2|3', '2|3', '1|4']})
    
    # split on '|' to convert string to list
    df['col'] = df['col'].str.split('|')
    
    # explode to get one row for each list element
    df = df.explode('col')
    
    # create dummy ID (this will become True in the final result)
    df['dummy'] = 1
    
    # use pivot to create dense matrix
    df = (df.pivot(index='idx', columns='col', values='dummy')
            .fillna(0)
            .astype(int))
    
    # convert each row to a list
    df['test'] = df.apply(lambda x: x.to_list(), axis=1)
    print(df)
    
    col  1  2  3  4          test
    idx                          
    1    1  1  0  0  [1, 1, 0, 0]
    2    1  1  1  0  [1, 1, 1, 0]
    3    0  1  1  0  [0, 1, 1, 0]
    4    1  0  0  1  [1, 0, 0, 1]