Search code examples
pythonpandasflattenjson-flattener

Pandas more time efficient flatten


In this question I got help with flatten of each row in a column of a dataframe.

[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}]
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}]
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]

The code is

    df = pd.DataFrame()

for d in data:
    df_tmp = pd.json_normalize(d)
    row = pd.DataFrame(df_tmp.to_numpy().flatten()).T.dropna(axis=1)
    row.columns = df_tmp.columns
    df = pd.concat([df, row])

print(df.reset_index(drop=True))

The code turns data into

  first_open_time.int_value first_open_time.set_timestamp_micros User_dedication.string_value  ... Paying_user.set_timestamp_micros ga_session_id.int_value ga_session_id.set_timestamp_micros
0             1652796000000                     1652792823456000                            1  ...                 1653136561498000              1653136552                   1653136552555000
1             1652796000000                     1652792823456000                            1  ...                 1653136561498000              1653136552                   1653136552555000
2             1652796000000                     1652792823456000                            1  ...                 1653136561498000              1653136552                   1653136552555000
3             1653195600000                     1653193960416000                          NaN  ...                              NaN                     NaN                                NaN
4             1653102000000                     1653098744032000                            1  ...                              NaN              1653165977                   1653165977727000

This code works good but takes enormous amount of time. To go over 1 column of 1 day of data (660k rows) it took about 10 hours. And there are 2 such columns. Means python should work 24/7.

Is there a way to rewrite it to make more time efficient?

P.S. Also, to make that column I used this code:

    def transformation_user_properties(row):
                return [{elem['key']: elem['value']} for elem in row['user_properties']]
    df['user_properties'] = df.apply(transformation_user_properties, axis=1)

Which transform colums with rows of this form

[{'a': 'b', 'c': {'c1': 'v1', 'c2': 'v2'}}, {'a': 'b1', 'c': {'c1': 'x1', 'c2': 'x2'}}, {'a': 'b2', 'c': {'c1': 'n1', 'c2': 'n2'}}]

into this form

[{'b': {'c1': 'v1', 'c2': 'v2'}}, {'b1': {'c1': 'x1', 'c2': 'x2'}}, {'b2': {'c1': 'n1', 'c2': 'n2'}}]

Solution

  • You can use nested list and dict comprehension:

    data = [[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
    [{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}],
    [{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
    [{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}],
    [{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
    ]
    

    L = [{f'{k}.{k1}': v1 for y in x 
                          for k, v in y.items()
                          for k1, v1 in v.items()} 
                          for x in data] 
    
    df = pd.DataFrame(L)  
    

    print (df)
    
      first_open_time.int_value first_open_time.set_timestamp_micros  \
    0             1652796000000                     1652792823456000   
    1             1652796000000                     1652792823456000   
    2             1652796000000                     1652792823456000   
    3             1653195600000                     1653193960416000   
    4             1653102000000                     1653098744032000   
    
      User_dedication.string_value User_dedication.set_timestamp_micros  \
    0                            1                     1653137417352000   
    1                            1                     1653137166688000   
    2                            1                     1653137288213000   
    3                          NaN                                  NaN   
    4                            1                     1653109414734000   
    
      User_activity.string_value User_activity.set_timestamp_micros  \
    0                          1                   1653136561498000   
    1                          1                   1653136561498000   
    2                          1                   1653136561498000   
    3                        NaN                                NaN   
    4                       1_10                   1653109414730000   
    
      Minutes_in_app.string_value Minutes_in_app.set_timestamp_micros  \
    0                      60_300                    1653137417352000   
    1                      60_300                    1653137166688000   
    2                      60_300                    1653137288213000   
    3                         NaN                                 NaN   
    4                        1_10                    1653109414735000   
    
      ga_session_number.int_value ga_session_number.set_timestamp_micros  \
    0                          10                       1653136552555000   
    1                          10                       1653136552555000   
    2                          10                       1653136552555000   
    3                         NaN                                    NaN   
    4                           3                       1653165977727000   
    
      Paying_user.string_value Paying_user.set_timestamp_micros  \
    0                        0                 1653136561498000   
    1                        0                 1653136561498000   
    2                        0                 1653136561498000   
    3                      NaN                              NaN   
    4                      NaN                              NaN   
    
      ga_session_id.int_value ga_session_id.set_timestamp_micros  
    0              1653136552                   1653136552555000  
    1              1653136552                   1653136552555000  
    2              1653136552                   1653136552555000  
    3                     NaN                                NaN  
    4              1653165977                   1653165977727000  
    

    Performance for 5k values:

    data = [[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
    [{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}],
    [{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
    [{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}],
    [{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
    ]*1000
    

    In [92]: %%timeit
        ...: df = pd.DataFrame()
        ...: 
        ...: for d in data:
        ...:     df_tmp = pd.json_normalize(d)
        ...:     row = pd.DataFrame(df_tmp.to_numpy().flatten()).T.dropna(axis=1)
        ...:     row.columns = df_tmp.columns
        ...:     df = pd.concat([df, row])
        ...: 
        ...: df.reset_index(drop=True)
        ...: 
        ...: 
    36.9 s ± 1.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [93]: %%timeit
        ...: L = [{f'{k}.{k1}': v1  
        ...:        for y in x 
        ...:        for k, v in y.items()
        ...:        for k1, v1 in v.items()} for x in data] 
        ...: 
        ...: pd.DataFrame(L)
        ...: 
        ...: 
    
    40.8 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)