Search code examples
pythonpandasreduction

Pandas data reduction and merging


I am working with a Pandas (version 0.17.1) DataFrame that looks like this:

                         time   type   module     msg_type         content
36636 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property A' = some_value_1
36637 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property B' = some_value_2
36638 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property C' = some_value_3
36639 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property D' = some_value_4
36715 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a
36716 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 2' = some_value_b
36717 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 3' = some_value_c
36718 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 4' = some_value_d
36719 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 5' = some_value_e
36720 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 6' = some_value_f
36721 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 7' = some_value_g
36722 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 8' = some_value_h
36723 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 9' = some_value_i
36724 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j
36725 2016-08-25 17:59:50.964  ERROR   MOD_2_NAME  STATUS  Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!
36726 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k
36727 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 13' = some_value_l
36785 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property A' = some_value_1
36786 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property B' = some_value_2
36787 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property C' = some_value_3
36788 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  Received Status Monitoring from MODULE_1 'Property D' = some_value_4
36827 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 1' = some_value_a
36828 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 2' = some_value_b
36829 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 3' = some_value_c
36830 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 4' = some_value_d
36831 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 5' = some_value_e
36832 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 6' = some_value_f
36833 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 7' = some_value_g
36834 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 8' = some_value_h
36835 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 9' = some_value_i
36836 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 10' = some_value_j
36837 2016-08-25 19:01:50.964  ERROR   MOD_2_NAME  STATUS  Didn't receive Status Monitoring 'Parameter 11' from MODULE_2!
36838 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 12' = some_value_k
36839 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  Received Status Monitoring from MODULE_2 'Parameter 13' = some_value_l

(The frame has already been reduced to remove rows that are not of interest. That is why the index column has missing numbers)

As you can see there are multiple parameters read from a device at the same time. Each reading is a separate row. I would like to do some "reduction" and "compression" so that each reading is only a single row. I would also like the content column to be a dictionary so I could easily lookup a particular item of interest. So the result would look like this:

                         time   type   module     msg_type         content
36636 2016-08-25 17:59:50.051   INFO  MOD_1_NAME  STATUS  {'Property A' = 'some_value_1', 'Property B' = 'some_value_2', 'Property C' = 'some_value_3', 'Property D' = 'some_value_4'}
36715 2016-08-25 17:59:50.964   INFO   MOD_2_NAME  STATUS  {'Parameter 1' = 'some_value_a', 'Parameter 2' = 'some_value_b', 'Parameter 3' = 'some_value_c', 'Parameter 4' = 'some_value_d', 'Parameter 5' = 'some_value_e', 'Parameter 6' = 'some_value_f', 'Parameter 7' = 'some_value_g','Parameter 8' = some_value_h, 'Parameter 9' = 'some_value_i', 'Parameter 10' = 'some_value_j', 'Parameter 11' = '', 'Parameter 12' = 'some_value_k', 'Parameter 13' = 'some_value_l'}
36785 2016-08-25 18:59:50.051   INFO  MOD_1_NAME  STATUS  {'Property A' = 'some_value_1', 'Property B' = 'some_value_2', 'Property C' = 'some_value_3', 'Property D' = 'some_value_4'}
36827 2016-08-25 19:01:50.964   INFO   MOD_2_NAME  STATUS  {'Parameter 1' = 'some_value_a', 'Parameter 2' = 'some_value_b', 'Parameter 3' = 'some_value_c', 'Parameter 4' = 'some_value_d', 'Parameter 5' = 'some_value_e', 'Parameter 6' = 'some_value_f', 'Parameter 7' = 'some_value_g','Parameter 8' = some_value_h, 'Parameter 9' = 'some_value_i', 'Parameter 10' = 'some_value_j', 'Parameter 11' = '', 'Parameter 12' = 'some_value_k', 'Parameter 13' = 'some_value_l'}

So basically I would like for all rows with the same value for their time and module columns to be "merged" together, with their contents columns parsed into a dictionary. (There could also be some "missing" or "empty" readings.) I don't want to filter or remove data, just reduce and summarize it.

I am guessing that I need to you some combination of groupby(), transform(), and apply() but I am not sure where to even begin.

Part of my difficulty is that I cannot inspect the result of groupby() to see if it is doing what I want.

g1 = df.groupby(['module', 'time'])

g1 does not show up in the Spyder variable explorer. printing does not show anything. I cannot access attribute index or call info() on g1. But I am having doubts that groupby() is even worthwhile here... I don't want to eliminate anything.

Been doing some searching to find an example but keep getting what seems like false positives. Any help to get started would be appreciated.


Solution

  • Define a function and use groupby() and then apply():

    In [235]: def create_data_dict(rows):
         ...:     return {k:v for k,v in re.findall(r"'([^']*)' = ([^ ]*)", ' '.join(rows.content.astype(str)))}
         ...: 
    
    In [236]: df[df['type'] != 'ERROR'].groupby(['time', 'module', 'msg_type']).apply(create_data_dict).to_frame(name = 'content').reset_index()
    Out[236]: 
                          time      module msg_type                                                                                                                                                                                                                                                                                                                                                                                                          content
    0  2016-08-25 17:59:50.051  MOD_1_NAME   STATUS                                                                                                                                                                                                                                                                                 {u'Property A': u'some_value_1', u'Property C': u'some_value_3', u'Property B': u'some_value_2', u'Property D': u'some_value_4'}
    1  2016-08-25 17:59:50.964  MOD_2_NAME   STATUS  {u'Parameter 6': u'some_value_f', u'Parameter 7': u'some_value_g', u'Parameter 4': u'some_value_d', u'Parameter 5': u'some_value_e', u'Parameter 2': u'some_value_b', u'Parameter 3': u'some_value_c', u'Parameter 1': u'some_value_a', u'Parameter 8': u'some_value_h', u'Parameter 9': u'some_value_i', u'Parameter 10': u'some_value_j', u'Parameter 12': u'some_value_k', u'Parameter 13': u'some_value_l'}
    2  2016-08-25 18:59:50.051  MOD_1_NAME   STATUS                                                                                                                                                                                                                                                                                 {u'Property A': u'some_value_1', u'Property C': u'some_value_3', u'Property B': u'some_value_2', u'Property D': u'some_value_4'}
    3  2016-08-25 19:01:50.964  MOD_2_NAME   STATUS  {u'Parameter 6': u'some_value_f', u'Parameter 7': u'some_value_g', u'Parameter 4': u'some_value_d', u'Parameter 5': u'some_value_e', u'Parameter 2': u'some_value_b', u'Parameter 3': u'some_value_c', u'Parameter 1': u'some_value_a', u'Parameter 8': u'some_value_h', u'Parameter 9': u'some_value_i', u'Parameter 10': u'some_value_j', u'Parameter 12': u'some_value_k', u'Parameter 13': u'some_value_l'}