Search code examples
pythonpandasgroupingdata-analysis

Panda groupby on many columns with agg()


I've been asked to analize the DB from a medical record app. So a bunch of record would look like:

Table image

So i have to resume more than 3 million records from 2011 to 2014 by PX, i know they repeat since thats the ID for each patient, so a patient should had many visitis to the doctor. How could i group them or resume them by patient.


Solution

  • I don't know what you mean by "resume", but it looks like all you want to do is only to sort and display data in a nicer way. You can visually group (=order) the records "px- and fecha-wise" like this:

    df.set_index(['px', 'fecha'], inplace=True)
    

    EDIT:

    When you perform a grouping of the data based on some common property, you have to decide, what kind of aggregation are you going to use on the data in other columns. Simply speaking, once you perform a groupby, you only have one empty field for in each remaining column for each "pacient_id" left, so you must use some aggregation function (e.g. sum, mean, min, avg, count,...) that will return desired representable value of the grouped data.

    It is hard to work on your data since they are locked in an image, and it is impossible to tell what you mean by "Age", since this column is not visible, but I hope you can achieve what you want by looking at the following example with dummy data:

    import pandas as pd
    import numpy as np
    from datetime import datetime
    import random
    from datetime import timedelta
    
    def random_datetime_list_generator(start_date, end_date,n):
        return ((start_date + timedelta(seconds=random.randint(0, int((end_date - start_date).total_seconds())))) for i in xrange(n))
    
    #create random dataframe with 4 sample columns and 50000 rows
    rows = 50000
    pacient_id = np.random.randint(100,200,rows)
    dates = random_datetime_list_generator(pd.to_datetime("2011-01-01"),pd.to_datetime("2014-12-31"),rows)
    age = np.random.randint(10,80,rows)
    bill = np.random.randint(1,1000,rows)
    
    df = pd.DataFrame(columns=["pacient_id","visited","age","bill"],data=zip(pacient_id,dates,age,bill))
    
    print df.head()
    
    # 1.Only perform statictis of the last visit of each pacient only
    stats = df.groupby("pacient_id",as_index=False)["visited"].max()
    stats.columns = ["pacient_id","last_visited"]
    print stats
    
    # 2. Perform a bit more complex statistics on pacient by specifying desired aggregate function for each column
    custom_aggregation = {'visited':{"first visit": 'min',"last visit": "max"}, 'bill':{"average bill" : "mean"}, 'age': 'mean'}
    
    #perform a group by with custom aggregation and renaming of functions
    stats = df.groupby("pacient_id").agg(custom_aggregation)
    #round floats
    stats = stats.round(1)
    print stats
    

    Original dummy dataframe looks like so:

       pacient_id             visited  age  bill
    0         150 2012-12-24 21:34:17   20   188
    1         155 2012-10-26 00:34:45   17   672
    2         116 2011-11-28 13:15:18   33   360
    3         126 2011-06-03 17:36:10   58   167
    4         165 2013-07-15 15:39:31   68   815
    

    First aggregate would look like this:

        pacient_id        last_visited
    0          100 2014-12-29 00:01:11
    1          101 2014-12-22 06:00:48
    2          102 2014-12-26 11:51:41
    3          103 2014-12-29 15:01:32
    4          104 2014-12-18 15:29:28
    5          105 2014-12-30 11:08:29
    

    Second, complex aggregation would look like this:

                           visited                       age         bill
                       first visit          last visit  mean average bill
    pacient_id                                                           
    100        2011-01-06 06:11:33 2014-12-29 00:01:11  45.2        507.9
    101        2011-01-01 20:44:55 2014-12-22 06:00:48  44.0        503.8
    102        2011-01-02 17:42:59 2014-12-26 11:51:41  43.2        498.0
    103        2011-01-01 03:07:41 2014-12-29 15:01:32  43.5        495.1
    104        2011-01-07 18:58:11 2014-12-18 15:29:28  45.9        501.7
    105        2011-01-01 03:43:12 2014-12-30 11:08:29  44.3        513.0
    

    This example should get you going. Additionaly, there is a nice SO question about pandas groupby aggregation which may teach you a lot on this topics.