Decimate data in Python


I put "decimate" in the title, but I am not sure that is exactly what I mean. Here is the full description of the issue. I have a dataframe that contains data from several subjects. What I want to do is analyze data that is X days apart. The idea is that I only want to consider data that was collected every, say, 4th day from a subject. The catch is that the data were collected in parallel for the subjects, so I can't just take every 4th row across subjects, but rather need to do the decimation/downsampling/whatever per subject. The two key columns in the dataframe are "subject" and "session_timestamp". In the latter, the date and time are formatted as in this example: 2017-11-10 16:30:47. Is there a good way to accomplish this in Python?
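
In case it matters: since my real "session_timestamp" column holds strings like the example above, I assume a first step would be to convert it to a per-subject day offset. Here is a rough, untested sketch of what I mean ("day" is just a column name I made up; df is my real dataframe):

import pandas as pd

# convert the timestamp strings to datetimes
df["session_timestamp"] = pd.to_datetime(df["session_timestamp"])
# each subject's first session
first_session = df.groupby("subject")["session_timestamp"].transform("min")
# "day" (made-up name): whole days since each subject's first session
df["day"] = (df["session_timestamp"].dt.normalize()
             - first_session.dt.normalize()).dt.days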

Edit: The first commenters asked for a more concrete example of the dataframe with some example data. Here is a toy dataframe that is similar to what I have and should be easy to work with. The code below creates a dataframe with 4 columns: subjectID, date, score1 and score2. Note that a subject can have more than one entry for a given date (basically, these are neural recordings; each row of the dataframe represents one neuron, and we can record more than one neuron per subject).

import pandas as pd
import numpy as np
ab = pd.DataFrame()
ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4
ab["date"] = np.random.randint(20, size=200)#random list of "dates" from 0 to 19
ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject
ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement

What I want to do is filter for the data (score1 and score2) that were collected at least 4 days apart for each subject. The code could be extremely simple and take the first day a subject has an entry and every 4th day after that. But a better solution would take the first day, then the next day that is more than 3 days later, then the next day that is more than 3 days after that, and so on (not every subject has daily samples, so a rigid "every 4th day" rule would not be so elegant). All data collected on the allowed days should be included; for example, all data with day code 0 (if that is the subject's first day) should be included.
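
To make the rule concrete, here is a rough sketch of the greedy selection I have in mind, applied to the toy dataframe above (keep_spaced_days is just a name I made up, and this is untested pseudo-ish code rather than a working solution):

def keep_spaced_days(days, min_gap=4):
    # keep a day only if it is at least min_gap days after the last kept day
    kept = []
    for d in sorted(days.unique()):
        if not kept or d - kept[-1] >= min_gap:
            kept.append(d)
    return set(kept)

allowed = ab.groupby("subjectID")["date"].apply(keep_spaced_days)  # allowed days per subject
mask = [row.date in allowed[row.subjectID] for row in ab.itertuples()]
filtered = ab[mask]  # all rows from the allowed days are kept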


Solution

  • First create a dataframe (with random data):

    import pandas as pd
    import numpy as np
    from datetime import datetime, timedelta
    
    ab = pd.DataFrame()
    ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4
    ab["day_number"] = np.random.randint(50, size=200)#random list of "dates" from 0 to 50
    ab['real_date'] = ab.day_number.apply(lambda d: datetime(2018, 1, 1) + timedelta(days=d)) #to simulate real dates
    ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject
    ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement
    
    min_day = ab.real_date.min()
    # aggregate because some subjects have more than 1 score each day;
    # keep day_number with 'first' (not sum), since all rows of a given
    # date share it and it is needed later to rebuild real_date
    ab = ab.groupby(['subjectID', 'real_date']).agg(
        {'day_number': 'first', 'score1': 'sum', 'score2': 'sum'})
    
    print(ab.head(10))
    
                          day_number  score1  score2
    subjectID real_date                             
    0         2018-01-01           0     306     273
              2018-01-04           3      32      60
              2018-01-05           4      61     135
              2018-01-08           7     477     393
              2018-01-09           8      22     341
              2018-01-10           9     137      30
              2018-01-11          10     281     674
              2018-01-14          13     183     396
              2018-01-15          14      41     337
              2018-01-16          15      83      50
    

    Then fill the days that have no data with the data of the next existing day:

    df = ab.reset_index(level='subjectID').groupby('subjectID').resample('D').mean()  # complete missing dates with NaN
    df = df.drop(columns='subjectID', errors='ignore')  # older pandas keeps the grouping column
    df = df.groupby(level='subjectID').bfill()  # fill each NaN with the next non-NaN value, per subject
    df = df.apply(pd.to_numeric, downcast='integer')  # just to have ints, easier to read
    
    print(df.head(10))
    
                          day_number  score1  score2
    subjectID real_date                             
    0         2018-01-01           0     306     273
              2018-01-02           3      32      60
              2018-01-03           3      32      60
              2018-01-04           3      32      60
              2018-01-05           4      61     135
              2018-01-06           7     477     393
              2018-01-07           7     477     393
              2018-01-08           7     477     393
              2018-01-09           8      22     341
              2018-01-10           9     137      30
    

    Next, resample (group) into periods of 4 days:

    res = df.reset_index(level='subjectID').groupby('subjectID').resample('4D').first()  # group into 4-day periods and keep only the first value
    res = res.drop(columns='subjectID', errors='ignore')
    print(res.head(10))
    
                          day_number  score1  score2
    subjectID real_date                             
    0         2018-01-01           0     306     273
              2018-01-05           4      61     135
              2018-01-09           8      22     341
              2018-01-13          13     183     396
              2018-01-17          18      91      46
              2018-01-21          20      76     333
              2018-01-25          24     131     212
              2018-01-29          29      92      81
              2018-02-02          32     172      55
              2018-02-06          36      98     246
    

    Finally, reset the indexes and handle the case where there are gaps of more than 4 days with no data:

    res = res.reset_index('real_date', drop=True)  # this real_date is only a bin label, not an observation date
    res['real_date'] = res.day_number.apply(lambda d: min_day + timedelta(days=int(d)))  # true date, rebuilt from day_number
    res = res.drop(columns='day_number')
    res = res.set_index('real_date', append=True)
    res = res.groupby(level=['subjectID', 'real_date']).first()  # merges bins that picked up the same observation after a gap of more than 4 days
    
    print(res.head(10))
    
                          score1  score2
    subjectID real_date                 
    0         2018-01-01     306     273
              2018-01-05      61     135
              2018-01-09      22     341
              2018-01-14     183     396
              2018-01-19      91      46
              2018-01-21      76     333
              2018-01-25     131     212
              2018-01-30      92      81
              2018-02-02     172      55
              2018-02-06      98     246
    

    It is a little complicated, but I think this is the best way to do it. I have no idea about its efficiency, though it doesn't seem too bad.
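
    For convenience, here is the same pipeline consolidated into one function. This is just my sketch of the steps above glued together (downsample_every_n_days is a made-up name, and it assumes the exact column names and imports used here):

    def downsample_every_n_days(df, n=4):
        # sketch of the steps above; assumes columns 'subjectID',
        # 'day_number', 'real_date', 'score1' and 'score2'
        min_day = df.real_date.min()
        out = df.groupby(['subjectID', 'real_date']).agg(
            {'day_number': 'first', 'score1': 'sum', 'score2': 'sum'})
        out = out.reset_index(level='subjectID').groupby('subjectID').resample('D').mean()
        out = out.drop(columns='subjectID', errors='ignore')
        out = out.groupby(level='subjectID').bfill()
        out = out.reset_index(level='subjectID').groupby('subjectID').resample('%dD' % n).first()
        out = out.drop(columns='subjectID', errors='ignore')
        out = out.reset_index('real_date', drop=True)
        out['real_date'] = out.day_number.apply(lambda d: min_day + timedelta(days=int(d)))
        out = out.drop(columns='day_number').set_index('real_date', append=True)
        return out.groupby(level=['subjectID', 'real_date']).first()

    # usage: ab is already aggregated at this point, but re-aggregating is harmless
    res = downsample_every_n_days(ab.reset_index(), n=4)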