Decimate data in Python


I put "decimate" in the title, but I am not sure that is exactly what I mean. Here is the full description of the issue. I have a dataframe that contains data from several subjects. What I want to do is analyze data that is X days apart. The idea is that I only want to consider data that was collected every, say, 4th day from a subject. The catch is that the data were collected in parallel for the subjects, so I can't just take every 4th row across subjects, but rather need to do the decimation/downsampling/whatever per subject. The two key columns in the dataframe are "subject" and "session_timestamp". In the latter, the date and time are formatted as in this example: 2017-11-10 16:30:47. Is there a good way to accomplish this in Python?
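
In case it matters: since my real "session_timestamp" column holds strings like the example above, I assume a first step would be to convert it to a per-subject day offset. Here is a rough, untested sketch of what I mean ("day" is just a column name I made up; df is my real dataframe):

import pandas as pd

# convert the timestamp strings to datetimes
df["session_timestamp"] = pd.to_datetime(df["session_timestamp"])
# each subject's first session
first_session = df.groupby("subject")["session_timestamp"].transform("min")
# "day" (made-up name): whole days since each subject's first session
df["day"] = (df["session_timestamp"].dt.normalize()
             - first_session.dt.normalize()).dt.days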

Edit: The first commenters asked for a more concrete example of the dataframe with some example data. Here is a toy dataframe that is similar to what I have and should be easy to work with. The code below creates a dataframe with 4 columns: subjectID, date, score1 and score2. Note that a subject can have more than one entry for a given date (basically, these are neural recordings; each row of the dataframe represents one neuron, and we can record more than one neuron per subject).

import pandas as pd
import numpy as np
ab = pd.DataFrame()
ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4
ab["date"] = np.random.randint(20, size=200)#random list of "dates" from 0 to 19
ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject
ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement

What I want to do is filter for the data (score1 and score2) that were collected at least 4 days apart for each subject. The code could be extremely simple and take the first day a subject has an entry and every 4th day after that. But a better solution would take the first day, then the next day that is more than 3 days later, then the next day that is more than 3 days after that, and so on (not every subject has daily samples, so a rigid "every 4th day" rule would not be so elegant). All data collected on the allowed days should be included; for example, all data with day code 0 (if that is the subject's first day) should be included.
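
To make the rule concrete, here is a rough sketch of the greedy selection I have in mind, applied to the toy dataframe above (keep_spaced_days is just a name I made up, and this is untested pseudo-ish code rather than a working solution):

def keep_spaced_days(days, min_gap=4):
    # keep a day only if it is at least min_gap days after the last kept day
    kept = []
    for d in sorted(days.unique()):
        if not kept or d - kept[-1] >= min_gap:
            kept.append(d)
    return set(kept)

allowed = ab.groupby("subjectID")["date"].apply(keep_spaced_days)  # allowed days per subject
mask = [row.date in allowed[row.subjectID] for row in ab.itertuples()]
filtered = ab[mask]  # all rows from the allowed days are kept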


Solution

  • First create a dataframe (with random data):

    import pandas as pd
    import numpy as np
    from datetime import datetime, timedelta
    
    ab = pd.DataFrame()
    ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4
    ab["day_number"] = np.random.randint(50, size=200)#random list of "dates" from 0 to 50
    ab['real_date'] = ab.day_number.apply(lambda d: datetime(2018, 1, 1) + timedelta(days=d)) #to simulate real dates
    ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject
    ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement
    
    min_day = ab.real_date.min()
    # aggregate because some subjects have more than 1 score each day;
    # keep day_number with 'first' (not sum), since all rows of a given
    # date share it and it is needed later to rebuild real_date
    ab = ab.groupby(['subjectID', 'real_date']).agg(
        {'day_number': 'first', 'score1': 'sum', 'score2': 'sum'})
    
    print(ab.head(10))
    
                          day_number  score1  score2
    subjectID real_date                             
    0         2018-01-01           0     306     273
              2018-01-04           3      32      60
              2018-01-05           4      61     135
              2018-01-08           7     477     393
              2018-01-09           8      22     341
              2018-01-10           9     137      30
              2018-01-11          10     281     674
              2018-01-14          13     183     396
              2018-01-15          14      41     337
              2018-01-16          15      83      50
    

    Then fill the days that have no data with the data of the next existing day:

    df = ab.reset_index(level='subjectID').groupby('subjectID').resample('D').mean()  # complete missing dates with NaN
    df = df.drop(columns='subjectID', errors='ignore')  # older pandas keeps the grouping column
    df = df.groupby(level='subjectID').bfill()  # fill each NaN with the next non-NaN value, per subject
    df = df.apply(pd.to_numeric, downcast='integer')  # just to have ints, easier to read
    
    print(df.head(10))
    
                          day_number  score1  score2
    subjectID real_date                             
    0         2018-01-01           0     306     273
              2018-01-02           3      32      60
              2018-01-03           3      32      60
              2018-01-04           3      32      60
              2018-01-05           4      61     135
              2018-01-06           7     477     393
              2018-01-07           7     477     393
              2018-01-08           7     477     393
              2018-01-09           8      22     341
              2018-01-10           9     137      30
    

    Next, resample (group) into periods of 4 days:

    res = df.reset_index(level='subjectID').groupby('subjectID').resample('4D').first()  # group into 4-day periods and keep only the first value
    res = res.drop(columns='subjectID', errors='ignore')
    print(res.head(10))
    
                          day_number  score1  score2
    subjectID real_date                             
    0         2018-01-01           0     306     273
              2018-01-05           4      61     135
              2018-01-09           8      22     341
              2018-01-13          13     183     396
              2018-01-17          18      91      46
              2018-01-21          20      76     333
              2018-01-25          24     131     212
              2018-01-29          29      92      81
              2018-02-02          32     172      55
              2018-02-06          36      98     246
    

    Finally, reset the indexes and handle the case where there are gaps of more than 4 days with no data:

    res = res.reset_index('real_date', drop=True)  # this real_date is only a bin label, not an observation date
    res['real_date'] = res.day_number.apply(lambda d: min_day + timedelta(days=int(d)))  # true date, rebuilt from day_number
    res = res.drop(columns='day_number')
    res = res.set_index('real_date', append=True)
    res = res.groupby(level=['subjectID', 'real_date']).first()  # merges bins that picked up the same observation after a gap of more than 4 days
    
    print(res.head(10))
    
                          score1  score2
    subjectID real_date                 
    0         2018-01-01     306     273
              2018-01-05      61     135
              2018-01-09      22     341
              2018-01-14     183     396
              2018-01-19      91      46
              2018-01-21      76     333
              2018-01-25     131     212
              2018-01-30      92      81
              2018-02-02     172      55
              2018-02-06      98     246
    

    It is a little complicated, but I think this is the best way to do it. I have no idea about its efficiency, though it doesn't seem too bad.
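
    For convenience, here is the same pipeline consolidated into one function. This is just my sketch of the steps above glued together (downsample_every_n_days is a made-up name, and it assumes the exact column names and imports used here):

    def downsample_every_n_days(df, n=4):
        # sketch of the steps above; assumes columns 'subjectID',
        # 'day_number', 'real_date', 'score1' and 'score2'
        min_day = df.real_date.min()
        out = df.groupby(['subjectID', 'real_date']).agg(
            {'day_number': 'first', 'score1': 'sum', 'score2': 'sum'})
        out = out.reset_index(level='subjectID').groupby('subjectID').resample('D').mean()
        out = out.drop(columns='subjectID', errors='ignore')
        out = out.groupby(level='subjectID').bfill()
        out = out.reset_index(level='subjectID').groupby('subjectID').resample('%dD' % n).first()
        out = out.drop(columns='subjectID', errors='ignore')
        out = out.reset_index('real_date', drop=True)
        out['real_date'] = out.day_number.apply(lambda d: min_day + timedelta(days=int(d)))
        out = out.drop(columns='day_number').set_index('real_date', append=True)
        return out.groupby(level=['subjectID', 'real_date']).first()

    # usage: ab is already aggregated at this point, but re-aggregating is harmless
    res = downsample_every_n_days(ab.reset_index(), n=4)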