I put decimate in the title, but I am not sure that is exactly what I mean. Here is the full description of the issue. I have a dataframe that contains data from several subjects, and I want to analyze only data points that are at least X days apart. The idea is that I only want to consider data that was collected every, say, 4th day from a subject. The catch is that the data were collected in parallel for the subjects, so I can't just take every 4th day across subjects but rather need to do the decimation/downsampling/whatever separately for each subject. The two key columns in the dataframe are "subject" and "session_timestamp". In the latter, the date and time are formatted as in this example: 2017-11-10 16:30:47. Is there a good way to accomplish this in Python?
Edit: The first commenters asked for a more concrete example of the dataframe with some example data. Here is a toy dataframe that is similar to what I have and should be easy to work with. The code below creates a dataframe with 4 columns: subjectID, date, score1 and score2. Note that a subject can have more than one entry for a given date (basically, these are neural recordings, each row of the dataframe represents one neuron, and we can record more than one neuron per subject).
import pandas as pd
import numpy as np
ab = pd.DataFrame()
ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4
ab["date"] = np.random.randint(20, size=200)#random list of "dates" from 0 to 19
ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject
ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement
What I want to do is filter for the data (score1 and score2) that was collected at least 4 days apart for each subject. The code could be extremely simple and take the first day on which a subject has an entry and every 4th day after that. But a better solution would take the first day, then the next one that is more than 3 days later, then the one that is more than 3 days after that, and so on (not every subject has daily samples, so a rigid "every 4th day" rule would not be so elegant). All data collected on the allowed days should be included; for example, all data with day code 0 (if that is the subject's first day) should be included.
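To make the intent concrete, here is a rough, untested sketch of the greedy per-subject selection I have in mind, using the toy dataframe ab from above (the function name keep_spaced_days and the min_gap parameter are just made up for illustration); I am hoping there is a cleaner or more idiomatic pandas way to do this:
def keep_spaced_days(days, min_gap=4):
    #greedily keep the earliest day, then each later day that is
    #at least min_gap days after the last kept day
    kept = []
    for d in sorted(days.unique()):
        if not kept or d - kept[-1] >= min_gap:
            kept.append(d)
    return kept

#keep every row (neuron) recorded on one of the allowed days, per subject
filtered = pd.concat(
    grp[grp["date"].isin(keep_spaced_days(grp["date"]))]
    for _, grp in ab.groupby("subjectID")
)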
First create a dataframe (with random data):
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
ab = pd.DataFrame()
ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4
ab["day_number"] = np.random.randint(50, size=200)#random list of "dates" from 0 to 50
ab['real_date'] = ab.day_number.apply(lambda d: datetime(2018, 1, 1) + timedelta(days=d)) #to simulate real dates
ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject
ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement
min_day = ab.real_date.min()
ab = ab.groupby(['subjectID', 'real_date']).sum() #because some subjects have more than 1 score each day
print(ab.head(10))
day_number score1 score2
subjectID real_date
0 2018-01-01 0 306 273
2018-01-04 3 32 60
2018-01-05 4 61 135
2018-01-08 21 477 393
2018-01-09 8 22 341
2018-01-10 9 137 30
2018-01-11 30 281 674
2018-01-14 13 183 396
2018-01-15 14 41 337
2018-01-16 15 83 50
Then fill the days with no data, using the data of the next existing day:
df = ab.reset_index(level='subjectID').groupby('subjectID').resample('D').mean() #Complete missing dates with NaN
df = df.drop(columns='subjectID')
df = df.groupby(level='subjectID').fillna(method='bfill') #fills the NaN with the first next non NaN value
df = df.apply(pd.to_numeric, downcast='integer') #just to have ints, easier to read
print(df.head(10))
day_number score1 score2
subjectID real_date
0 2018-01-01 0 306 273
2018-01-02 3 32 60
2018-01-03 3 32 60
2018-01-04 3 32 60
2018-01-05 4 61 135
2018-01-06 21 477 393
2018-01-07 21 477 393
2018-01-08 21 477 393
2018-01-09 8 22 341
2018-01-10 9 137 30
Next resample (group by) periods of 4 days:
res = df.reset_index(level='subjectID').groupby('subjectID').resample('4D').first() #group by 4 days periods and keep only the first value
res = res.drop(columns='subjectID')
print(res.head(10))
day_number score1 score2
subjectID real_date
0 2018-01-01 0 306 273
2018-01-05 4 61 135
2018-01-09 8 22 341
2018-01-13 13 183 396
2018-01-17 18 91 46
2018-01-21 20 76 333
2018-01-25 48 131 212
2018-01-29 29 92 81
2018-02-02 32 172 55
2018-02-06 72 98 246
Finally, reset the indexes and handle the case where there are gaps of more than 4 days with no data:
res = res.reset_index('real_date', drop=True) #the real_date has no meaning anymore
res['real_date'] = res.day_number.apply(lambda d: min_day + timedelta(days=d)) #good real_date based on the day_number
res = res.drop(columns='day_number')
res = res.set_index('real_date', append=True)
res = res.groupby(level=['subjectID', 'real_date']).first() #regroups periods with no data for more than 4 days
print(res.head(10))
score1 score2
subjectID real_date
0 2018-01-01 306 273
2018-01-05 61 135
2018-01-09 22 341
2018-01-14 183 396
2018-01-19 91 46
2018-01-21 76 333
2018-01-30 92 81
2018-02-02 172 55
2018-02-10 40 218
2018-02-15 110 112
It is a little complicated, but I think this is the best way to do it. I have no idea about the efficiency, but it doesn't seem too bad.
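If you want to inspect how far apart the kept dates end up for each subject, one quick optional check is to look at the differences between consecutive dates in the result (the name gaps below is just for illustration):
gaps = res.reset_index().groupby('subjectID')['real_date'].diff().dropna() #time between consecutive kept dates, per subject
print(gaps.min(), gaps.max())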