Search code examples
pythonpandasdataframedatetimepython-datetime

Python Pandas - Finding the number of events between two dates (1 data frame and 1 list)


I have a csv file with 2 datetimes (pre-start and pre-end) in each row, as well as a list of datetimes (install_list).

I am trying to iterate through the csv file and add a column that returns the total number of dates from the install_list that are between the pre-start time and the pre-end time in each row.

I am using the code below, but it is returning the total number of items in the list for each row in the csv.

example: File 1 = start time, end time List 1 = install time

Desired Result for Each Row = IF Install Time >= Start Time AND Install Time <= End Time, SUM(Installs)

Col1 (Start Time): 1/1/21 12:00:00 PM

Col2 (End Time): 1/1/21 12:10:00 PM

Install Time List = [1/1/21 12:05:00 PM, 1/1/21 12:11:00 PM]

Desired Result for Row1/Col3 = 1

Code Below:

import datetime
import pandas as pd
from collections import Counter

df_post_logs = pd.read_csv('logs_merged.csv',index_col=False)
df_installs = pd.read_csv('install_merge.csv',index_col=False)

'''Convert UTC to EST on Installs Add Column'''

df_installs['conversion date'] = pd.to_datetime(df_installs['conversion date'],infer_datetime_format='%Y-%m-%d')
df_installs['conversion time'] = pd.to_datetime(df_installs['conversion time'],infer_datetime_format='%H:%S:%M')

utc_datetime = df_installs['conversion time']
est_datetime = utc_datetime - datetime.timedelta(hours=5)


df_installs['utc datetime'] = utc_datetime
df_installs['est datetime'] = est_datetime

'''Add Column 10 Minutes Pre-Spot Time to Post Logs/10 Minutes Post Time to Spot'''

df_post_logs['Air Date'] = pd.to_datetime(df_post_logs['Air Date'],infer_datetime_format='%Y-%m-%d')
df_post_logs['Air Time'] = pd.to_datetime(df_post_logs['Air Time'],infer_datetime_format='%H:%S:%M')

timestamp = df_post_logs['Air Time']

df_post_logs['timestamp'] = timestamp
df_post_logs['pre spot time start'] = timestamp - datetime.timedelta(minutes=10, seconds=1)
df_post_logs['pre spot time end'] = timestamp - datetime.timedelta(seconds=1)
df_post_logs['post spot time'] = timestamp + datetime.timedelta(minutes=10)

'''SUM of Installs between pre-spot time'''

install_list = pd.to_datetime(df_installs['est datetime']).to_list()

for pre_spot_start in df_post_logs['pre spot time start']:
    pre_spot_start_time = pre_spot_start

for pre_spot_end in df_post_logs['pre spot time end']:
    pre_spot_end_time = pre_spot_end

for pre_spot_end in df_post_logs['pre spot time end']:
    pre_spot_end_time = pre_spot_end

pre_spot_install = 0

for row in df_post_logs:
    for date in install_list:
        if date >= pre_spot_start_time and date <= pre_spot_end_time:
            pre_spot_install = pre_spot_install+1

df_post_logs['Pre Spot Install'] = pre_spot_install

df_post_logs.to_csv('Test.csv')

Solution

  • The following code will print for each row, how many values in install_dates are between the respective values in the start and end columns of the dataframe:

    import pandas as pd
    
    df = pd.DataFrame({
      "start": pd.to_datetime(["2018-07-11", "2018-06-10"]),
      "end": pd.to_datetime(["2018-07-20", "2018-06-30"]),
    })
    
    install_dates = pd.to_datetime(["2018-06-25", "2018-07-01", "2018-07-15", "2018-07-18"])
    
    def num_install_dates_between_start_and_end(row):
        return len([d for d in install_dates if row["start"] <= d <= row["end"]])
        
    print(df.agg(num_install_dates_between_start_and_end, axis="columns"))
    

    It uses agg to collapse the information of a row to one number. The way how information is "collapsed" is specified in the function num_install_dates_between_start_and_end, which counts how many elements from install_dates are between the start/end value in the row.