I have a csv file with 2 datetimes (pre-start and pre-end) in each row, as well as a list of datetimes (install_list).
I am trying to iterate through the csv file and add a column that returns the total number of dates from the install_list that are between the pre-start time and the pre-end time in each row.
I am using the code below, but it is returning the total number of items in the list for each row in the csv.
example: File 1 = start time, end time List 1 = install time
Desired Result for Each Row = IF Install Time >= Start Time AND Install Time <= End Time, SUM(Installs)
Col1 (Start Time): 1/1/21 12:00:00 PM
Col2 (End Time): 1/1/21 12:10:00 PM
Install Time List = [1/1/21 12:05:00 PM, 1/1/21 12:11:00 PM]
Desired Result for Row1/Col3 = 1
Code Below:
import datetime
import pandas as pd
from collections import Counter
df_post_logs = pd.read_csv('logs_merged.csv',index_col=False)
df_installs = pd.read_csv('install_merge.csv',index_col=False)
'''Convert UTC to EST on Installs Add Column'''
df_installs['conversion date'] = pd.to_datetime(df_installs['conversion date'],infer_datetime_format='%Y-%m-%d')
df_installs['conversion time'] = pd.to_datetime(df_installs['conversion time'],infer_datetime_format='%H:%S:%M')
utc_datetime = df_installs['conversion time']
est_datetime = utc_datetime - datetime.timedelta(hours=5)
df_installs['utc datetime'] = utc_datetime
df_installs['est datetime'] = est_datetime
'''Add Column 10 Minutes Pre-Spot Time to Post Logs/10 Minutes Post Time to Spot'''
df_post_logs['Air Date'] = pd.to_datetime(df_post_logs['Air Date'],infer_datetime_format='%Y-%m-%d')
df_post_logs['Air Time'] = pd.to_datetime(df_post_logs['Air Time'],infer_datetime_format='%H:%S:%M')
timestamp = df_post_logs['Air Time']
df_post_logs['timestamp'] = timestamp
df_post_logs['pre spot time start'] = timestamp - datetime.timedelta(minutes=10, seconds=1)
df_post_logs['pre spot time end'] = timestamp - datetime.timedelta(seconds=1)
df_post_logs['post spot time'] = timestamp + datetime.timedelta(minutes=10)
'''SUM of Installs between pre-spot time'''
install_list = pd.to_datetime(df_installs['est datetime']).to_list()
for pre_spot_start in df_post_logs['pre spot time start']:
pre_spot_start_time = pre_spot_start
for pre_spot_end in df_post_logs['pre spot time end']:
pre_spot_end_time = pre_spot_end
for pre_spot_end in df_post_logs['pre spot time end']:
pre_spot_end_time = pre_spot_end
pre_spot_install = 0
for row in df_post_logs:
for date in install_list:
if date >= pre_spot_start_time and date <= pre_spot_end_time:
pre_spot_install = pre_spot_install+1
df_post_logs['Pre Spot Install'] = pre_spot_install
df_post_logs.to_csv('Test.csv')
The following code will print for each row, how many values in install_dates
are between the respective values in the start
and end
columns of the dataframe:
import pandas as pd
df = pd.DataFrame({
"start": pd.to_datetime(["2018-07-11", "2018-06-10"]),
"end": pd.to_datetime(["2018-07-20", "2018-06-30"]),
})
install_dates = pd.to_datetime(["2018-06-25", "2018-07-01", "2018-07-15", "2018-07-18"])
def num_install_dates_between_start_and_end(row):
return len([d for d in install_dates if row["start"] <= d <= row["end"]])
print(df.agg(num_install_dates_between_start_and_end, axis="columns"))
It uses agg
to collapse the information of a row to one number.
The way how information is "collapsed" is specified in the function num_install_dates_between_start_and_end
, which counts how many elements from install_dates
are between the start/end value in the row.