I have a dataset with cars being spotted on two different cameras. I need to calculate the average time it takes to travel from camera 1 to 2. Database looking like so:
"ID","PLATE", "CSPOTID", "INOUT_FLAG","EVENT_TIME"
"33173","xys8","1","0","2020-08-27 08:24:53"
"33174","asd4","1","0","2020-08-27 08:24:58"
"33175","------","2","1","2020-08-27 08:25:03"
"33176","asd4","1","0","2020-08-27 08:25:04"
"33177","ghj1","1","0","2020-08-27 08:25:08"
...
Currently my code works as intended and calculates the average time between different rows. But working with big data and having a flow of incoming data, it takes too much time.
import numpy as np, matplotlib.pyplot as plt, pandas as pd,collections, sys, operator, datetime
df = pd.read_csv('tmetrics_base2.csv', quotechar='"', skipinitialspace=True, delimiter=',', dtype={"ID": int, "PLATE": "string", "CSPOTID": int, "INOUT_FLAG": int,"EVENT_TIME": "string"})
data = df.as_matrix()
#Sort values by PLATE
dfSortedByPlate = df.sort_values(['PLATE', 'EVENT_TIME'])
#List for already tested PLATEs
TestedPlate = []
resultList = []
#Iterate through all rows in db
for i,j in dfSortedByPlate.iterrows():
# If PLATE IS "------" = skip it
if j[1] == "-------":
continue
if j[1] in TestedPlate:
continue
TestedPlate.append(j[1])
for ii,jj in dfSortedByPlate.iterrows():
if j[1] != jj[1]:
continue
if j[1] == jj[1]:
dt1 = datetime.datetime.strptime(jj[4],'%Y-%m-%d %H:%M:%S')
dt2 = datetime.datetime.strptime(j[4],'%Y-%m-%d %H:%M:%S')
Travel_time = []
Travel_time.append((dt1 - dt2).total_seconds())
# Discard if greater than 1 hour and less than 3min
if (dt1 - dt2).total_seconds() < 3000 and (dt1 - dt2).total_seconds() > 180:
resultList.append((dt1 - dt2).total_seconds())
#print((dt1 - dt2).total_seconds())
print(sum(resultList) / len(resultList))
placeholdertime = jj[4]
I have sorted the database by plate number so that the comparison should be fairly quick. Any advice or pointers to where I could increase run speed would be greatly appreciated.
Also I am unsure of how long time I should expect calculations like these to take? I don't have experience with data in this scale.
You can speed up your code by removing unnecessary for loops. You can use in-built pandas functions that are typically faster than iterating through rows in the df. For instance, you can replace the two for loops by:
#get only relevant plates
df_relevant = dfSortedByPlate[dfSortedByPlate['PLATE'] != "-------"]
#test relevant plates
for i,j in df_relevant.iterrows():
df_same_plate_j = df_relevant[df_relevant['PLATE'] == j[1]]
for ii, jj in df_same_plate_j.iterrows():
dt1 = datetime.datetime.strptime(jj[4],'%Y-%m-%d %H:%M:%S')
dt2 = datetime.datetime.strptime(j[4],'%Y-%m-%d %H:%M:%S')
Travel_time = []
Travel_time.append((dt1 - dt2).total_seconds())
# Discard if greater than 1 hour and less than 3min
if (dt1 - dt2).total_seconds() < 3000 and (dt1 - dt2).total_seconds() > 180:
resultList.append((dt1 - dt2).total_seconds())
#print((dt1 - dt2).total_seconds())
print(sum(resultList) / len(resultList))
placeholdertime = jj[4]
df_relevant
now contains all plates that you want to test. Then, df_same_plate_j
gets the rows in df_relevant
that have the same plate as row j
. Then you do the rest. This way, the number of items you are iterating over is much less.