Search code examples
pythonstatisticsdata-analysis

Data analysis optimazation


I have a dataset with cars being spotted on two different cameras. I need to calculate the average time it takes to travel from camera 1 to 2. Database looking like so:

"ID","PLATE", "CSPOTID", "INOUT_FLAG","EVENT_TIME"
"33173","xys8","1","0","2020-08-27 08:24:53"
"33174","asd4","1","0","2020-08-27 08:24:58"
"33175","------","2","1","2020-08-27 08:25:03"
"33176","asd4","1","0","2020-08-27 08:25:04"
"33177","ghj1","1","0","2020-08-27 08:25:08"
 ...

Currently my code works as intended and calculates the average time between different rows. But working with big data and having a flow of incoming data, it takes too much time.

import numpy as np, matplotlib.pyplot as plt, pandas as pd,collections, sys, operator, datetime
df = pd.read_csv('tmetrics_base2.csv', quotechar='"', skipinitialspace=True, delimiter=',', dtype={"ID": int, "PLATE": "string", "CSPOTID": int, "INOUT_FLAG": int,"EVENT_TIME": "string"})
data = df.as_matrix()

#Sort values by PLATE
dfSortedByPlate = df.sort_values(['PLATE', 'EVENT_TIME'])

#List for already tested PLATEs
TestedPlate = []
resultList = []
#Iterate through all rows in db
for i,j in dfSortedByPlate.iterrows():
    # If PLATE IS "------" = skip it
    if j[1] == "-------":
        continue
    if j[1] in TestedPlate:
        continue
    TestedPlate.append(j[1])
    for ii,jj in dfSortedByPlate.iterrows():
        if j[1] != jj[1]:           
            continue
        if j[1] == jj[1]:
            dt1 = datetime.datetime.strptime(jj[4],'%Y-%m-%d %H:%M:%S')
            dt2 = datetime.datetime.strptime(j[4],'%Y-%m-%d %H:%M:%S')
            Travel_time = []
            Travel_time.append((dt1 - dt2).total_seconds())
            # Discard if greater than 1 hour and less than 3min
            if (dt1 - dt2).total_seconds() < 3000 and (dt1 - dt2).total_seconds() > 180:
                resultList.append((dt1 - dt2).total_seconds())
                #print((dt1 - dt2).total_seconds())
                print(sum(resultList) / len(resultList))
            placeholdertime = jj[4]

I have sorted the database by plate number so that the comparison should be fairly quick. Any advice or pointers to where I could increase run speed would be greatly appreciated.

Also I am unsure of how long time I should expect calculations like these to take? I don't have experience with data in this scale.


Solution

  • You can speed up your code by removing unnecessary for loops. You can use in-built pandas functions that are typically faster than iterating through rows in the df. For instance, you can replace the two for loops by:

    #get only relevant plates
    df_relevant = dfSortedByPlate[dfSortedByPlate['PLATE'] != "-------"]
    
    #test relevant plates
    for i,j in df_relevant.iterrows():
        df_same_plate_j = df_relevant[df_relevant['PLATE'] == j[1]]
        for ii, jj in df_same_plate_j.iterrows():
            dt1 = datetime.datetime.strptime(jj[4],'%Y-%m-%d %H:%M:%S')
            dt2 = datetime.datetime.strptime(j[4],'%Y-%m-%d %H:%M:%S')
            Travel_time = []
            Travel_time.append((dt1 - dt2).total_seconds())
            # Discard if greater than 1 hour and less than 3min
            if (dt1 - dt2).total_seconds() < 3000 and (dt1 - dt2).total_seconds() > 180:
                resultList.append((dt1 - dt2).total_seconds())
                #print((dt1 - dt2).total_seconds())
                print(sum(resultList) / len(resultList))
            placeholdertime = jj[4]
    

    df_relevant now contains all plates that you want to test. Then, df_same_plate_j gets the rows in df_relevant that have the same plate as row j. Then you do the rest. This way, the number of items you are iterating over is much less.