Search code examples
pythonpandastime-seriesindoor-positioning-system

Calculating duration per user per location user of Wi-Fi timeseries logs


I have a Wi-Fi log dataset that shows users interacting with Wireless access points (AP). I was able to generate new column named floor that shows which floor the APs are on, giving a very rough estimation of user position in the building.

Here is a simplified sample of the data set:

ID _user vlan_role Floor Datetime
1 aaa Staff 1 2022-01-03 07:21:49-06:00
2 bbb Staff 1 2022-01-03 07:21:49-06:00
3 ccc Student 1 2022-01-03 07:22:57-06:00
4 bbb Staff 1 2022-01-03 07:25:21-06:00
5 bbb Staff 1 2022-01-03 07:25:21-06:00
6 bbb Staff 1 2022-01-03 07:25:26-06:00
7 aaa Staff 1 2022-01-03 07:26:13-06:00
8 aaa Staff 1 2022-01-03 07:26:13-06:00
9 bbb Staff 1 2022-01-03 07:26:57-06:00
10 ccc Student 1 2022-01-03 07:26:57-06:00
12 bbb Staff 2 2022-01-03 08:23:22-06:00
13 bbb Staff 2 2022-01-03 08:23:22-06:00
14 ccc Student 2 2022-01-03 08:23:44-06:00
16 ccc Student 2 2022-01-03 08:23:49-06:00
17 aaa Staff 2 2022-01-03 08:43:19-06:00
18 ccc Student 2 2022-01-03 08:43:19-06:00
19 bbb Staff 2 2022-01-03 08:43:51-06:00
20 aaa Staff 2 2022-01-03 08:56:27-06:00
21 aaa Staff 2 2022-01-03 08:56:27-06:00
22 aaa Staff 2 2022-01-03 08:56:29-06:00
23 aaa Staff 6 2022-01-03 08:56:31-06:00
24 bbb Staff 6 2022-01-03 08:56:32-06:00
25 bbb Staff 6 2022-01-03 09:08:16-06:00
28 aaa Staff 6 2022-01-03 09:08:39-06:00
29 ccc Student 1 2022-01-03 09:08:52-06:00
31 bbb Staff 1 2022-01-03 09:08:57-06:00
32 aaa Staff 1 2022-01-03 09:10:22-06:00

My end goal is to generate a duration for each user at at each floor. After some research I put this together:

df["duration"] = df.groupby("Floor")["DateTime"].transform(lambda x: np.ptp(x.to_numpy()))

which will calculate the duration of each floor, but it doesn't account for users or break off when changing floors. Rather it just presents a total time for each floor (shown below).

ID _user vlan_role Floor DateTime duration
1 aaa Staff 1 2022-01-03 07:21:49-06:00 0 days 01:48:33
2 bbb Staff 1 2022-01-03 07:21:49-06:00 0 days 01:48:33
3 ccc Student 1 2022-01-03 07:22:57-06:00 0 days 01:48:33
4 bbb Staff 1 2022-01-03 07:25:21-06:00 0 days 01:48:33
5 bbb Staff 1 2022-01-03 07:25:21-06:00 0 days 01:48:33
6 bbb Staff 1 2022-01-03 07:25:26-06:00 0 days 01:48:33
7 aaa Staff 1 2022-01-03 07:26:13-06:00 0 days 01:48:33
8 aaa Staff 1 2022-01-03 07:26:13-06:00 0 days 01:48:33
9 bbb Staff 1 2022-01-03 07:26:57-06:00 0 days 01:48:33
10 ccc Student 1 2022-01-03 07:26:57-06:00 0 days 01:48:33
12 bbb Staff 2 2022-01-03 08:23:22-06:00 0 days 00:33:07
13 bbb Staff 2 2022-01-03 08:23:22-06:00 0 days 00:33:07
14 ccc Student 2 2022-01-03 08:23:44-06:00 0 days 00:33:07
16 ccc Student 2 2022-01-03 08:23:49-06:00 0 days 00:33:07
17 aaa Staff 2 2022-01-03 08:43:19-06:00 0 days 00:33:07
18 ccc Student 2 2022-01-03 08:43:19-06:00 0 days 00:33:07
19 bbb Staff 2 2022-01-03 08:43:51-06:00 0 days 00:33:07
20 aaa Staff 2 2022-01-03 08:56:27-06:00 0 days 00:33:07
21 aaa Staff 2 2022-01-03 08:56:27-06:00 0 days 00:33:07
22 aaa Staff 2 2022-01-03 08:56:29-06:00 0 days 00:33:07
23 aaa Staff 6 2022-01-03 08:56:31-06:00 0 days 00:12:08
24 bbb Staff 6 2022-01-03 08:56:32-06:00 0 days 00:12:08
25 bbb Staff 6 2022-01-03 09:08:16-06:00 0 days 00:12:08
28 aaa Staff 6 2022-01-03 09:08:39-06:00 0 days 00:12:08
29 ccc Student 1 2022-01-03 09:08:52-06:00 0 days 01:48:33
31 bbb Staff 1 2022-01-03 09:08:57-06:00 0 days 01:48:33
32 aaa Staff 1 2022-01-03 09:10:22-06:00 0 days 01:48:33

Since the Wi-fi devices are constantly interacting with the APs the raw data has a bunch of duplicates. For instance in the table above, User "aaa" interacts with a 1st floor AP three times before moving to the 2nd floor at ID 17. So ideally what I am trying to do is identify the new change in floor value (ID 17) and subtract that from the old change in floor value (ID 1)... for each user. Below is an example of my expected output that I manually calculated. Duplicates are removed and a floor duration is added.

oldID newID user role Floor DateTime Duration
1 1 aaa Staff 1 2022-01-03 07:21:49-06:00 1:21:30
2 2 bbb Staff 1 2022-01-03 07:21:49-06:00 1:01:33
3 3 ccc Student 1 2022-01-03 07:22:57-06:00 1:00:47
12 4 bbb Staff 2 2022-01-03 08:23:22-06:00 0:33:10
14 5 ccc Student 2 2022-01-03 08:23:44-06:00 0:48:08
17 6 aaa Staff 2 2022-01-03 08:43:19-06:00 0:13:12
23 7 aaa Staff 6 2022-01-03 08:56:31-06:00 0:13:51
24 8 bbb Staff 6 2022-01-03 08:56:32-06:00 0:12:25
29 9 ccc Student 1 2022-01-03 09:08:52-06:00
31 10 bbb Staff 1 2022-01-03 09:08:57-06:00
32 11 aaa Staff 1 2022-01-03 09:10:22-06:00

Solution

  • So it looks like there are a few problems with the data set that we'll need to resolve before getting good information:

    • There are duplicate values (ie - ID 7 and 8 have the same timestamp)
    • The user can interact with a floor more than once before moving

    The first one can be cleaned up by running df.drop_duplicates() to get rid of a user interacting multiple times, which cleans up the data.

    The second one is a bit more complicated - in addition to a user interacting with a floor multiple times, we also have the edge case of the end of data where a user doesn't switch the floor at the end of the data set (which it looks like you want no data for that based on your manual calculation).

    For this, I recommend using DataFrame.iterrows, a generator that yields both the index and row (as a Series).

    As you loop through the rows, you could calculate the data series and store the data in a temporary variable. I'd use a dictionary because you can be very verbose about what you're storing, like this:

    # Pseudo code of what that data would look like
    calculated_data = { 
        'aaa': {
            'last_floor':1, 
            'last_time':'2022-01-03 07:21:49-06:00'
        }
    }
    

    As you iterate through your dataframe, check each row against the calculated_data dict to see if the user changed floors - if they did change floors, take the difference from the last_time against the new time. This might look something like:

    ### Pseudo code
    # final_data will store the output calculations
    final_data = pd.DataFrame(columns=['user', 'role','Floor','DateTime','Duration'])
    
    for index, row in df.iterrows(): # df is your raw data
        # Check to see if user exists in dict - if not, add
        if (row['user'] not in calculated_data):
            calculated_data[row['user']] = {'last_floor':row['Floor'],'last_time':row['DateTime']}
    
        # User moved up a floor - calculate the difference and save
        elif (row['Floor'] != calculated_data[row['user']]['last_floor']):
            duration = row['DateTime'] - calculated_data[row['user']]['last_time']
            final_data.loc[len(final_data)] = [row['user'],row['role'],calculated_data[row['user']]['last_floor'],calculated_data[row['user']]['last_time'],duration
            # Update the dict with the latest floor data
            calculated_data[row['user']] = {'last_floor':row['Floor'],'last_time':row['DateTime']}
    

    After that piece of code runs, you'd just want a cleanup at the end where you iterate through the dictionary to include the last set of data of floors that will have no duration and add to the final data frame (I'm referencing the last few lines of your manual calculation).

    The above code snippet could be cleaned up a bit more to make it more readable, but this is the general gist - hope it helps!

    EDIT:

    After proofreading, I made a huge assumption that your DataTime column is sorted. If you're going to use my suggestion, make sure your data is sorted!