Search code examples
pythonsqlpandasanomaly-detection

"Funnel analysis" where order matters


Funnel analysis is usually done by getting the base data, and from that doing subqueries within that data to see what amount of data can be bucketed into that section. A good use case would be:

User signup (1,290) ==> Sign in (897) ==> User purchase (42)

I am trying to analyze various logs for sequence/anomaly detection. The order of events and the timestamps are very important, unlike common funnel analysis. Let's take the following example, where we are trying to flag a suspicious activity (such as malware installation) when a user opens a text editor and then makes a network request. Here is what the log might look like:

[2019-07-14 17:54:04,251] generic.py:98@main [INFO] File opened
[2019-07-14 17:56:03,566] generic.py:98@main [INFO] Network request made
[2019-07-14 17:58:03,883] generic.py:98@main [INFO] File closed

This we would flag as suspicious because the network request was made right after the file was opened.

However, this would not be suspicious:

[2019-07-14 17:54:04,251] generic.py:98@main [INFO] File opened
[2019-07-14 17:58:05,883] generic.py:98@main [INFO] File closed
[2019-07-14 17:56:06,566] generic.py:98@main [INFO] Network request made

And, we'll also put a time-limit of "10 seconds" from when the file is first opened to when the network request was made. So this also would not be flagged as suspicious:

[2019-07-14 17:54:04,251] generic.py:98@main [INFO] File opened
[2019-07-14 18:56:06,566] generic.py:98@main [INFO] Network request made
[2019-07-14 18:58:05,883] generic.py:98@main [INFO] File closed

Currently, this is being done with a for loop and a lot of internal objects used to check the conditions. However, I was wondering if it would be possible to do this type of "error detection" directly in SQL, as it would be much faster (if possible), and I could easily use the data in the form of timestamp / action, for example:

session_id          timestamp                       action
123                 2019-07-14 17:54:04,251         file_opened
123                 2019-07-15 17:54:04,251         network_request
123                 2019-07-16 17:54:04,251         file_closed

So, back to the above question:

  • Is there a name for this specific type of anomaly detection?
  • Could SQL be used for it (in a realistic way)? While the above example is quite straightforward, the conditions could get much more complex.
  • If not SQL, what are the most common tools to do this sort of stuff (for example, pandas? If so, could an example be given)?

Solution

  • SQL is a query language which is not designed for computation. Further, a database is also not designed for computation. It focuses on write, read, storing, consistency, etc. but not computation.

    So in this case, I think that SQL can be used for your straightforward example, but it won't be suitable for your real situation as it can be too complex.

    And back to the computation, it depends on the amount of data. For a relatively simple task, we usually use pandas, scipy, numpy as tools. While for some huge tasks, we might use spark as a computation engine and use a connector to connect spark with the database, then use pyspark to manipulate the data.

    Here is an example of pandas based on your given example:

    import pandas as pd
    import io
    import datetime
    
    # Prepare data, you can assume that you have already got this
    data = """
    session_id;timestamp;action
    123;2019-07-14 17:54:04,251;file_opened
    123;2019-07-15 17:54:04,251;network_request
    123;2019-07-16 17:54:04,251;file_closed
    """.strip()
    df = pd.read_csv(io.BytesIO(data.encode()), delimiter=";", parse_dates=["timestamp"])
    
    df["action_before"] = df["action"].shift(1)
    df["time_elapsed"] = df["timestamp"] - df["timestamp"].shift(1)
    
    # This will give you an empty dataframe as the third condition is not satisfied.
    result =  df[(df["action"] == "network_request") 
                 & (df["action_before"] == "file_opened") 
                 & (df["time_elapsed"] < datetime.timedelta(minutes=10))]