Funnel analysis is usually done by taking the base data and running subqueries within it to count how many records can be bucketed into each stage. A typical funnel might look like:
User signup (1,290) ==> Sign in (897) ==> User purchase (42)
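For example, in pandas terms the bucketing could be sketched roughly like this (the column names and numbers here are made up for illustration):

import pandas as pd

# Hypothetical event table; each row is one user action.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "action":  ["signup", "sign_in", "purchase",
                "signup", "sign_in", "signup"],
})

# Each stage is the subset of users who also reached the previous stage.
signed_up = set(events.loc[events["action"] == "signup", "user_id"])
signed_in = set(events.loc[events["action"] == "sign_in", "user_id"]) & signed_up
purchased = set(events.loc[events["action"] == "purchase", "user_id"]) & signed_in
print(len(signed_up), len(signed_in), len(purchased))  # 3 2 1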
I am trying to analyze various logs for sequence/anomaly detection. Unlike common funnel analysis, the order of events and the timestamps are very important. Take the following example, where we want to flag suspicious activity (such as a malware installation) when a user opens a text editor and then makes a network request. Here is what the log might look like:
[2019-07-14 17:54:04,251] generic.py:98@main [INFO] File opened
[2019-07-14 17:56:03,566] generic.py:98@main [INFO] Network request made
[2019-07-14 17:58:03,883] generic.py:98@main [INFO] File closed
We would flag this as suspicious because the network request was made shortly after the file was opened.
However, the following would not be suspicious, since in the log the network request only appears after the file was closed:
[2019-07-14 17:54:04,251] generic.py:98@main [INFO] File opened
[2019-07-14 17:58:05,883] generic.py:98@main [INFO] File closed
[2019-07-14 17:56:06,566] generic.py:98@main [INFO] Network request made
We'll also put a time limit of 10 minutes from when the file is first opened to when the network request is made. So this also would not be flagged as suspicious:
[2019-07-14 17:54:04,251] generic.py:98@main [INFO] File opened
[2019-07-14 18:56:06,566] generic.py:98@main [INFO] Network request made
[2019-07-14 18:58:05,883] generic.py:98@main [INFO] File closed
Currently, this is being done with a for loop and a lot of internal objects used to check the conditions. However, I was wondering whether it would be possible to do this type of "error detection" directly in SQL, as it would be much faster (if it is possible at all), and I can easily provide the data as timestamp / action rows, for example:
session_id timestamp action
123 2019-07-14 17:54:04,251 file_opened
123 2019-07-15 17:54:04,251 network_request
123 2019-07-16 17:54:04,251 file_closed
Now, back to the question above:
SQL is a query language; it is not designed for this kind of computation. A database, likewise, focuses on writing, reading, storage, and consistency rather than on computation.
So in this case, I think SQL can handle your straightforward example, but it is unlikely to suit your real situation, where the conditions can become too complex.
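That said, the straightforward rule does fit a window function. Here is a minimal sketch using SQLite's LAG, purely for illustration (it assumes SQLite 3.25+ for window-function support, and I changed the millisecond separator from a comma to a period so SQLite's julianday can parse the timestamps):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session_id INTEGER, timestamp TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (123, "2019-07-14 17:54:04.251", "file_opened"),
        (123, "2019-07-15 17:54:04.251", "network_request"),
        (123, "2019-07-16 17:54:04.251", "file_closed"),
    ],
)
rows = conn.execute("""
    SELECT session_id, timestamp, action
    FROM (
        SELECT session_id, timestamp, action,
               LAG(action) OVER w AS action_before,
               julianday(timestamp) - julianday(LAG(timestamp) OVER w) AS days_elapsed
        FROM events
        WINDOW w AS (PARTITION BY session_id ORDER BY timestamp)
    )
    WHERE action = 'network_request'
      AND action_before = 'file_opened'
      AND days_elapsed < 10.0 / (24 * 60)  -- the 10-minute limit, in days
""").fetchall()
print(rows)  # [] here, because the sample events are a day apart

Once the conditions start nesting (multiple event pairs, per-file state, and so on), though, this style of query gets hard to maintain, which is why I would still reach for a dataframe library.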
As for the computation itself, it depends on the amount of data. For relatively simple tasks, we usually use pandas, scipy, and numpy as the tools. For huge tasks, we might use spark as the computation engine, connect it to the database with a connector, and then use pyspark to manipulate the data.
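For the huge-data route, the same check translates almost directly into pyspark window functions. A minimal sketch (assuming a local SparkSession; in practice you would load the dataframe through your database connector rather than building it inline):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.createDataFrame(
    [
        (123, "2019-07-14 17:54:04.251", "file_opened"),
        (123, "2019-07-15 17:54:04.251", "network_request"),
        (123, "2019-07-16 17:54:04.251", "file_closed"),
    ],
    ["session_id", "timestamp", "action"],
).withColumn("timestamp", F.to_timestamp("timestamp"))

# Compare each event with the previous one inside the same session.
w = Window.partitionBy("session_id").orderBy("timestamp")
flagged = (
    events
    .withColumn("action_before", F.lag("action").over(w))
    .withColumn(
        "seconds_elapsed",
        F.col("timestamp").cast("long") - F.lag("timestamp").over(w).cast("long"),
    )
    .filter(
        (F.col("action") == "network_request")
        & (F.col("action_before") == "file_opened")
        & (F.col("seconds_elapsed") < 600)  # the 10-minute limit
    )
)
flagged.show()  # empty here, just like the pandas example below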
Here is a pandas example based on your data:
import pandas as pd
import io
import datetime
# Prepare the data; in practice you would already have it loaded.
data = """
session_id;timestamp;action
123;2019-07-14 17:54:04,251;file_opened
123;2019-07-15 17:54:04,251;network_request
123;2019-07-16 17:54:04,251;file_closed
""".strip()
df = pd.read_csv(io.StringIO(data), delimiter=";", parse_dates=["timestamp"])

# For each row, record the previous action and the time elapsed since it.
df["action_before"] = df["action"].shift(1)
df["time_elapsed"] = df["timestamp"] - df["timestamp"].shift(1)

# This yields an empty dataframe here: the sample events are a day apart,
# so the 10-minute limit is not satisfied.
result = df[(df["action"] == "network_request")
            & (df["action_before"] == "file_opened")
            & (df["time_elapsed"] < datetime.timedelta(minutes=10))]