Search code examples
algorithmmathgraph-theoryknowledge-graph

What is the best approach for large scale Paths and Funnels Analysis?


We have a big dataset of user actions on our internal apps. I am trying to create an algorithm for Paths & Funnels analytics which will take parameters for Paths (i.e. Start and End point) and a defined step of actions for Funnel. What is the best algorithm to program this with large data? The output should be just counts of users for specific set of actions like this :

Format of the file to scan:

UserID Action TS
1 A 06/04/2022
1 B 06/04/2022
1 C 06/04/2022
1 D 06/04/2022
2 G 06/04/2022
2 H 06/04/2022
2 K 06/04/2022

Algorithm input parameters:

  1. For Path : User statistics on the start point A and end point F
  2. For Funnel: User statistics on the defined steps A->B->C->D
Path Count
A->B->C->D 385
G->H->K 89

where A,B,C,D,... are nodes for user actions or pages.

This should be easy using Python for a smaller set, but the issue is, I am worried about performance, as I am dealing with millions of records like this. Please help!


Solution

  • Assuming that

    ...
    1 A ts
    1 B ts
    ...
    

    in the input data means user 1 went A -> B

    the algorithm is

    CREATE new table paths_users_followed
    CREATE new path
    LOOP over data input rows, except last
       IF user in row equals user in row+1
          ADD action in row to path
          IF row+1 is last row
             ADD action in last to path
             ADD user, path to paths_users_followed
       ELSE
          ADD user, path to paths_users_followed
          CREATE new PATH
    ENDLOOP
    
    LOOP P over input of "path statistics"
        COUNT occurrences of P in paths_users_followed
    

    This can be most easily and efficiently implemented using a high performance database engine - I would use SQLite.