Search code examples
sqlteradatateradata-sql-assistant

How to iterate over all records to network rows from shared events?


So, I am implementing a user networking system wherein I take records from a logfile which simply lists each user and the session they have logged.

This is an example of the format:

SessionNumber |  UserID                                                                   
10000     | A0000001                                                                   
10001     | B3460009                                                                   
  ...      |    ...                                                                   

Each of these sessions can (and often do) have multiple users in each session

The aim is to assign each user with a Network ID. The rules for the network ID are:

If a users in a session have no assigned Network ID, then assign them a new Network ID (lowest unique positive integer not already assigned to a Network)

If a user in a session already belongs to a network, then all users in that session are assigned that Network ID

If multiple users in a session already belong to networks, then the lowest Network ID is taken and assigned to all users in that session AND to all users in the associated networks

For example, a session occurs with users A,B and C. The table below shows their Network IDs before the session:

User |  Network ID                                                                   
 A   |      -                                                                   
 B   |      6                                                                   
 C   |      8                                                                   

Also users in network 8 are :

User |  Network ID                                                                   
 D   |      8                                                                   
 E   |      8                                                                    

The intended result of the code would be:

User |  Network ID                                                                    
 A          6                                                                   
 B          6                                                                                                                                      
 C          6                                                                   
 D          6                                                                   
 E          6                                                                   

I have already developed the code to do this in RStudio by sequentially going through all sessions in the new log file, checking networks and assigning the appropriate value

The problem arises that ideally this logic needs to be deployed in an SQL environment, Teradata via Teradata SQL-Assistant only

I don't have much experience writing logic or for loops in SQL and have only really done batch querying

Is this possible in SQL only and if so what resources/direction is advised to achieve this?

Thanks!

As mentioned I have already coded this in RStudio but it requires temporary holder variables and is not very efficiently coded, and I don't know whether it can be directly translated to SQL

From my research I have found that cursors may be the way to go but almost all resources I find are for SQL in MySQL or SQL Server and not a Teradata environment

EDIT: Below is my code from RStudio which achieved the networking task

Check_List <- Sessions_n_long_Multi %>% select(User_ID) %>% distinct() #get a list of all Users listed distinctly
Check_List$Network_ID <- 0 #create new column which stores Network_ID (which User-Network a User belongs to)

Next_ID <- 0 #Initialise Network_ID counter
for (row_count in 1:nrow(Sessions_n_long_Multi)){ #for all sessions in the long format session-user attribution table
  snl_holder_session <- Sessions_n_long_Multi[row_count,1] #get the session number for the current target
  snl_holder_pid <- Sessions_n_long_Multi[row_count,2] #get the User_ID for the current target
  CL_match <- Check_List %>% filter(User_ID == snl_holder_pid) #get the CheckList data which matches the target User_ID


  if(CL_match[1,2]==0){ #if the target user has no Network_ID then..
    matched_sessions_user <- Sessions_n_long_Multi %>% filter(User_ID == snl_holder_pid) #get all sessions associated with that user
    matched_sessions_all <- Sessions_n_long_Multi %>% filter(SessionNumber %in% matched_sessions_user$SessionNumber) #then get all users  associated  with that session
    matched_Users <- Check_List %>% filter(User_ID %in% matched_sessions_all$User_ID) #then get a list of all those users data from Check List

    if(max(matched_Users$Network_ID)>0){ #if the maximum Session ID allocation is greater than zero (so someone already has an allocation in that network)
      group_session_IDs <- matched_Users %>% filter(Network_ID!=0) %>% select(Network_ID) #filter out any users who have not been allocated (to remove zero as minimum)
      group_session_ID <- min(group_session_IDs$Network_ID) #get the minimum value for Network_ID out of all users who are allocated a Network_ID

      associated_networks <- Check_List %>% filter(Network_ID %in% group_session_IDs$Network_ID) #get all those from all the associated networks

      Check_List$Network_ID[match(matched_Users$User_ID, Check_List$User_ID)] <- group_session_ID#update the Network_ID for all those in the immediate group
      Check_List$Network_ID[match(associated_networks$User_ID, Check_List$User_ID)] <- group_session_ID#update the Network_ID for all those in the extended group

    }else{#if the list of users all have no allocation at all
      Next_ID <- Next_ID+1 #iterate the Network_ID counter up
      Check_List$Network_ID[match(matched_Users$User_ID, Check_List$User_ID)] <-Next_ID #assign the network with the new Network_ID
    }#end of if(max(matched_Users$Network_ID)>0) (which has different options depending on whether or not the associated users/sessions with target have an assigned network)

  }#end of if(CL_match[1,2]==0) check (which checks if the current target is in a network)
}#end of loop

EDIT: This next block of code is my solution so far. I have gotten some logic to work now, as I can assign each user a number from 1-7 and now I'm working on assigning each session their own network ID. Next focus is to attempt a case where users exist in a network already, then finally, when multiple existing networks are joined

CREATE MULTISET TABLE db.AG_Sessions(
WEB_SESSION int,
USER_ID char(8)
);

INSERT INTO db.AG_Sessions values(1,'a');
INSERT INTO db.AG_Sessions values(1,'b');
INSERT INTO db.AG_Sessions values(2,'c');
INSERT INTO db.AG_Sessions values(2,'d');
INSERT INTO db.AG_Sessions values(3,'e');
INSERT INTO db.AG_Sessions values(3,'f');
INSERT INTO db.AG_Sessions values(3,'a');

CREATE MULTISET TABLE db.AG_Networks(
USER_ID char(8),
NetworkID int
)

CREATE Procedure db.AG_Table_Create()
BEGIN
CREATE VOLATILE  TABLE AG_Multi_Check AS( --this volatile table will hold all multi sessions and all current attributions connected to the users in those sessions
SEL 
a.WEB_SESSION, --get web session from Session Log file
a.USER_ID , --get USER_ID from Session Log file
b.NetworkID --get network ID from Network Log file
 FROM db.AG_Sessions as a --source for Session Log File as a
 LEFT JOIN --left joined ( to allow for full preservation of Session Log data even if no network assigned
 db.AG_Networks as b --source for Network Log File as b
 ON --joining on
 a.USER_ID = b.USER_ID --joining where the USER_IDs are the same
WHERE --this where clause ensures that the sessions and users selected are from multi-user sessions
WEB_SESSION in (
SEL WEB_SESSION
FROM db.AG_Sessions
GROUP BY WEB_SESSION
HAVING COUNT(WEB_SESSION) > 1
)
)WITH DATA --create table with data
    ON COMMIT PRESERVE ROWS; --populate volatile table with data
END;

Create procedure db.new_trail()

Begin --begin procedure
--declare variables
Declare Current_Account char(8); --holder for current account name
Declare Current_Net,Previous_Net,Current_Session,Previous_Session,Max_Net,New_Net int; --holder for current and previous session and network IDs as well as Max current network value

Declare Cursor_Import  cursor  For --declare the cursor which will iterate through the new input values (FROM THE VOLATILE TABLE)
Select WEB_SESSION,USER_ID,NetworkID  from AG_Multi_Check ORDER By WEB_SESSION; --This pointer will select ALL data from the VOLATILE table

Open Cursor_Import; --open the cursor

--initialise variables
SET Previous_Session = 0;
SET Previous_Net = 0;
SET Max_Net = (SELECT COALESCE(max(NetworkID),0) from db.AG_Networks); --set the value of the max current networkID as either the maximum from the list, or if that does not exist, then sets it to zero
SET New_Net = Max_Net+1; -- this increments the value of the maximum network value by 1

Label_loop:
LOOP
Fetch NEXT from Cursor_Import into Current_Session,Current_Account,Current_Net; 
IF SQLSTATE = 02000 THEN
  LEAVE Label_loop;
END IF;

Insert into  db.AG_Networks(USER_ID ,NetworkID) values(Current_Account,New_Net);
SET Previous_Session = Current_Session;
SET Previous_Net = Current_Net;
SET New_Net = New_Net + 1;

END LOOP Label_loop;

Close Cursor_Import;
End;

Call db.new_trail();

Solution

  • Your current approach mimics the R code and will be quite slow on Teradata because it's using cursors (row-by-row processing is always seqential and especially bad in a parallel system).

    Based on your example data (Thanks for providing the sample tables) this SP should update the networks according to your rules. First it assignes new ID to new users and then it updates the new IDs (needs a loop because of possibly multiple sessions per user).

    REPLACE PROCEDURE update_network()
    BEGIN
    
    MERGE INTO AG_Networks AS tgt
    USING
     (/* assign a Network ID to new users */
       SELECT 
          s.USER_ID
         ,Rank() Over (ORDER BY s.USER_ID) -- new sequence
          + Coalesce((SELECT Max(NetworkID) FROM AG_Networks), 0) AS NetworkID  -- previous max ID
       FROM AG_Sessions AS s --source for Session Log File as a
       WHERE NOT EXISTS
        ( -- only new users
          SELECT *
          FROM AG_Networks AS n
          WHERE s.USER_ID = n.USER_ID
        )
       GROUP BY 1
     ) AS src
    ON src.USER_ID = tgt.USER_ID
    WHEN NOT MATCHED THEN
    INSERT (USER_ID, NetworkID)
    VALUES (
       src.USER_ID
      ,src.NetworkID
     );
    
    REPEAT -- Update to the new NetworkID per user
    
       MERGE INTO AG_Networks AS tgt
       USING
        ( -- Lowest NetworkID per user from multiple session
          SELECT USER_ID, Min(new_ID) AS new_ID
          FROM
           ( -- Lowest NetworkID per session
             SELECT s.WEB_SESSION, n.USER_ID, NetworkID
                  ,Min(NetworkID) Over (PARTITION BY s.WEB_SESSION) AS new_ID
                FROM AG_Networks AS n
                LEFT JOIN AG_Sessions AS s
                  ON s.USER_ID = n.USER_ID
                QUALIFY new_ID <> n.NetworkID -- only new IDs, also removes single user sessions
                    AND Count(s.WEB_SESSION) Over (PARTITION BY NetworkID)> 0 -- remove networks without new sessions
           ) AS dt 
          GROUP BY 1
        ) AS src
       ON src.USER_ID = tgt.USER_ID
       AND src.new_ID <> tgt.NetworkID -- new NetworkID for user
       WHEN MATCHED THEN
       UPDATE SET NetworkID = src.new_ID
       ;
       -- no new IDs
       UNTIL Activity_Count = 0 
       END REPEAT;
    END;