So, I am implementing a user networking system wherein I take records from a logfile which simply lists each user and the session they have logged.
This is an example of the format:
SessionNumber | UserID
10000 | A0000001
10001 | B3460009
... | ...
Each of these sessions can (and often do) have multiple users in each session
The aim is to assign each user with a Network ID. The rules for the network ID are:
If a users in a session have no assigned Network ID, then assign them a new Network ID (lowest unique positive integer not already assigned to a Network)
If a user in a session already belongs to a network, then all users in that session are assigned that Network ID
If multiple users in a session already belong to networks, then the lowest Network ID is taken and assigned to all users in that session AND to all users in the associated networks
For example, a session occurs with users A,B and C. The table below shows their Network IDs before the session:
User | Network ID
A | -
B | 6
C | 8
Also users in network 8 are :
User | Network ID
D | 8
E | 8
The intended result of the code would be:
User | Network ID
A 6
B 6
C 6
D 6
E 6
I have already developed the code to do this in RStudio by sequentially going through all sessions in the new log file, checking networks and assigning the appropriate value
The problem arises that ideally this logic needs to be deployed in an SQL environment, Teradata via Teradata SQL-Assistant only
I don't have much experience writing logic or for loops in SQL and have only really done batch querying
Is this possible in SQL only and if so what resources/direction is advised to achieve this?
Thanks!
As mentioned I have already coded this in RStudio but it requires temporary holder variables and is not very efficiently coded, and I don't know whether it can be directly translated to SQL
From my research I have found that cursors may be the way to go but almost all resources I find are for SQL in MySQL or SQL Server and not a Teradata environment
EDIT: Below is my code from RStudio which achieved the networking task
Check_List <- Sessions_n_long_Multi %>% select(User_ID) %>% distinct() #get a list of all Users listed distinctly
Check_List$Network_ID <- 0 #create new column which stores Network_ID (which User-Network a User belongs to)
Next_ID <- 0 #Initialise Network_ID counter
for (row_count in 1:nrow(Sessions_n_long_Multi)){ #for all sessions in the long format session-user attribution table
snl_holder_session <- Sessions_n_long_Multi[row_count,1] #get the session number for the current target
snl_holder_pid <- Sessions_n_long_Multi[row_count,2] #get the User_ID for the current target
CL_match <- Check_List %>% filter(User_ID == snl_holder_pid) #get the CheckList data which matches the target User_ID
if(CL_match[1,2]==0){ #if the target user has no Network_ID then..
matched_sessions_user <- Sessions_n_long_Multi %>% filter(User_ID == snl_holder_pid) #get all sessions associated with that user
matched_sessions_all <- Sessions_n_long_Multi %>% filter(SessionNumber %in% matched_sessions_user$SessionNumber) #then get all users associated with that session
matched_Users <- Check_List %>% filter(User_ID %in% matched_sessions_all$User_ID) #then get a list of all those users data from Check List
if(max(matched_Users$Network_ID)>0){ #if the maximum Session ID allocation is greater than zero (so someone already has an allocation in that network)
group_session_IDs <- matched_Users %>% filter(Network_ID!=0) %>% select(Network_ID) #filter out any users who have not been allocated (to remove zero as minimum)
group_session_ID <- min(group_session_IDs$Network_ID) #get the minimum value for Network_ID out of all users who are allocated a Network_ID
associated_networks <- Check_List %>% filter(Network_ID %in% group_session_IDs$Network_ID) #get all those from all the associated networks
Check_List$Network_ID[match(matched_Users$User_ID, Check_List$User_ID)] <- group_session_ID#update the Network_ID for all those in the immediate group
Check_List$Network_ID[match(associated_networks$User_ID, Check_List$User_ID)] <- group_session_ID#update the Network_ID for all those in the extended group
}else{#if the list of users all have no allocation at all
Next_ID <- Next_ID+1 #iterate the Network_ID counter up
Check_List$Network_ID[match(matched_Users$User_ID, Check_List$User_ID)] <-Next_ID #assign the network with the new Network_ID
}#end of if(max(matched_Users$Network_ID)>0) (which has different options depending on whether or not the associated users/sessions with target have an assigned network)
}#end of if(CL_match[1,2]==0) check (which checks if the current target is in a network)
}#end of loop
EDIT: This next block of code is my solution so far. I have gotten some logic to work now, as I can assign each user a number from 1-7 and now I'm working on assigning each session their own network ID. Next focus is to attempt a case where users exist in a network already, then finally, when multiple existing networks are joined
CREATE MULTISET TABLE db.AG_Sessions(
WEB_SESSION int,
USER_ID char(8)
);
INSERT INTO db.AG_Sessions values(1,'a');
INSERT INTO db.AG_Sessions values(1,'b');
INSERT INTO db.AG_Sessions values(2,'c');
INSERT INTO db.AG_Sessions values(2,'d');
INSERT INTO db.AG_Sessions values(3,'e');
INSERT INTO db.AG_Sessions values(3,'f');
INSERT INTO db.AG_Sessions values(3,'a');
CREATE MULTISET TABLE db.AG_Networks(
USER_ID char(8),
NetworkID int
)
CREATE Procedure db.AG_Table_Create()
BEGIN
CREATE VOLATILE TABLE AG_Multi_Check AS( --this volatile table will hold all multi sessions and all current attributions connected to the users in those sessions
SEL
a.WEB_SESSION, --get web session from Session Log file
a.USER_ID , --get USER_ID from Session Log file
b.NetworkID --get network ID from Network Log file
FROM db.AG_Sessions as a --source for Session Log File as a
LEFT JOIN --left joined ( to allow for full preservation of Session Log data even if no network assigned
db.AG_Networks as b --source for Network Log File as b
ON --joining on
a.USER_ID = b.USER_ID --joining where the USER_IDs are the same
WHERE --this where clause ensures that the sessions and users selected are from multi-user sessions
WEB_SESSION in (
SEL WEB_SESSION
FROM db.AG_Sessions
GROUP BY WEB_SESSION
HAVING COUNT(WEB_SESSION) > 1
)
)WITH DATA --create table with data
ON COMMIT PRESERVE ROWS; --populate volatile table with data
END;
Create procedure db.new_trail()
Begin --begin procedure
--declare variables
Declare Current_Account char(8); --holder for current account name
Declare Current_Net,Previous_Net,Current_Session,Previous_Session,Max_Net,New_Net int; --holder for current and previous session and network IDs as well as Max current network value
Declare Cursor_Import cursor For --declare the cursor which will iterate through the new input values (FROM THE VOLATILE TABLE)
Select WEB_SESSION,USER_ID,NetworkID from AG_Multi_Check ORDER By WEB_SESSION; --This pointer will select ALL data from the VOLATILE table
Open Cursor_Import; --open the cursor
--initialise variables
SET Previous_Session = 0;
SET Previous_Net = 0;
SET Max_Net = (SELECT COALESCE(max(NetworkID),0) from db.AG_Networks); --set the value of the max current networkID as either the maximum from the list, or if that does not exist, then sets it to zero
SET New_Net = Max_Net+1; -- this increments the value of the maximum network value by 1
Label_loop:
LOOP
Fetch NEXT from Cursor_Import into Current_Session,Current_Account,Current_Net;
IF SQLSTATE = 02000 THEN
LEAVE Label_loop;
END IF;
Insert into db.AG_Networks(USER_ID ,NetworkID) values(Current_Account,New_Net);
SET Previous_Session = Current_Session;
SET Previous_Net = Current_Net;
SET New_Net = New_Net + 1;
END LOOP Label_loop;
Close Cursor_Import;
End;
Call db.new_trail();
Your current approach mimics the R code and will be quite slow on Teradata because it's using cursors (row-by-row processing is always seqential and especially bad in a parallel system).
Based on your example data (Thanks for providing the sample tables) this SP should update the networks according to your rules. First it assignes new ID to new users and then it updates the new IDs (needs a loop because of possibly multiple sessions per user).
REPLACE PROCEDURE update_network()
BEGIN
MERGE INTO AG_Networks AS tgt
USING
(/* assign a Network ID to new users */
SELECT
s.USER_ID
,Rank() Over (ORDER BY s.USER_ID) -- new sequence
+ Coalesce((SELECT Max(NetworkID) FROM AG_Networks), 0) AS NetworkID -- previous max ID
FROM AG_Sessions AS s --source for Session Log File as a
WHERE NOT EXISTS
( -- only new users
SELECT *
FROM AG_Networks AS n
WHERE s.USER_ID = n.USER_ID
)
GROUP BY 1
) AS src
ON src.USER_ID = tgt.USER_ID
WHEN NOT MATCHED THEN
INSERT (USER_ID, NetworkID)
VALUES (
src.USER_ID
,src.NetworkID
);
REPEAT -- Update to the new NetworkID per user
MERGE INTO AG_Networks AS tgt
USING
( -- Lowest NetworkID per user from multiple session
SELECT USER_ID, Min(new_ID) AS new_ID
FROM
( -- Lowest NetworkID per session
SELECT s.WEB_SESSION, n.USER_ID, NetworkID
,Min(NetworkID) Over (PARTITION BY s.WEB_SESSION) AS new_ID
FROM AG_Networks AS n
LEFT JOIN AG_Sessions AS s
ON s.USER_ID = n.USER_ID
QUALIFY new_ID <> n.NetworkID -- only new IDs, also removes single user sessions
AND Count(s.WEB_SESSION) Over (PARTITION BY NetworkID)> 0 -- remove networks without new sessions
) AS dt
GROUP BY 1
) AS src
ON src.USER_ID = tgt.USER_ID
AND src.new_ID <> tgt.NetworkID -- new NetworkID for user
WHEN MATCHED THEN
UPDATE SET NetworkID = src.new_ID
;
-- no new IDs
UNTIL Activity_Count = 0
END REPEAT;
END;