The following problem description was taken from the Shopee Code League 2021.
For each ticket, identify all contacts from each user if they have the same contact information. Each of these tickets is related either directly or indirectly through Email
, Phone
or Order ID
, therefore each ticket belongs to the same user. For example:
Ticket | ID | Order ID | Phone | Contacts | |
---|---|---|---|---|---|
A | 0 | [email protected] | 12345678 | NA | 5 |
B | 1 | NA | 12345678 | 682212345 | 2 |
C | 34567 | [email protected] | NA | 682212345 | 4 |
D | 78999 | [email protected] | NA | NA | 3 |
Order ID
Phone
Email
A
> B
> C
> D
In this example, this user has a total of 14 Contact
. The ticket_trace/contact pair for this user would be 0-1-34567-78999, 14
.
For each ticket, identify all other ID
that belong to the same user, sorted in ascending order, as well as the total Contact
the user had. Generate a csv
file with 2 columns in the below format:
ID | ticket_trace and Contact |
---|---|
0 | 0-1-34567-78999, 14 |
1 | 0-1-34567-78999, 14 |
⋮ | ⋮ |
Note that there are 500,000 rows of data. What is the most efficient way of solving this problem in a short time with the least time complexity and memory usage?
Sample of input file, contacts.json
:
[
{
"Id":0,
"Email":"[email protected]",
"Phone":"",
"Contacts":1,
"OrderId":""
},
{
"Id":1,
"Email":"",
"Phone":"329442681752",
"Contacts":4,
"OrderId":"vDDJJcxfLtSfkooPhbYnJdxov"
}, // more data
]
Click here for the problem dataset and more detailed description.
This is my solution using Python3. I used 2 dictionaries to store ticket connections. The first dictionary stores connected tickets by value. Second dictionary stores each id connection (trace). Finally, calculate contacts using trace from the second dictionary. Yeah, also a very slow way. It requires 3 loops. It takes up to 15 seconds to calculate all 500.000 rows in Kaggle kernel.
# Import tools
import numpy as np
import pandas as pd
# Load data
df = pd.read_json('/kaggle/input/scl-2021-da/contacts.json')
npdata = df.values
# Initialize memory
memory = {}
connections = {}
# Store connected tickets by value
def add_to_memory(ticket_id, value):
if value != "":
if value in memory:
memory[value].add(ticket_id)
return
memory[value] = {ticket_id}
for row in npdata:
ticket_id = row[0]
# Order Id
add_to_memory(ticket_id, row[4])
# Email
add_to_memory(ticket_id, row[1])
# Phone
add_to_memory(ticket_id, row[2])
# Calculate Trace
for ids in memory.values():
current_connection = set(ids)
for uid in ids:
if uid in connections:
current_connection.update(connections[uid])
for uid in current_connection:
connections[uid] = current_connection
# Calculate contacts and add to list
output = []
for ticket_id, trace in sorted(connections.items()):
contacts = np.sum(npdata[list(trace), 3])
trace = "-".join([str(_id) for _id in sorted(trace)])
answer = "{}, {}".format(trace, contacts)
output.append({"ticket_id": ticket_id, "ticket_trace/contact": answer})
# Convert to pandas DataFrame & save to csv
output_df = pd.DataFrame(output)
filename = "output.csv"
output_df.to_csv(filename, index=False)