Search code examples
algorithmdata-structuresrelationshipkaggle

Multi-Channel Contacts Problem: Find related tickets based on their information


Problem Description

The following problem description was taken from the Shopee Code League 2021.

For each ticket, identify all contacts from each user if they have the same contact information. Each of these tickets is related either directly or indirectly through Email, Phone or Order ID, therefore each ticket belongs to the same user. For example:

Ticket ID Email Order ID Phone Contacts
A 0 [email protected] 12345678 NA 5
B 1 NA 12345678 682212345 2
C 34567 [email protected] NA 682212345 4
D 78999 [email protected] NA NA 3
  • Ticket A and B are linked through Order ID
  • Tickets B and C are linked through Phone
  • Tickets C and D are linked through Email
  • Tickets A and D are indirectly linked through tickets A > B > C > D

In this example, this user has a total of 14 Contact. The ticket_trace/contact pair for this user would be 0-1-34567-78999, 14.

For each ticket, identify all other ID that belong to the same user, sorted in ascending order, as well as the total Contact the user had. Generate a csv file with 2 columns in the below format:

ID ticket_trace and Contact
0 0-1-34567-78999, 14
1 0-1-34567-78999, 14

Note that there are 500,000 rows of data. What is the most efficient way of solving this problem in a short time with the least time complexity and memory usage?

Problem Dataset

Sample of input file, contacts.json:

[
   {
      "Id":0,
      "Email":"[email protected]",
      "Phone":"",
      "Contacts":1,
      "OrderId":""
   },
   {
      "Id":1,
      "Email":"",
      "Phone":"329442681752",
      "Contacts":4,
      "OrderId":"vDDJJcxfLtSfkooPhbYnJdxov"
   }, // more data
]

Click here for the problem dataset and more detailed description.


Solution

  • This is my solution using Python3. I used 2 dictionaries to store ticket connections. The first dictionary stores connected tickets by value. Second dictionary stores each id connection (trace). Finally, calculate contacts using trace from the second dictionary. Yeah, also a very slow way. It requires 3 loops. It takes up to 15 seconds to calculate all 500.000 rows in Kaggle kernel.

    Kaggle kernel

    # Import tools
    import numpy as np
    import pandas as pd
    
    # Load data
    df = pd.read_json('/kaggle/input/scl-2021-da/contacts.json')
    npdata = df.values
    
    # Initialize memory
    memory = {}
    connections = {}
    
    # Store connected tickets by value
    def add_to_memory(ticket_id, value):
        if value != "":
            if value in memory:
                memory[value].add(ticket_id)
                return
            memory[value] = {ticket_id}
    
    for row in npdata:
        ticket_id = row[0]
    
        # Order Id
        add_to_memory(ticket_id, row[4])
    
        # Email
        add_to_memory(ticket_id, row[1])
    
        # Phone
        add_to_memory(ticket_id, row[2])
    
    # Calculate Trace
    for ids in memory.values():
        current_connection = set(ids)
    
        for uid in ids:
            if uid in connections:
                current_connection.update(connections[uid])
    
        for uid in current_connection:
            connections[uid] = current_connection
    
    # Calculate contacts and add to list
    output = []
    for ticket_id, trace in sorted(connections.items()):
        contacts = np.sum(npdata[list(trace), 3])
        trace = "-".join([str(_id) for _id in sorted(trace)])
        answer = "{}, {}".format(trace, contacts)
        output.append({"ticket_id": ticket_id,  "ticket_trace/contact": answer})
    
    # Convert to pandas DataFrame & save to csv
    output_df = pd.DataFrame(output)
    filename = "output.csv"
    output_df.to_csv(filename, index=False)
    

    Reference