Search code examples
neo4jcypherneo4j-apoc

Cypher query for list pattern


I have a schema which looks like below:

  1. A customer is linked to another customer with a relationship SIMILAR having similarity score.

    Example: (c1:Customer)-->(c2:Customer)

  2. An Email node is connected to each customer with relationship MAIL_AT with the following node properties:

 {
     "active_email_address": "a@mail.com",
     "cibil_email_addresses": [
         "b@mail.com", "c@mail.com"
     ]
 }

Example: (e1:Email)<-[:MAIL_AT]-(c1:Customer)-[:SIMILAR]->(c2:Customer)-[:MAIL_AT]->(e2:Email)

  1. A Risk node with some risk-related properties (below) and is related to customer with relationship HAS_RISK:
  {
    "f0_score": 870.0,
    "pta_score": 430.0
  }
  1. A Fraud node with some fraud-related properties (below) and is related to customer with relationship IS_FRAUD:
  {
    "has_commited_fraud": true
  }

My Objectives:

  1. To find the customers with common email addresses (irrespective of active and secondary)?

My tentative solution:

MATCH (email:Email)
WITH email.cibil_email_addresses + email.active_email_address AS emailAddress, email
UNWIND emailAddress AS eaddr
WITH DISTINCT eaddr AS deaddr, email
UNWIND deaddr AS eaddress
MATCH (customer:Customer)-[]->(someEmail:Email)
WHERE eaddress IN someEmail.cibil_email_addresses + someEmail.active_email_address
WITH eaddress, COLLECT(customer.customer_id) AS customers
RETURN eaddress, customers

Problem: It is taking forever to execute this. Working with lists will take time I understand, however, I'm flexible to change the schema (If suggested). Should I break the email address into separate nodes? If yes, then how do I break cibil_email_addresses into different nodes as they can vary - Should I create two nodes with different cibil email addresses and connect both of them to customer with relationship HAS_CIBIL_EMAIL? (Is this a valid schema design). Also, it is possible, a customer's active_email_address is present in other customer's cibil_email_address. I'm trying to find a synthetic identity attack. PS: If there exists some APOC that can help achieve this and below, do suggest with example.

  1. In production, for a given customer with email addresses, risk values, similarity score, and also given other customers may or may not be tagged with fraud_status, I want to check whether this new person will fall in a fraud ring or not. PS: If I need to use any gds to solve this, please suggest with examples.

  2. If I were to do this same exercise with some other node such as Address which may be partially matching and will be having same list of historical addresses in a list, what should be my ideal approach?

I know, I'm tagging someone in my question, but that person only seems to be active with respect to Cypher on StackOverflow. @cybersam any help? Thanks.


Solution

  • This should work:

    MATCH (e:Email)
    UNWIND (e.cibil_email_addresses + e.active_email_address) AS address
    WITH address, COLLECT(e) AS es
    UNWIND es AS email
    MATCH (email)<-[:MAIL_AT]-(cust)
    RETURN address, COLLECT(cust) AS customers
    

    The WITH clause takes advantage of the arregating function COLLECT to automatically collect all the Email nodes containing the same address, using address as the grouping key.

    You should only ask one question at a time. You have a couple of other questions at the bottom. If you continue to need help with them, please create new questions.