Tags: neo4j, cypher, neo4j-apoc, apoc

Can't apply a fuzzy distance function in a Cypher query that checks similarity against all node property values


I would like to find all triplets where the main node contains some value in one of its properties, using a fuzzy similarity function and keeping only the results above a predefined threshold, say 85%. What would be the best practice for this? Here is my initial query:

MATCH (n)-[r]->(k) WHERE ANY(x in keys(n) WHERE round(apoc.text.levenshteinSimilarity(n[x], "syn"), 4) > 0.8) RETURN n, r, k

Before the above query, I used a simpler approach (regex):

MATCH (n)-[r]->(k) WHERE ANY(x in keys(n) WHERE n[x] =~ '(?i){search_expression}.*') RETURN n, r, k

But when I run the first, more advanced query, I get:

Wrong argument type: Can't coerce `Long(1662902792106)` to String

When I wrap the property value in toString() and run the following query:

MATCH (n)-[r]->(k) WHERE ANY(x in keys(n) WHERE round(apoc.text.levenshteinSimilarity(toString(n[x]), "syn"), 4) > 0.8) RETURN n, r, k

the output is:

Invalid input for function 'toString()': Expected a String, Number, Boolean, Temporal or Duration, got: StringArray[ecr:PutImageTagMutability, ecr:StartImageScan, ecr:DescribeImageReplicationStatus, ecr:ListTagsForResource, ecr:UploadLayerPart, ecr:BatchDeleteImage, ecr:CreatePullThroughCacheRule, ecr:ListImages, ecr:BatchGetRepositoryScanningConfiguration, ecr:DeleteRepository, ecr:GetRegistryScanningConfiguration, ecr:CompleteLayerUpload, ecr:TagResource, ecr:DescribeRepositories, ecr:BatchCheckLayerAvailability, ecr:ReplicateImage, ecr:GetLifecyclePolicy, ecr:GetRegistryPolicy, ecr:PutLifecyclePolicy, ecr:DescribeImageScanFindings, ecr:GetLifecyclePolicyPreview, ecr:CreateRepository, ecr:DescribeRegistry, ecr:PutImageScanningConfiguration, ecr:GetDownloadUrlForLayer, ecr:DescribePullThroughCacheRules, ecr:GetAuthorizationToken, ecr:PutRegistryScanningConfiguration, ecr:DeletePullThroughCacheRule, ecr:DeleteLifecyclePolicy, ecr:PutImage, ecr:BatchImportUpstreamImage, ecr:UntagResource, ecr:BatchGetImage, ecr:DescribeImages, ecr:StartLifecyclePolicyPreview, ecr:InitiateLayerUpload, ecr:GetRepositoryPolicy, ecr:PutReplicationConfiguration]

Please advise.


Solution

  • Look at the value of n[x], where x is a property key of node n. n[x] can be an Integer, Float, String, Boolean, Point, Date, Time, LocalTime, DateTime, LocalDateTime, Duration, or a homogeneous list of simple types. The first error occurs because apoc.text.levenshteinSimilarity expects a String and received a Long. The second occurs because toString() does not accept a list, so calling it on a string array fails.

    Therefore, you need a way to convert any of these data types into a string so that you can apply apoc.text.levenshteinSimilarity to the node property n[x].

    MATCH (n)-[r]->(k)
    WHERE ANY(x IN keys(n)
        WHERE round(
            apoc.text.levenshteinSimilarity(
                // join every item of n[x] into one space-separated string
                TRIM(REDUCE(mergedString = "", item IN n[x] | mergedString + item + " ")),
                "syn"),
            4) > 0.8)
    RETURN n, r, k
    

    Here, REDUCE concatenates each item of the list (or array) into a single string, and TRIM removes the extra space left at the end of that string.
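
    One caveat with the merged-string approach: a long list becomes one long string, and an edit-distance similarity against a short term such as "syn" will then usually score low even when a single item is a close match. A possible variant, offered only as a sketch, compares each item on its own. It assumes that `item IN n[x]` iterates scalar properties the same way the REDUCE above relies on, and that every value has a type toString() accepts (a Point, for example, would still raise an error).

    ```cypher
    MATCH (n)-[r]->(k)
    WHERE ANY(x IN keys(n)
        // check each item of the property on its own instead of one merged string
        WHERE ANY(item IN n[x]
            WHERE apoc.text.levenshteinSimilarity(toString(item), "syn") > 0.8))
    RETURN n, r, k
    ```

    The search term "syn" and the 0.8 threshold are taken from the question. round() is omitted for brevity; add it back if exact parity with the original threshold check matters.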

    References:
    https://neo4j.com/docs/cypher-manual/current/syntax/values/
    https://neo4j.com/docs/cypher-manual/current/functions/list/#functions-reduce
    https://neo4j.com/docs/cypher-manual/current/functions/string/#functions-trim