machine-learning neo4j cypher graph-databases

Understanding FastRP vs scaleProperties


I am trying to understand the difference between these two steps and the error I am receiving. I followed this tutorial to practice KNN with my own data: https://towardsdatascience.com/create-a-similarity-graph-from-node-properties-with-neo4j-2d26bb9d829e

During the process we project our graph of interest; mine contains three node properties of organisms: bd_load, weight, and length. Following the example, the code below scales the 3 variables and stores them in a single scaledProperties vector.

Project graph

//(5) project graph of interest
CALL gds.graph.project('bd_graph',
'node_sim',
'*',
{nodeProperties:['bd_load', 'weight', 'length']})

Scale variables of interest between 0-1 for future Euclidean distance calculation

//(6) add scalar 0-1
CALL gds.alpha.scaleProperties.mutate('bd_graph',
{nodeProperties:['bd_load', 'weight', 'length'],
scaler:'MinMax',
mutateProperty:'scaledProperties'})
YIELD nodePropertiesWritten
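
For intuition, MinMax scaling maps each value x of a property to (x - min) / (max - min), computed separately for each property. A minimal sketch in plain Cypher on a made-up list of numbers (the values and the names raw, lo, and hi are hypothetical and not taken from the bd_graph data; GDS performs the equivalent step for each listed property):

```cypher
// Hypothetical sample values for a single property.
WITH [2.0, 4.0, 10.0] AS raw
UNWIND raw AS v
// min() and max() aggregate over the unwound values; raw is the grouping key.
WITH raw, min(v) AS lo, max(v) AS hi
// Rescale every value into the range 0-1.
RETURN [x IN raw | (x - lo) / (hi - lo)] AS scaled
```

Here the smallest value maps to 0 and the largest to 1, so properties measured in different units contribute on a comparable scale to the Euclidean distance.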

We can then run KNN based on Euclidean distance:

//(8) project relationship to graph
CALL gds.knn.mutate("bd_graph",
               {nodeProperties: {scaledProperties: "EUCLIDEAN"},
               topK: 15,
               mutateRelationshipType: "IS_SIMILAR",
               mutateProperty: "similarity",
               similarityCutoff: 0.6409912109375,
               sampleRate:1,
               randomSeed:42,
               concurrency:1}
              )

As I continue up the learning curve with Neo4j, I am trying to understand the difference between scaleProperties and FastRP. Today I tried to create embeddings from my 3 variables using FastRP with 8 dimensions on my projected graph, without running scaleProperties first. My thought was that increasing the dimensions would be better for finding similarities between nodes. The code below runs fine and produces an embedding vector with 8 elements.

FastRP

CALL gds.fastRP.mutate(
  'bd_graph',
  {
    embeddingDimension: 8,
    mutateProperty: 'fastrp-embedding',
    featureProperties: ['bd_load', 'weight', 'length']
  }
)
YIELD nodePropertiesWritten

But when I run the code below:

CALL gds.knn.stats("bd_graph",
   {
      nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
      topK:10,
      sampleRate:1,
      randomSeed:42,
      concurrency:1
   }
) YIELD similarityDistribution 
RETURN similarityDistribution

I receive an error:

Invalid input '{': expected "+" or "-" (line 4, column 22 (offset: 97))
      nodeProperties:{fastrp-embedding:"EUCLIDEAN"},

Does the embedding length have to match the number of variables on the node? Am I using FastRP correctly, and is my understanding right that I create embeddings on the nodes and then calculate Euclidean distance between them for a similarity score?


Solution

  • I am glad you are finding the tutorial helpful and getting into GDS!

    Map keys in Cypher must be strings; see https://neo4j.com/docs/cypher-manual/current/syntax/maps/

    The - in your property name fastrp-embedding is not accepted as part of an unquoted key, because Cypher otherwise uses - as the minus operator. If you enclose the property name in backticks, the Cypher parser treats the special character as part of the map key. This should work for you:

    CALL gds.knn.stats("bd_graph",
       {
          nodeProperties:{`fastrp-embedding`:"EUCLIDEAN"},
          topK:10,
          sampleRate:1,
          randomSeed:42,
          concurrency:1
       }
    ) YIELD similarityDistribution 
    RETURN similarityDistribution
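
    To see the quoting rule in isolation, a map literal can be returned directly, without involving GDS. This is only a sketch; the alias config is a name chosen for illustration:

    ```cypher
    // Backtick-quoted key: the hyphen is part of the key name.
    RETURN {`fastrp-embedding`: 'EUCLIDEAN'} AS config;

    // Unquoted, the same key is a syntax error, because a bare identifier
    // cannot contain a hyphen:
    // RETURN {fastrp-embedding: 'EUCLIDEAN'} AS config;
    ```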
    

    The recommended format for Neo4j property names is camel case. If you name your property fastrpEmbedding instead of fastrp-embedding, you do not need the backticks.
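
    As a sketch of that alternative, assuming the same bd_graph projection and the same settings as in the question, the FastRP call can write to a camel-case property and KNN can then reference it without quoting. Only the property name fastrpEmbedding differs from the original calls:

    ```cypher
    // Same FastRP configuration as in the question, with a camel-case property name.
    CALL gds.fastRP.mutate(
      'bd_graph',
      {
        embeddingDimension: 8,
        mutateProperty: 'fastrpEmbedding',
        featureProperties: ['bd_load', 'weight', 'length']
      }
    )
    YIELD nodePropertiesWritten
    RETURN nodePropertiesWritten;

    // The map key is now a plain identifier, so no backticks are needed.
    CALL gds.knn.stats('bd_graph',
       {
          nodeProperties: {fastrpEmbedding: 'EUCLIDEAN'},
          topK: 10,
          sampleRate: 1,
          randomSeed: 42,
          concurrency: 1
       }
    ) YIELD similarityDistribution
    RETURN similarityDistribution;
    ```

    Run the two statements one after the other, or in a client that accepts multiple statements separated by semicolons.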