Search code examples
pythonneo4jcypherpy2neopubmed

neo4j CYPHER - at ON MATCH SET create new nodes on condition


To import XML data into a neo4j DB I first parse the XML to a python dictionary and then use CYPHER queries:

WITH $pubmed_dict as pubmed_article
UNWIND pubmed_article as particle
...
FOREACH (author IN particle.MedlineCitation.Article.AuthorList.Author |
  MERGE (a:Author {last_name: COALESCE(author.LastName, 'LAST NAME MISSING!')})
  ON CREATE SET a.first_name = author.ForeName, a.affiliation = author.AffiliationInfo.Affiliation
  ON MATCH SET a.first_name = author.ForeName, a.affiliation = author.AffiliationInfo.Affiliation
  MERGE (p)<-[:WROTE]-(a)      
)

Unfortunately, Authors don't have unique IDs in the database, so it might be that different authors have the same last names but different initials or affiliations.

...
                <Author ValidYN="Y">
                    <LastName>Smith</LastName>
                    <ForeName>A L</ForeName>
                    <Initials>AL</Initials>
                    <AffiliationInfo>
                        <Affiliation>University X</Affiliation>
                    </AffiliationInfo>
                </Author>
...
                <Author ValidYN="Y">
                    <LastName>Smith</LastName>
                    <ForeName>A L</ForeName>
                    <Initials>AL</Initials>
                    <AffiliationInfo>
                        <Affiliation>University BUMBABU</Affiliation>
                    </AffiliationInfo>
                </Author>

My intention was to MERGE on author.LastName but ON MATCH check if the author has the same ForeName OR the same Affiliation and if not create a new node instead.

How would I do that using CYPHER queries?

EDIT 1

Node Key constraints are the solution, which is an Enterprise Edition feature, though. Looking for a workaround for that.

EDIT 2

This code is working almost perfectly:

WITH $pubmed_dict as pubmed_article
    UNWIND pubmed_article as particle
        MERGE (p:Publication {pmid: particle.MedlineCitation.PMID.text})
        ON CREATE SET p.title = COALESCE (particle.MedlineCitation.Article.Journal.Title, particle.MedlineCitation.Article.ArticleTitle)
        ON MATCH SET p.title = COALESCE (particle.MedlineCitation.Article.Journal.Title, particle.MedlineCitation.Article.ArticleTitle)

    FOREACH (author IN particle.MedlineCitation.Article.AuthorList.Author |
      MERGE (a:Author {last_name: COALESCE(author.LastName, 'LAST NAME MISSING!'), first_name: COALESCE(author.ForeName, 'FIRST NAME MISSING!')})
      MERGE (p)<-[:WROTE]-(a)      
    )

To sum it up: For every author I want to create a new author IF LastName OR ForeName OR Affiliation are different. I also need NEW Nodes for authors where LAST NAME MISSING! and FIRST NAME MISSING!

Is it possible to achieve this result WITHOUT Key Node Constraints? (because this is an Enterprise Edition feature...)


Solution

  • The authors do have a unique ID in Neo4j, the node ID. That can be used to identify the node and then the set the properties. Maybe something like this:

    Match (a:Author{LastName:'xxx',ForeName:'yyy'}) 
    with a, id(a) as ID
    where ID > -1
    match (b) where id(b)=ID set b.first_name = author.ForeName, b.affiliation = author.AffiliationInfo.Affiliation
    

    The node's ID is not necessarily stable or predictable, so you have to access it directly before using it.

    Because you are using python code, you might to better with a global query to pull down the author node data:

    match (a:Author{LastName:'xxx',ForeName:'yyy'})  return a.LastName,a.ForeName,id(a) as ID
    

    then, you can write a csv file to bulk upload the desired info. The csv could look like this:

    > "ID","ForeName","LastName","Affiliation" 
    "26","David","Smith","Johns Hopkins" 
    etc.
    

    The python code could do the filtering of nodes that do not need processing.

    Then load the file:

    LOAD CVS with HEADER file:///'xxx.csv' as line 
    match (a) where id(a)=toInteger(line.ID) 
    set a.Affiliation=line.toString(line.Affiliation")