neo4j distributed-computing high-availability

Neo4j - Difference between High Availability and Distributed Mechanism?

Neo4j explains clustering through a concept called High Availability. What I know about clustering, from my Hadoop background, is distributed computing.

What is the difference between these two concepts?

Thanks

Solution

Neo4j High Availability (HA) is an approach for scaling the number of requests Neo4j can respond to. Neo4j HA implements a master-slave replication clustering model. This means that all writes go to the "master" server (or are proxied to the master from the slaves), and each update is then synchronized to the slave servers. Reads can be sent to any server in the cluster, which scales out the number of requests the database can serve.
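To make the routing concrete, here is a minimal toy sketch in Python. It is not the Neo4j API: the `Server` and `Cluster` classes, the server names, and the round-robin read policy are all assumptions made up for illustration, and replication is done synchronously here purely for simplicity.

```python
import itertools


class Server:
    """Hypothetical stand-in for one database instance holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Toy master-slave cluster: writes go through the master, reads go anywhere."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        # Assumed read policy: rotate over every server, master included.
        self._readers = itertools.cycle([master] + slaves)

    def write(self, key, value):
        # Every write is applied on the master first...
        self.master.data[key] = value
        # ...and then copied to each slave so all servers hold the same data.
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Any server can answer a read because each one has a full copy.
        server = next(self._readers)
        return server.name, server.data.get(key)


cluster = Cluster(Server("master"), [Server("slave-1"), Server("slave-2")])
cluster.write("alice", {"knows": ["bob"]})
served_by = [cluster.read("alice")[0] for _ in range(3)]
```

Adding more slaves in this model adds read capacity, but write capacity stays bounded by the single master.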

Compare this to distributed computing, which is a general term for performing computation in parallel across a large number of machines. One key difference is the concept of data sharding. With Neo4j, each server in the cluster contains a full copy of the graph, whereas with a distributed filesystem such as HDFS, the data is sharded and each machine stores only a small piece of the entire dataset.
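The storage difference can be sketched in a few lines of Python. This is only an illustration: the record names, machine names, and the CRC32-based `shard_for` function are assumptions, and real systems such as HDFS place data in blocks using their own policies rather than this simple hash.

```python
import zlib

records = [f"record-{i}" for i in range(12)]
machines = ["machine-0", "machine-1", "machine-2"]


def shard_for(key, num_shards):
    """Hypothetical placement rule: a deterministic hash of the key picks the shard."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Full replication: every machine stores the whole dataset.
replicated = {machine: set(records) for machine in machines}

# Sharding: every record is stored on exactly one machine.
sharded = {machine: set() for machine in machines}
for record in records:
    owner = machines[shard_for(record, len(machines))]
    sharded[owner].add(record)

for machine in machines:
    print(machine, "replicated:", len(replicated[machine]), "sharded:", len(sharded[machine]))
```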

Part of the reason Neo4j does not shard the graph is that a graph is a highly connected data structure: traversing a distributed/sharded graph would incur a lot of network latency as the traversal "hops" from machine to machine.
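A rough way to see the cost is to count how many edges a traversal follows across machine boundaries. The sketch below assumes a tiny made-up graph and two hypothetical placements (`replicated` and `sharded`); it only counts cross-machine edges and does not model actual latency.

```python
# Toy directed graph as an adjacency list (made up for illustration).
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}


def count_network_hops(graph, start, placement):
    """Traverse from start and count followed edges whose endpoints live on different machines."""
    hops = 0
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph[node]:
            # Following this edge needs a network call if the neighbor is stored elsewhere.
            if placement[node] != placement[neighbor]:
                hops += 1
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return hops


# Full copy on one server: the whole traversal stays local.
replicated = {node: "server-1" for node in graph}

# Sharded: nodes are spread over two machines.
sharded = {"a": "machine-1", "b": "machine-2", "c": "machine-1", "d": "machine-2"}

local_hops = count_network_hops(graph, "a", replicated)
remote_hops = count_network_hops(graph, "a", sharded)
```

With the full copy, no edge crosses a machine boundary. With this particular sharded placement, the edges a-b and c-d each cross one, and the count grows with the size and connectedness of the graph.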