
Exploring AnzoGraph as an alternative for Transactional RDF Workloads


Let's say my application needs to regularly persist knowledge generated by users in RDF format, preserving ACID transaction properties (with the beginning and end of each transaction controlled by my application). In other words, I need to be able to use the AnzoGraph HTTP API to send RDF data (quads) on demand, preferably in the body of HTTP requests (instead of files stored somewhere). Is this currently possible?

As this is currently possible in other triplestores (such as Jena Fuseki and AllegroGraph), it surprised me that I could not find a way to do it with AnzoGraph. I've tried exploring the /sparql, /rdf-graph-store, and /rdf-graphs APIs, but none of them provided the functionality I need. (The /data API only seems to support ingestion in Turtle format, as I received a "Content-Type not supported" error for every other option I tried.) Is there any documentation or tutorial that could help me with that?

A key requirement for my application is to be able to preserve the graphs I have in the input quads. For example, let's say I have batches of quads (from different graphs) being generated by the application and I need to ingest them via HTTP requests. Another important requirement would be to have some container of RDF graphs, such as Jena Fuseki's datasets or AllegroGraph repositories, that would allow me to use the same AnzoGraph instance for different services/applications that manage independent RDF Datasets. Does AnzoGraph have support for something like that?

Finally, I have been able to achieve what I need using SPARQL INSERT DATA queries, but given that other triplestores provide other quad ingestion mechanisms, I assumed AnzoGraph would have some better (maybe more performant) alternative. If this is not currently supported, is it part of the development roadmap?


Solution

  • AnzoGraph has been architected for OLAP. Essentially, it is an MPP data warehouse, like Snowflake or Redshift, but for graph (i.e. much more complex) data. It is designed to treat a cluster of servers as a single database and scales out horizontally with a shared-nothing design. Unlike other graph stores you may have used, you should think of it as a venue for integrating data rather than a destination for integrated data or a backend for a transaction-oriented application. It supports very fast load, ETL, ELT and virtualization. A common usage pattern is to load data, manipulate it through arbitrarily complex transformation queries, and then run your analytics once you are happy with the graph structure, as you would with any data warehouse. You can read more here: https://blog.cambridgesemantics.com/anzograph-db-benchmarking-guide

    It uses graphs as the containers you mention, each storing an entire data set, but it does not support quads, and quad support is not on the roadmap at this time. Quads proved too heavy to scale for our primary data integration/analytics use cases, and we now use other approaches for segmenting data. Documentation on loading is here: https://docs.cambridgesemantics.com/anzograph/v2.5/userdoc/azg-data.htm

    You can use SPARQL INSERT/UPDATE/DELETE, as you have discovered. It is also possible to write a UDS (user-defined service) that pulls data into the graph programmatically. It can do this in parallel, which is fast if your sources can sustain it and support enough simultaneous connections. Our ETL/virtualization subsystem, GDI, is a general-purpose example of a UDS: https://docs.cambridgesemantics.com/anzograph/v2.5/userdoc/gdi-intro.htm
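
For readers taking the SPARQL INSERT route mentioned above, here is a minimal Python sketch of how a batch of quads could be turned into a single INSERT DATA update that keeps each triple in its named graph. It is an illustration, not AnzoGraph's documented client API: the function names `build_insert_data` and `post_update` are hypothetical, terms are assumed to be already serialized in N-Triples syntax, and the endpoint URL is a placeholder. Posting the body with `Content-Type: application/sparql-update` follows the standard SPARQL 1.1 Protocol; whether a given AnzoGraph endpoint accepts it, and whether one request is applied atomically, should be checked against the product documentation.

```python
from collections import defaultdict
from urllib import request


def build_insert_data(quads):
    """Group (s, p, o, g) quads by graph and build one INSERT DATA update.

    Assumption: every term is already valid N-Triples syntax,
    e.g. "<http://example.org/s>" or '"a literal"'.
    """
    by_graph = defaultdict(list)
    for s, p, o, g in quads:
        by_graph[g].append(f"{s} {p} {o} .")

    blocks = []
    for graph, triples in by_graph.items():
        body = "\n    ".join(triples)
        blocks.append(f"  GRAPH {graph} {{\n    {body}\n  }}")
    return "INSERT DATA {\n" + "\n".join(blocks) + "\n}"


def post_update(endpoint, update, timeout=30.0):
    """Send the update using the SPARQL 1.1 Protocol 'POST directly' form.

    Assumption: `endpoint` is the SPARQL update URL of your server;
    authentication, if any, is not handled here.
    """
    req = request.Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    batch = [
        ("<http://example.org/s1>", "<http://example.org/p>", '"a"', "<http://example.org/g1>"),
        ("<http://example.org/s2>", "<http://example.org/p>", '"b"', "<http://example.org/g2>"),
    ]
    # Only prints the update text; call post_update(...) with a real endpoint to send it.
    print(build_insert_data(batch))
```

Sending one update per application-level batch keeps the number of HTTP round trips low, and the graph IRIs can carry whatever naming scheme you use to keep independent datasets apart.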