Tags: java, jpa, jcr, jackrabbit, dms

JCR vs JPA for a DMS: performance, benefits, drawbacks


After doing some research on JCR versus RDBMS, and reading other posts, I am still uncertain whether to use JCR over JPA for a document management system, which has to deal with different document types, very large files and a lot of concurrent access from many users.

My main reason to consider JCR is that documents look like content to me, and the specification already deals with some of the problems that come with it - mostly I am interested in storage and versioning. I would also like to encapsulate the document handling within a JCR implementation and use JPA for everything else that is application specific.

Maybe someone can help me with my remaining questions:

  • How does the read/query performance of JCR compare to JPA (I know it will vary greatly depending on the implementation, but there might be some rules of thumb)?
  • Does anybody have real-world experience in a similar use case with specific JCR implementations? If so, did you mix it with a relational database (JPA)?
  • Is it worth the overhead of introducing JCR considering its benefits of file storage and versioning? (I am likely going to use my own custom access control (JPA) and I will not need the extra flexibility to introduce new node properties at runtime.)
  • Does anybody have any experience with data integrity and backup solutions?

UPDATE: even though this question has been answered in detail, somebody might have a more critical view of its use from a more practical standpoint. Personally I am getting more and more concerned about the following non-technical issues:

  1. Documentation: Jackrabbit has poor documentation. Its guide to OCM contains a dead link in the first paragraph, some example search queries throw exceptions for unknown reasons, there is a TODO in a very basic tutorial, and its standalone server does not work properly on JDK 8, which is not documented at all.
  2. Maturity: Jackrabbit Oak still seems to be a work in progress, and the other solutions look either abandoned or bleeding edge.
  3. Community: In contrast to JPA, researching JCR yields far fewer hits. This could be a real problem when a project team new to the technology gets stuck on (trivial) problems.

Solution

  • Short version: Documents are structured or semi-structured content. That's THE use case for hierarchically organized data storage. You should go for JCR if you don't want to implement all the basic DMS/CMS functionality yourself (consider this: you're probably doing it for the first time, while they have been doing it all along).

    Long version: JCR covers many of the basic use cases of document or content management systems by specification, such as versioning, locking, lifecycle management and referential integrity. Further, it allows you to extend your data without changing the schema (of course you can define your node types in a model, but you don't have to). Most JCR implementations (like Jackrabbit) can use a database in the backend, making them "little more" than an abstraction layer over your relational backend. To deal with large data, you can use filesystem storage for the binaries (which is much faster than storing all binary data in the database) while storing the structured data (nodes and properties) in the database.
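    The split between binaries on the filesystem and structured data in the database can be illustrated with a small, standard-library-only sketch. It is not Jackrabbit's data store code; the class `BinaryFileStore`, its methods and the content-addressed layout (file name derived from a SHA-256 hash) are assumptions made for illustration. Only the returned identifier would need to be stored next to the node properties.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration: binaries live on the filesystem, and only the
// returned identifier would be stored with the metadata in the database.
final class BinaryFileStore {
    private final Path root;

    BinaryFileStore(Path root) {
        this.root = root;
    }

    // Writes the bytes (unless already present) and returns a content-derived id.
    String put(byte[] content) {
        String id = sha256Hex(content);
        Path target = pathFor(id);
        try {
            if (Files.notExists(target)) {
                Files.createDirectories(target.getParent());
                Files.write(target, content);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return id;
    }

    // Reads the bytes previously stored under the given id.
    byte[] get(String id) {
        try {
            return Files.readAllBytes(pathFor(id));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fans out into two directory levels so no single directory grows too large.
    Path pathFor(String id) {
        return root.resolve(id.substring(0, 2)).resolve(id.substring(2, 4)).resolve(id);
    }

    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

    Because the identifier is derived from the content, storing the same file twice writes it only once, which is one reason this kind of layout suits a DMS with many large, often duplicated files.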

    When going for JPA you have to deal with all this DMS/CMS functionality yourself. Of course you can do it, but it is much more low-level programming that has already been done in the JCR implementations. Every model change requires a schema change, and the table layout is not that trivial (do you want one big table for your documents, with every property being a column? Do you want a separate table for each document class? How do you model lifecycles? How do you model versioning?)
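    To give a feel for the bookkeeping this implies, here is a minimal in-memory sketch of hand-rolled versioning. All names (`VersionedDocument`, `checkin`, `restore`) are hypothetical, and the persistence layer is left out on purpose: with JPA, each snapshot would presumably become a row in a separate version table that you design and migrate yourself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the version bookkeeping an application owns
// when it does not delegate versioning to a content repository.
final class VersionedDocument {
    // One immutable snapshot per check-in.
    static final class Version {
        final int number;
        final Map<String, String> properties;

        Version(int number, Map<String, String> properties) {
            this.number = number;
            this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
        }
    }

    private final Map<String, String> working = new LinkedHashMap<>();
    private final List<Version> history = new ArrayList<>();

    void set(String name, String value) {
        working.put(name, value);
    }

    String get(String name) {
        return working.get(name);
    }

    // Freezes the current working state as the next version and returns its number.
    int checkin() {
        Version version = new Version(history.size() + 1, working);
        history.add(version);
        return version.number;
    }

    // Replaces the working state with the snapshot of an earlier version (1-based).
    void restore(int number) {
        Version version = history.get(number - 1);
        working.clear();
        working.putAll(version.properties);
    }

    int versionCount() {
        return history.size();
    }
}
```

    Even this toy version ignores concurrent check-ins, binary content, labels and version graphs, all of which you would have to design yourself.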

    For your first steps with JCR, I'd recommend David's Model: consider everything in your application as content. I worked on a project where we decided against a mix of JCR and JPA so that we didn't have to deal with different APIs for storage.

    And there are several JCR implementations out there:

    • Jackrabbit 2 (Reference implementation, optimized for read operations, currently in maintenance mode)
    • Jackrabbit Oak (aiming for highly scalable content repositories with balanced read/write performance; it's from the same core team as Jackrabbit)
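    • Jackrabbit FileVault (maps repository content to and from the filesystem; as far as I know it is a companion tool for export/import rather than a repository implementation of its own)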
    • ModeShape (alternative implementation, fast and scalable, with a REST API and quite good documentation)

    By the way, the JCR API and its implementations were designed pretty much with a RESTful architecture in mind. So if you are considering a REST API, the mapping is rather simple, too. Further, it allows consumers to explore the content directly via the JCR API, making it easy to integrate the content into other applications (e.g. read-only), whereas with JPA you have to reveal the internal design of your database, making consumer contracts more likely to break on changes.
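    As a rough illustration of how direct that mapping can be, the sketch below turns an absolute node path into a resource URL and back using only `java.net.URI`. The class `NodePathUrls` and the base URL are hypothetical; a real REST layer would also have to deal with workspaces, properties and access control.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: a hierarchical node path maps one-to-one onto a URL path.
final class NodePathUrls {
    // Assumed API root without a trailing slash, e.g. "https://dms.example.org/api/content".
    private final String baseUrl;

    NodePathUrls(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Percent-encodes characters that are illegal in a URL path (such as spaces).
    String toUrl(String nodePath) {
        if (!nodePath.startsWith("/")) {
            throw new IllegalArgumentException("absolute node path expected: " + nodePath);
        }
        try {
            return baseUrl + new URI(null, null, nodePath, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Strips the API root and decodes the remaining path back into a node path.
    String toNodePath(String url) {
        if (!url.startsWith(baseUrl + "/")) {
            throw new IllegalArgumentException("URL is outside the content API: " + url);
        }
        return URI.create(url.substring(baseUrl.length())).getPath();
    }
}
```

    For example, `new NodePathUrls("https://dms.example.org/api/content").toUrl("/documents/annual report.pdf")` yields `https://dms.example.org/api/content/documents/annual%20report.pdf`, and `toNodePath` reverses it.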

    Regarding your remaining questions:

    • I have no comparison charts, and as usual it depends on the data structure, the indices and your query design. JCR implementations have built-in caching and you typically iterate over result sets. So there is no general statement regarding faster/slower; it all depends on the use case.
    • I have done a similar thing and we were satisfied with the Jackrabbit implementation, but we were on JDK 7. We had all data (including user settings, application settings, etc.) in the repository and no JPA persistence at all. There is also an Object Content Mapping (OCM) available, if you need it.
    • Yes, it is worth introducing. Jackrabbit has its own user management available, so you don't have to implement that yourself. Access control is available through the JCR API and JAAS. However, I recommend not using the JCA ResourceAdapter for administering user management and access control, as it does not expose the Jackrabbit API.
    • The questions regarding data integrity and backup are not specific to JCR or JPA. Both ensure integrity at some level (the database enforces its own integrity, JCR handles referential integrity) and both can be backed up (database backup, filesystem backup). And both are standardized ways of accessing data, so you can even implement your own backup logic.