
Store metadata into Jackrabbit repository


Can anybody explain how to proceed in the following scenario?

  1. receiving documents (MS docs, ODS, PDF)

  2. Dublin Core metadata extraction via Apache Tika + content extraction via jackrabbit-content-extractors

  3. using Jackrabbit to store the documents (content) in the repository together with their metadata

  4. retrieving documents + metadata

I'm interested in points 3 and 4 ...

DETAILS: The application processes documents interactively. It runs some analysis (language detection, word count, etc.), gathers as many details as possible (Dublin Core metadata, content parsing, event handling), and returns the results to the user. It then stores the extracted content and the metadata (both extracted and custom user metadata) in the JCR repository.

I'd appreciate any help, thank you.


Solution

  • Uploading files is basically the same for JCR 2.0 as it is for JCR 1.0. However, JCR 2.0 adds a few additional built-in property definitions that are useful.

    The "nt:file" node type is intended to represent a file and has two built-in property definitions in JCR 2.0 (both of which are auto-created by the repository when nodes are created):

    • jcr:created (DATE)
    • jcr:createdBy (STRING)

    and defines a single child named "jcr:content". This "jcr:content" node can be of any node type, but generally speaking all information pertaining to the content itself is stored on this child node. The de facto standard is to use the "nt:resource" node type, which has these properties defined:

    • jcr:data (BINARY) mandatory
    • jcr:lastModified (DATE) autocreated
    • jcr:lastModifiedBy (STRING) autocreated
    • jcr:mimeType (STRING) protected?
    • jcr:encoding (STRING) protected?

    Note that "jcr:mimeType" and "jcr:encoding" were added in JCR 2.0.

    In particular, the purpose of the "jcr:mimeType" property was to do exactly what you're asking for - capture the "type" of the content. However, the "jcr:mimeType" and "jcr:encoding" property definitions can be defined (by the JCR implementation) as protected (meaning the JCR implementation automatically sets them) - if this is the case, you would not be allowed to manually set these properties. I believe that Jackrabbit and ModeShape do not treat these as protected.

    Here is some code that shows how to upload a file into a JCR 2.0 repository using these built-in node types:

    // Get an input stream for the file ...
    File file = ...
    InputStream stream = new BufferedInputStream(new FileInputStream(file));
    
    // Create the nt:file node and its jcr:content child under an existing folder node
    // ("session" is an open javax.jcr.Session)
    Node folder = session.getNode("/absolute/path/to/folder/node");
    Node fileNode = folder.addNode("Article.pdf","nt:file");
    Node content = fileNode.addNode("jcr:content","nt:resource");
    // Wrap the stream in a JCR Binary and store it as the file's data
    Binary binary = session.getValueFactory().createBinary(stream);
    content.setProperty("jcr:data",binary);
    

    And if the JCR implementation does not treat the "jcr:mimeType" property as protected (which I believe is the case for Jackrabbit and ModeShape), you have to set this property manually:

    content.setProperty("jcr:mimeType","application/pdf");
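
    The literal "application/pdf" above has to come from somewhere. As a minimal sketch, assuming a guess based on the file name alone is good enough, the JDK's own URLConnection.guessContentTypeFromName can supply it. The class name MimeTypeGuess and the fallback value are illustrative choices, not part of JCR or Jackrabbit, and the file content is never inspected.

```java
import java.net.URLConnection;

public class MimeTypeGuess {
    /** Hypothetical fallback used when the file name gives no hint. */
    static final String DEFAULT_TYPE = "application/octet-stream";

    /** Guesses a MIME type from the file name's extension only. */
    static String guess(String fileName) {
        String type = URLConnection.guessContentTypeFromName(fileName);
        return type != null ? type : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(guess("Article.pdf"));
    }
}
```

    The result could then be passed to content.setProperty("jcr:mimeType", ...). Since the question already uses Apache Tika, its content-based detection is likely the more reliable source for this value.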
    

    Metadata can very easily be stored on the "nt:file" and "jcr:content" nodes, but out of the box the "nt:file" and "nt:resource" node types don't allow extra properties. So before you can add other properties, you first need to add a mixin (or multiple mixins) with property definitions for the kinds of properties you want to store. You can even define a mixin that allows any property. Here is a CND file defining such a mixin:

    <custom = 'http://example.com/mydomain'>
    [custom:extensible] mixin
    - * (undefined) multiple 
    - * (undefined) 
    

    After registering this node type definition, you can then use this on your nodes:

    content.addMixin("custom:extensible");
    content.setProperty("anyProp","some value");
    content.setProperty("custom:otherProp","some other value");
    

    You could also define and use a mixin that allowed for any Dublin Core element:

    <dc = 'http://purl.org/dc/elements/1.1/'>
    [dc:metadata] mixin
    - dc:contributor (STRING)
    - dc:coverage (STRING)
    - dc:creator (STRING)
    - dc:date (DATE)
    - dc:description (STRING)
    - dc:format (STRING)
    - dc:identifier (STRING)
    - dc:language (STRING)
    - dc:publisher (STRING)
    - dc:relation (STRING)
    - dc:rights (STRING)
    - dc:source (STRING)
    - dc:subject (STRING)
    - dc:title (STRING)
    - dc:type (STRING)
    

    All of these properties are optional, and unlike "custom:extensible" this mixin doesn't allow properties of arbitrary name or type. With this "dc:metadata" mixin I've also not really addressed the fact that some of these elements are already represented by built-in properties (e.g., "jcr:createdBy", "jcr:lastModifiedBy", "jcr:created", "jcr:lastModified", "jcr:mimeType"), or that some of them relate more to the content while others relate more to the file.
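
    Because such a mixin only accepts the names it declares, extracted metadata usually needs filtering before it is written to the node. The following is a minimal sketch in plain Java, assuming the extracted metadata is already available as a map of property name to string value. The class DublinCoreFilter and its filter method are hypothetical helpers, not part of JCR, Jackrabbit or Tika, and the set of allowed names is passed in by the caller.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class DublinCoreFilter {
    /**
     * Keeps only the entries whose name is in allowedNames and whose
     * value is non-empty; surviving values are trimmed.
     */
    static Map<String, String> filter(Map<String, String> extracted,
                                      Set<String> allowedNames) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String value = entry.getValue();
            if (allowedNames.contains(entry.getKey())
                    && value != null && !value.trim().isEmpty()) {
                result.put(entry.getKey(), value.trim());
            }
        }
        return result;
    }
}
```

    After content.addMixin("dc:metadata"), each remaining entry can be written with content.setProperty(name, value). Note that "dc:date" is declared as DATE, so its value needs to be a valid date rather than arbitrary text.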

    You could of course define other mixins that better suit your metadata needs, using inheritance where needed. But be careful using inheritance with mixins: since JCR allows a node to have multiple mixins, it's often best to design your mixins to be tightly scoped and facet-oriented (e.g., "ex:taggable", "ex:describable", etc.) and then simply apply the appropriate mixins to a node as needed.

    (It's even possible, though much more complicated, to define a mixin that allows more children under the "nt:file" nodes, and to store some metadata there.)

    Mixins are fantastic and give a tremendous amount of flexibility and power to your JCR content.

    Oh, and when you've created all of the nodes you want, be sure to save the session:

    session.save();