Tags: xml, solr, dataimporthandler, data-import, schemaless

Indexing entire XML document on SOLR 7 with no field specification


I am trying to index an XML document in Solr (currently version 7.3.0) without defining specific fields in data-config, or at most by specifying one tag that picks up all the others. I tried schemaless mode, but I didn't get any documents back. Is there a way to do this, or can't Solr handle it?

This is an example of my document.xml. I would like Solr to detect all the tags and return their values without my having to define any fields. As I said, I tried schemaless mode and it didn't work.

<?xml version="1.0" encoding="UTF-8"?>
<digital_archive xmlns="https://www.site" dataCreazione="2017-05-11T17:15:00">
<DocumentalCategory>some data</DocumentalCategory>
<customer>some data</customer>
<producer>some data</producer>
<documentOwner>some data</documentOwner>
<sources>
    <source>
        <idc>
            <id scheme="adfr">some data</id>
            <name>some data</name>
            <path>sources\source\some_path.XML</path>
            <hash alg="SHA-256">3748738</hash>
        </idc>
        <vdc>
            <id scheme="some data">some data.XML</id>
            <timeReference>2017-03-17T14:19:01+0100</timeReference>
        </vdc>
    </source>
</sources>
<ud>
    <metadati>
        <Name>Jane</Name>
        <Surname>Doe</Surname>
        <FiscalCode>dsrsd6w7hedw</FiscalCode>
        <Date>29.10.2017</Date>
    </metadati>
</ud>
</digital_archive>

The result I expect is something like this:

<field name="DocumentalCategory">some data</field>
<field name="customer">some data</field>
<field name="producer">some data</field>
<field name="documentOwner">some data</field>
<field name="sources">
    <field name="source">
        <field name="idc">
            <field name="id" scheme="adfr">some data</field>
            <field name="name">some data</field>
            <field name="path">sources\source\some_path.XML</field>

Solution

  • Solr is a search engine, not a database. Its goal is to give you good search results; preserving the original structure of your data is less important.

    There are ways to index nested documents, but once you start searching them you will probably find yourself rethinking the import process.

    So I would recommend stepping back and first deciding how you want to find this information, and at what level (record or subrecord) results should be returned. Then you can revisit the import question.

    Schemaless mode is not going to help you here. It saves you from declaring fields up front, but it still expects the document to arrive in one of Solr's own update formats (XML, JSON or CSV). Yours is a custom XML format, so you need to transform it somehow. You can either use the Data Import Handler and define the mapping, or apply an XSLT transform on the way in so that the document matches what Solr expects. Either way, you will most likely have to do some flattening and id mapping.
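
    To illustrate the kind of flattening and id mapping meant here, below is a rough client-side sketch in Python. It is a third option rather than the DIH or XSLT route. It assumes you are happy with field names built by joining the element path with underscores (e.g. `ud_metadati_Name`), that attributes such as `scheme` can be dropped, and that you supply the `id` yourself. The function names and the `archive-1` id are made up for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, prefix=""):
    """Yield (field_name, value) pairs for every non-empty leaf under elem.

    Nested elements are flattened by joining their names with '_'.
    Attributes are ignored in this sketch.
    """
    for child in elem:
        name = prefix + local_name(child.tag)
        if len(child):  # element has children: recurse with a longer prefix
            yield from flatten(child, name + "_")
        elif child.text and child.text.strip():
            yield name, child.text.strip()


def to_solr_add_xml(source_xml, doc_id):
    """Build a Solr <add><doc> update message from the custom XML."""
    root = ET.fromstring(source_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # The id is supplied by the caller; how to derive it is up to you.
    ET.SubElement(doc, "field", name="id").text = doc_id
    for name, value in flatten(root):
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


# Shortened version of the document from the question.
SAMPLE = """<digital_archive xmlns="https://www.site">
  <customer>some data</customer>
  <ud><metadati><Name>Jane</Name></metadati></ud>
</digital_archive>"""

print(to_solr_add_xml(SAMPLE, doc_id="archive-1"))
```

    The resulting `<add><doc>...</doc></add>` message is in Solr's XML update format, so it can be posted to the core's `/update` handler, and schemaless mode can then create the flattened fields for you. Note that repeated elements (several `<source>` entries, for example) end up as multiple values of the same field, so you lose the grouping between them. That is exactly the trade-off to think about before settling on an import approach.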