Calling HTTPS URL with SOLR's DataImportHandler returns 403


(This took me a while to figure out, so I'm posting both the question and the answer in case it helps someone else.)

The URL from which the DataImportHandler has to retrieve the data is secured via HTTPS and an additional auth parameter. The configuration of the DataImportHandler looks like this:

<dataConfig>
    <dataSource type="URLDataSource"
                baseUrl="https://www.gutscheinpony.de/"
                encoding="UTF-8"/>
    <document>
        <entity name="pony"
                pk="id"
                url="feeds.xml?auth=XXX"
                processor="XPathEntityProcessor"
                forEach="/data/offers/offer"
                xsl="xslt/gutscheinpony.xsl">

            <!-- fields omitted -->

        </entity>
    </document>
</dataConfig>

Running this on a regular SOLR 6 installation fails with a 403 Forbidden response, while a quick test of the same URL via curl succeeds (only the relevant output is shown):

curl -Iv "https://www.gutscheinpony.de/feeds.xml?auth=XXX"
> Host: www.gutscheinpony.de
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 200 OK
HTTP/1.1 200 OK

Is it possible to set the User Agent for DataImportHandler connections without writing custom Java code?


Solution

  • The difference is the User-Agent header. Java's built-in HTTP client does not identify itself like curl or a browser; by default it sends a generic Java/<version> User-Agent, which this server evidently rejects. Neither SOLR nor the DataImportHandler sets a different one.

    A User-Agent value can be set for a whole Java process through the system property http.agent. The value only matters if the remote server actually checks it.

    Thus, the DataImportHandler will run fine when SOLR is started like this:

    bin/solr -f -Dhttp.agent="test/me"
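
To avoid passing the flag on every start, the same property can probably be made persistent through SOLR_OPTS. The following is a minimal sketch, assuming a standard Solr 6 layout in which bin/solr reads its include file bin/solr.in.sh; the file location can differ for service installations, and "test/me" is just the example value from above.

```shell
# Sketch for bin/solr.in.sh (assumed location; adjust for your installation).
# Appends the User-Agent system property to the JVM options used to start Solr,
# keeping whatever options were already set.
SOLR_OPTS="${SOLR_OPTS:-} -Dhttp.agent=test/me"
```

After a restart, the DataImportHandler request should carry the configured User-Agent. To check whether the User-Agent really is what the server objects to, the curl test from the question can be repeated with a Java-style value, for example by adding -A "Java/1.8.0" (the version string is a placeholder); if the assumption holds, that request should then be rejected as well.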