(This took me a while to figure out, so I'm posting both the question and the answer in case it helps someone else.)
The URL from which the DataImportHandler has to retrieve its data is secured via HTTPS and an additional auth parameter. The DataImportHandler configuration looks like this:
<dataConfig>
  <dataSource type="URLDataSource"
              baseUrl="https://www.gutscheinpony.de/"
              encoding="UTF-8"/>
  <document>
    <entity name="pony"
            pk="id"
            url="feeds.xml?auth=XXX"
            processor="XPathEntityProcessor"
            forEach="/data/offers/offer"
            xsl="xslt/gutscheinpony.xsl">
      <!-- fields omitted -->
    </entity>
  </document>
</dataConfig>
Running this on a regular Solr 6 installation fails with a 403 Forbidden response, while a quick test of the same URL with curl succeeds (showing only the interesting output):
curl -Iv 'https://www.gutscheinpony.de/feeds.xml?auth=XXX'
> Host: www.gutscheinpony.de
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
Is it possible to set the User-Agent for DataImportHandler connections without writing custom Java code?
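The difference is the User-Agent header. Java does not send a browser- or curl-like User-Agent by default (its built-in HTTP client typically identifies itself as Java/<version>), and neither Solr nor the DataImportHandler overrides this for HTTPS connections. The server apparently rejects such requests with a 403.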
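One way to check whether the User-Agent header really is what the server objects to is to repeat the curl request with different agent strings and compare the status codes. The following is only a sketch: the helper name status_for_agent and the Java version string in the usage lines are assumptions for illustration, and the URL is the one from the question.

```shell
# Print only the HTTP status code the server returns for a given User-Agent.
# $1 = User-Agent string (an empty string makes curl omit the header entirely)
# $2 = URL to request (a HEAD request is sent, as with `curl -I`)
status_for_agent() {
  curl -s -I -o /dev/null -w '%{http_code}\n' -A "$1" "$2"
}

# Usage (the Java version string is an assumption; take yours from `java -version`):
# status_for_agent 'curl/7.43.0'    'https://www.gutscheinpony.de/feeds.xml?auth=XXX'
# status_for_agent 'Java/1.8.0_101' 'https://www.gutscheinpony.de/feeds.xml?auth=XXX'
# status_for_agent ''               'https://www.gutscheinpony.de/feeds.xml?auth=XXX'
```

If the status code changes with the agent string, the server is filtering on that header.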
A User-Agent value can be set for a whole Java process with the system property http.agent. The value only matters if the remote server checks it.
Thus, the DataImportHandler runs fine when Solr is started like this:
bin/solr -f -Dhttp.agent="test/me"
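To avoid passing the flag on every start, the same property can presumably be made persistent through the SOLR_OPTS variable in the include file that bin/solr reads (commonly solr.in.sh; its exact location is an assumption here and depends on how Solr was installed). A minimal sketch, reusing the placeholder agent value test/me:

```shell
# Sketch of a line to append to solr.in.sh (location varies by installation).
# -Dhttp.agent sets the JVM-wide default User-Agent; "test/me" is only a
# placeholder, replace it with a value the remote server accepts.
SOLR_OPTS="$SOLR_OPTS -Dhttp.agent=test/me"
```

Solr has to be restarted for the change to take effect.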