I am trying to use Solr to index a small dataset of XML documents. Here is a sample document:
<?xml version='1.0' encoding='utf-8'?>
<doc xmin="0" xmax="9.233174603174604">
<title>John speech</title>
<description>shjshksjcjslkclsjk </description>
<uploaded_time>03/14/2010 08:44 PM</uploaded_time>
<likes>84906</likes>
<tier name="words">
<trans xmin="0.0" xmax="0.8325873015873018">silent</trans>
<trans xmin="0.8325873015873018" xmax="1.9564232192938984">Hi</trans>
<trans xmin="1.9564232192938984" xmax="3.874938884654082">I</trans>
<trans xmin="3.874938884654082" xmax="4.940780920965295">am</trans>
<trans xmin="4.940780920965295" xmax="6.495133890585815">John</trans>
:
:
</tier>
</doc>
Can Solr index this kind of nested XML? I tried the DataImportHandler with this solrconfig.xml and this xml-data-config.xml (I am not sure they are correct; I still have no clear understanding of how to deal with nested XML, especially when the length of the tier is undetermined).
But when I run the data import, I get:
Indexing ... Requests: 0 , Fetched: 0 , Skipped: 0 , Processed: 0
and it stays like that for a long time, although my small dataset contains only a few short files.
What am I missing?
There were a lot of mistakes in my configuration files. The main problem was that I had to use XPathEntityProcessor, not TikaEntityProcessor, as the processor for the entity that reads the XML files. A field that can occur an undetermined number of times, like "trans", has to be declared with multiValued="true". And the uploaded time needed to be in ISO-8601 format, even after I added the DateFormatTransformer to the entity.
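For anyone hitting the same problem, here is a minimal sketch of what such a data-config could look like. It is not my exact file: the baseDir path, the entity names "files" and "speech", and the choice of fields are placeholders for illustration, and it assumes a Solr version that still ships the DataImportHandler. It also assumes the uploaded_time values in the files have already been rewritten to ISO-8601 (for example 2010-03-14T20:44:00Z), so that field is mapped without any transformer.

```xml
<dataConfig>
  <!-- Reads files from the local file system -->
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Outer entity: lists the XML files; baseDir is a placeholder path -->
    <entity name="files"
            processor="FileListEntityProcessor"
            baseDir="/path/to/xml/files"
            fileName=".*\.xml"
            recursive="false"
            rootEntity="false"
            dataSource="null">
      <!-- Inner entity: parses each file; one Solr document per /doc element -->
      <entity name="speech"
              processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}"
              forEach="/doc">
        <field column="title"         xpath="/doc/title"/>
        <field column="description"   xpath="/doc/description"/>
        <!-- Assumes the value in the file is already ISO-8601 -->
        <field column="uploaded_time" xpath="/doc/uploaded_time"/>
        <field column="likes"         xpath="/doc/likes"/>
        <!-- Matches every trans element, however many the tier contains -->
        <field column="trans"         xpath="/doc/tier/trans"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Because the trans xpath matches many elements per document, the matching field in the schema has to accept multiple values. The field type below (text_general) is an assumption; use whatever text type your schema defines:

```xml
<field name="trans" type="text_general" indexed="true" stored="true" multiValued="true"/>
```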