xml, solr, lucene, dataimporthandler, structured-data

Indexing structured XML documents in Solr/Lucene


I am trying to use Solr to index a small dataset of XML documents. Here is a sample document:

<?xml version='1.0' encoding='utf-8'?>
<doc xmin="0" xmax="9.233174603174604">
<title>John speech</title>
<description>shjshksjcjslkclsjk </description>
<uploaded_time>03/14/2010 08:44 PM</uploaded_time>
<likes>84906</likes>
<tier name="words">
<trans   xmin="0.0"  xmax="0.8325873015873018">silent</trans>
<trans   xmin="0.8325873015873018"   xmax="1.9564232192938984">Hi</trans>
<trans   xmin="1.9564232192938984"   xmax="3.874938884654082">I</trans>
<trans   xmin="3.874938884654082"    xmax="4.940780920965295">am</trans>
<trans   xmin="4.940780920965295"    xmax="6.495133890585815">John</trans>
:
:
</tier>
</doc>

Can Solr index this kind of nested XML? I tried the DataImportHandler with my solrconfig.xml and an xml-data-config.xml. I am not sure the latter is correct, because I still do not clearly understand how to handle nested XML, especially a tier that contains an undetermined number of trans elements.

When I run the data import, I get:

Indexing ... Requests: 0 , Fetched: 0 , Skipped: 0 , Processed: 0

The status stays like this for a long time, although my small dataset contains only a few short files.

What am I missing?


Solution

  • My configuration files contained several mistakes. The main problem was the entity processor: an entity that reads XML files has to use XPathEntityProcessor, not TikaEntityProcessor. A field that can occur any number of times, such as "trans", must be declared with multiValued="true". The uploaded time also had to be in ISO-8601 format, even after I added the DateFormatTransformer to the entity.
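
For illustration, a data-config along these lines might look like the sketch below. It is not the original poster's file. The directory in baseDir and the entity names "files" and "speech" are hypothetical, and the field names are taken from the sample document above. It assumes the XML files sit in one local directory and are listed with FileListEntityProcessor.

```xml
<dataConfig>
  <!-- Reads files from the local file system -->
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Outer entity: lists the XML files; baseDir is a placeholder path -->
    <entity name="files"
            processor="FileListEntityProcessor"
            baseDir="/path/to/xml/dataset"
            fileName=".*\.xml"
            rootEntity="false"
            dataSource="null">
      <!-- Inner entity: parses each file with XPathEntityProcessor;
           forEach creates one Solr document per doc element -->
      <entity name="speech"
              processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}"
              forEach="/doc">
        <field column="title"         xpath="/doc/title"/>
        <field column="description"   xpath="/doc/description"/>
        <!-- Per the solution above, this value had to be ISO-8601 in the source -->
        <field column="uploaded_time" xpath="/doc/uploaded_time"/>
        <field column="likes"         xpath="/doc/likes"/>
        <!-- Matches every trans element, so the schema field must be multi-valued -->
        <field column="trans"         xpath="/doc/tier/trans"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The matching schema entry for the repeated element would then be declared as multi-valued. The field type here is an assumption and should be whatever text type the schema already defines:

```xml
<field name="trans" type="text_general" indexed="true" stored="true" multiValued="true"/>
```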