
Solr - configuring schema.xml for XML data


I am trying to index the Wikitravel data using Solr installed on Windows. Below is a sample of the input data:

<?xml version="1.0" encoding="UTF-8"?>

<add> 
  <page> 
    <title>3Days 2Night Chiang Mai to Chiang Rai</title>  
    <id>83509</id>  
    <revision> 
      <id>1305791</id>  
      <timestamp>2009-11-27T10:35:53Z</timestamp>  
      <contributor> 
        <username>Texugo</username>  
        <id>7666</id>  
        <realname/> 
      </contributor>  
      <comment>[[3Days 2Night Chiang Mai to Chiang Rai]] moved to [[Chiang Mai to Chiang Rai in 3 days]]</comment>  
      <text xml:space="preserve">#REDIRECT [[Chiang Mai to Chiang Rai in 3 days]]</text> 
    </revision> 
  </page> 
</add>

In my schema.xml, I have added the following fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

<uniqueKey>id</uniqueKey>

Posting the file does not report any error; however, the data does not show up in the Solr web interface, and I cannot see any errors in the logs either.

$ java -jar post.jar wiki.xml
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
POSTing file wiki.xml
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..
Time spent: 0:00:00.342

Solution

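  • As @notdang said, Solr input XML has a particular form: each document must be a <doc> element inside <add>, with its values in <field name="..."> children. Your file uses <page> elements instead, so Solr finds no documents in it to index. You can: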

    1. Send the data in the XML format Solr expects.
    2. Use DataImportHandler (DIH), which can parse arbitrary XML.
    3. Pre-process the XML with XSLT on the way in so that it looks like the XML Solr expects.
    4. Convert the data to JSON and post that instead.
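
    For option 1, here is a minimal sketch of how the sample page from the question might look in Solr's XML update format. It assumes the page-level <id> is the one you want as the uniqueKey and that the revision's <comment> should go into the "comments" field from your schema; adjust the mapping if you meant something else.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <add>
      <!-- One <doc> per wiki page; every value goes into a named <field>. -->
      <doc>
        <!-- Assumption: the page id (not the revision id) is the unique key. -->
        <field name="id">83509</field>
        <field name="title">3Days 2Night Chiang Mai to Chiang Rai</field>
        <!-- Assumption: <comment> in the source maps to the "comments" schema field. -->
        <field name="comments">[[3Days 2Night Chiang Mai to Chiang Rai]] moved to [[Chiang Mai to Chiang Rai in 3 days]]</field>
        <field name="text">#REDIRECT [[Chiang Mai to Chiang Rai in 3 days]]</field>
      </doc>
    </add>
    ```

    A file in this shape can be posted with the same post.jar command used in the question.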

    I suspect that option 2 (DataImportHandler) is the easiest if you are working with third-party XML files. DIH can also import very large XML files, because it processes them as it reads them, whereas posting a large file to Solr may hit a request size limit.
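
    A rough sketch of what a DIH configuration for this file could look like, using XPathEntityProcessor. The file path is a placeholder, the field mapping is my guess at what you want, and it assumes the DataImportHandler request handler is already registered in solrconfig.xml and pointed at this config file.

    ```xml
    <dataConfig>
      <!-- Reads the XML from the local file system. -->
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <!-- forEach selects the element that becomes one Solr document.
             stream="true" lets DIH handle large files without loading them whole.
             The url value is a placeholder path, not a real location. -->
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/add/page"
                url="/path/to/wiki.xml">
          <!-- column = Solr field name from schema.xml, xpath = where the value lives. -->
          <field column="id"       xpath="/add/page/id"/>
          <field column="title"    xpath="/add/page/title"/>
          <field column="comments" xpath="/add/page/revision/comment"/>
          <field column="text"     xpath="/add/page/revision/text"/>
        </entity>
      </document>
    </dataConfig>
    ```

    With something like this in place, the import is triggered through the DIH handler (for example with command=full-import) rather than by posting the file.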