Splitting/Mapping Cassandra concatenated values into Solr fields


I'm using Solr and Cassandra (via DSE). Here is one entry (row) of data in Cassandra:

ORDER_INFO_CF
 -orderHistoryID=1000072459
   -SPECIAL_COLUMN_KEY=0800000002||1294034400000|113942

I can index the Cassandra data without an issue, with this schema.xml:

<schema name="ORDER_INFO_CF" version="1.1">
 <types>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text" class="solr.TextField">
    <analyzer><tokenizer class="solr.WikipediaTokenizerFactory"/></analyzer>
  </fieldType>
 </types>
 <fields>
    <field name="orderHistoryID"  type="string" indexed="true"  stored="true"/>
    <field name="SPECIAL_COLUMN_KEY"  type="text" indexed="true"  stored="true"/>
 </fields>
</schema>

Of course, having all the data lumped into one pipe-delimited string doesn't help very much. So I tried to split it using the PatternTokenizerFactory, like this (schema.xml):

<schema name="ORDER_INFO_CF" version="1.1">
 <types>
  <fieldType name="string" class="solr.StrField" />
  <fieldType name="splitField" class="solr.TextField">
   <analyzer><tokenizer class="solr.PatternTokenizerFactory" pattern="|" /></analyzer>
  </fieldType>
 </types>
 <fields>
    <field name="orderHistoryID"  type="string" indexed="true"  stored="true"/>
    <field name="AccountNumber"  type="splitField" indexed="true"  stored="true"/>
    <field name="ActionFlag"  type="splitField" indexed="false"  stored="true"/>
    <field name="CreatedDate"  type="splitField" indexed="true"  stored="true"/>
    <field name="CreatedTime"  type="splitField" indexed="true"  stored="true"/>
 </fields>
</schema>

orderHistoryID is still being mapped, but the SPECIAL_COLUMN_KEY value is not being split into the four fields described above. I'm sure that I'm just not doing something quite right with the PatternTokenizerFactory. I've also looked at the DataImportHandler RegexTransformer, but that only seems to work with RDBMS and XML imports.

Essentially, my data maps like this in Solr:

orderHistoryID=1000072459
SPECIAL_COLUMN_KEY=0800000002||1294034400000|113942

And I'm trying to get it to map like this:

orderHistoryID=1000072459
AccountNumber=0800000002
ActionFlag=
CreatedDate=1294034400000
CreatedTime=113942

Could someone please point me in the right direction?


Solution

  • An easier way to solve this problem would be to use SolrJ. Assuming that you already have an API to read records from Cassandra, you will be able to feed them to Solr using SolrJ.

    The other way would be to create a custom POJO whose fields carry SolrJ's @Field annotation. For example:

    import org.apache.solr.client.solrj.beans.Field;
    
    public class CustomRecord {
       @Field
       private String orderHistoryID;
       @Field
       private String AccountNumber;
       @Field
       private String ActionFlag;
       @Field
       private String CreatedDate;
       @Field
       private String CreatedTime;
    }
    

    and then use

    SolrServer server = new HttpSolrServer("http://HOST:8983/solr/");
    server.addBean(customRecord);
    

    For more details, refer to the SolrJ documentation on directly adding POJOs to Solr.
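
Either way, the pipe-delimited value has to be split in your own code before the fields are handed to SolrJ. Here is a minimal sketch of that step using only the JDK. The class name SpecialColumnKeySplitter and the method toSolrFields are hypothetical names chosen for illustration, and the sketch assumes every value holds exactly four pipe-separated parts in the order shown in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits the concatenated Cassandra value into named fields.
public class SpecialColumnKeySplitter {

    // Field names from the question's schema, in the order the values appear (assumed).
    private static final String[] FIELD_NAMES = {
        "AccountNumber", "ActionFlag", "CreatedDate", "CreatedTime"
    };

    static Map<String, String> toSolrFields(String orderHistoryID, String specialColumnKey) {
        // "|" is the regex alternation operator, so it must be escaped here.
        // A limit of -1 keeps trailing empty values too, so a row whose last
        // value is blank still yields four parts.
        String[] parts = specialColumnKey.split("\\|", -1);
        if (parts.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_NAMES.length + " values but found " + parts.length);
        }

        // LinkedHashMap preserves insertion order, which keeps the output readable.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("orderHistoryID", orderHistoryID);
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            fields.put(FIELD_NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields =
            toSolrFields("1000072459", "0800000002||1294034400000|113942");
        // Prints one name=value line per field, with ActionFlag left empty.
        fields.forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

The resulting values can then be copied into a CustomRecord (through a constructor or setters that you would add) before calling addBean.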