Indexing UTF8 Data from MySQL into SOLR 4 using DataImportHandler

I have a setup where I have a MySQL DB using Amazon's RDS (version 5.5.x). I have confirmed that the DB is configured for UTF8. I can insert and select the chinese characters using PHP. It looks fine.

I have SOLR running on Tomcat6 (the stock Tomcat6 that comes from Ubuntu 10.04LTS)

The issue now is plugging SOLR 4.0 into it using MySQL's own JDBC driver.

I'll provide the relevant configs below, but the issue I'm trying to solve is once I index my DB table into SOLR, the ASCII chars are fine (of course), but the unicode characters, such as Chinese, show up as jiberish. I cannot query on those chars either (but querying on English words works great!).

Has anyone encountered this issue?

Here's my db conf xml file that is used for the DataImportHandler:

<dataConfig>
        <dataSource     type="JdbcDataSource"
                        driver="com.mysql.jdbc.Driver"
                        url="jdbc:mysql://db.host.com/db_name?useUnicode=yes&amp;characterEncoding=UTF-8&amp;characterSetResults=UTF-8"
                        user="db_username"
                        password="db_password"
        />

        <document name="vspcm">
                <entity name="conversations"
                        query="select query from mysql"
                >
                </entity>
        </document>
</dataConfig>

I've played around with the parameters in the url variable above, changing utf8 to UTF-8, etc.. nothing sticks. (I've changed the query to remove my table structure, but i confirmed that it works - delta imports work fine)

For schema.xml, I started from the collection1/conf/schema.xml, removed all the unneccessary fields, and added my own. All text fields (that are varchars in mysql) are set to text_general Example field:

<field name="msg"       type="text_general"     indexed="true"  stored="true"  multiValued="true"/>

(the field above would have unicode characters in it.

The text_general field definition is:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Just to make sure that SOLR plays nice with unicode chars, I imported the example utf8-example.xml schema that is provided in the download of SOLR 4 into "collection1", and it does indeed idex the example chars. I added some Chinese to the xml file, imported it, and queried against collection1, and I get chinese characters.

So I suspect it is somewhere in the DataImportHandler.

Has anyone encountered this before and come up with a solution?

Thank you so much! Chris

Solution

Fixed it - was actually a DB issue.