SOLR and accented characters

I have an index for occupations (identifier + occupation):

<field name="occ_id" type="int" indexed="true" stored="true" required="true" />
<field name="occ_tx_name" type="text_es" indexed="true" stored="true" multiValued="false" />

<!-- Spanish -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

This is a real query, for three identifiers (1, 195 and 129):

curl -X GET ""
      "q":"occ_id:1 occ_id:195 occ_id:129",

Two of them have accented characters, and one does not. So let’s search by occ_tx_name without using accents:

curl -X GET ""

curl -X GET ""

curl -X GET ""

I am very annoyed that the last search, ‘osteopata’, fails, while ‘informatico’ succeeds. The source data for the index is a simple MySQL table:

-- -----------------------------------------------------
-- Table `mydb`.`occ_occupation`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`occ_occupation` (
  `occ_tx_name` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`occ_id`)

The collation of the table is “utf8mb4_general_ci”. The index is created with DataImportHandler. This is the definition:

    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://"
        user="mydb" password="mydb" />
    <document name="occupations">
        <entity name="occupation" pk="occ_id"
            query="SELECT occ.occ_id, occ.occ_tx_name FROM occ_occupation occ WHERE occ.sta_bo_deleted = false">
            <field column="occ_id" name="occ_id" />
            <field column="occ_tx_name" name="occ_tx_name" />

I need a clue to track down the problem. Can anyone help me? Thanks in advance.


  • OK, I have found the source of the problem. I opened my SQL load script in vi, in hex mode.

    This is the hex content for 'Agrónomo' in an INSERT statement: 41 67 72 6f cc 81 6e 6f 6d 6f.

    6f cc 81! That is "o" followed by COMBINING ACUTE ACCENT (U+0301), encoded in UTF-8.

    So that's the problem. It should be "c3 b3", the precomposed "ó" (U+00F3). I got the literals by copying and pasting from a web page, so the characters in the original source were the problem.

    Thanks to both of you; I have learned more about SOLR's soul.
