I'm having trouble with character encoding in Solr. We have the "same" setup on two different servers. The production server indexes the document without any ??? characters, but the test server does not.
Examples of Solr results:
Prod Server :
effet sur l’acquisition des connaissances »\n\n#12;#12;EFFET D’UNE SÉQUENCE
Test Server :
effet sur l’acquisition des connaissances »\n\n��EFFET D’UNE SÉQUENCE D’ENSEIGNEMENTS
I have the same version of Java running on both servers :
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
Both have the same Java Options :
JAVA_OPTS=" -Dfile.encoding=UTF-8 "
Both Solr instances show the same Java properties (in the admin UI).
What does the #12; mean?
Where could the problem be?
OS:
Software:
EDIT : Output of locale on both servers :
LANG=en_CA.utf8
LANGUAGE=en_CA:en
LC_CTYPE="en_CA.utf8"
LC_NUMERIC="en_CA.utf8"
LC_TIME="en_CA.utf8"
LC_COLLATE="en_CA.utf8"
LC_MONETARY="en_CA.utf8"
LC_MESSAGES="en_CA.utf8"
LC_PAPER="en_CA.utf8"
LC_NAME="en_CA.utf8"
LC_ADDRESS="en_CA.utf8"
LC_TELEPHONE="en_CA.utf8"
LC_MEASUREMENT="en_CA.utf8"
LC_IDENTIFICATION="en_CA.utf8"
LC_ALL=
Thank you!
The problem was not the encoding but the way DSpace works. I had to run this command :
./dspace filter-media -f
This command regenerates the .txt file extracted from the PDF and reindexes the document. Until I ran it, every attempt to reindex the document with the correct encoding kept using the old extracted text, so nothing changed.
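To check whether a regenerated text file is clean, a small Python sketch like the one below can help. It rests on two assumptions that the post does not confirm: that "#12;" stands for character code 12 (the form feed, which PDF-to-text tools commonly emit at page breaks), and that the "�" seen on the test server is U+FFFD, the Unicode replacement character. The function name and the sample string are hypothetical.

```python
# Assumption: "#12;" refers to code point 12 (form feed), often written by
# PDF text extractors at page breaks.
FORM_FEED = "\x0c"
# Assumption: the "�" in the test server output is U+FFFD, the character a
# decoder substitutes for bytes it cannot interpret.
REPLACEMENT = "\ufffd"


def describe_suspects(text):
    """Count form feeds and replacement characters in extracted text."""
    return {
        "form_feeds": text.count(FORM_FEED),
        "replacement_chars": text.count(REPLACEMENT),
    }


# Hypothetical sample standing in for the contents of an extracted .txt file.
sample = "connaissances \u00bb\n\n\x0c\x0cEFFET D\u2019UNE S\u00c9QUENCE"
counts = describe_suspects(sample)
print(counts)

# One way "�" ends up in a file: Latin-1 bytes decoded as UTF-8 with
# replacement enabled. Each invalid byte becomes U+FFFD.
bad = b"\xe9t\xe9".decode("utf-8", errors="replace")
print(repr(bad))
```

If the replacement character count is above zero in the extracted file itself, the damage happened at extraction time, and reindexing alone cannot repair it. That would be consistent with the fix above.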