lucene, search-engine

How do I handle special characters like # in OpenSearchServer / Lucene?


I am using OpenSearchServer (community edition) v1.2.4-rc3 - stable - rev 1474 - build 802. I crawl a C# and C++ programming website. When I search for C# or C++, the software strips special characters such as # and +, so the results it returns are not exact matches. How do I handle special characters like # in OpenSearchServer / Lucene? Can anyone suggest an approach? Thanks in advance.


Solution

  • You need to change your indexing strategy to use a custom or semi-custom tokenizer that preserves the special characters needed to represent terms such as C# and C++. Use the same tokenizer both at index time and at query time; otherwise the query terms will not match the indexed terms.

    Offhand, I would look at org.apache.lucene.analysis.standard and org.apache.lucene.wikipedia.analysis for ideas on how to construct the tokenizer. Using a tokenizer (lexical analyzer) generator such as JFlex may be a better fit than hand-coding it.
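
    To make the tokenization rule concrete, here is a minimal hand-coded sketch in plain Java. It is not Lucene's or OpenSearchServer's API: the class name CodeTermTokenizer and its tokenize method are hypothetical, and the rule itself (a run of letters or digits, plus any directly attached # or + characters when they end the term) is only an assumption about what "preserving special characters" should mean here. In a real setup the same rule would live inside a custom Lucene tokenizer or a JFlex grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical standalone sketch; not part of Lucene or OpenSearchServer.
final class CodeTermTokenizer {
    private CodeTermTokenizer() {}

    // Splits text into lower-cased terms, keeping a trailing run of '#' or '+'
    // attached to the preceding word (so "C#" -> "c#", "C++" -> "c++").
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators and stray symbols
                continue;
            }
            int start = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            // Look ahead over an attached run of '#' or '+'.
            int end = i;
            while (end < n && (text.charAt(end) == '#' || text.charAt(end) == '+')) {
                end++;
            }
            // Keep the suffix only if it ends the term; in "a+b" the '+'
            // is treated as a separator instead.
            if (end == n || !Character.isLetterOrDigit(text.charAt(end))) {
                i = end;
            }
            tokens.add(text.substring(start, i).toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C# and C++")); // prints [c#, and, c++]
    }
}
```

    The same function has to be applied to the query string as well as to the crawled text, so that a search for C# yields the term c# rather than c. Note also that + is an operator in Lucene's classic query syntax, so a query such as C++ may need escaping before it reaches the analyzer.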