Using Umbraco v6, Examine search (not full blown Lucene queries). This is a Latin/South American website. I've asked my colleagues how they type tildes (accent marks over letters) for search/URLs, and they all said that they don't; they just use "regular" characters (A-Z, a-z).
I know how to strip special characters out of the string before passing it to Examine, but I need the other way around: Examine should ignore the accents in the indexed properties so that they match the query. I have numerous nodes with tildes in the name (one of the properties I am searching on).
Posts that I've researched:
I've tried writing the Lucene query (or so I think), but I'm not getting any hits.
// q is my query from QueryString
var searcher = ExamineManager.Instance.SearchProviderCollection["CustomSearchSearcher"];
//var query = searcher.CreateSearchCriteria().Field("nodeName", q).Or().Field("description", q).Compile();
//var searchResults = searcher.Search(query).OrderByDescending(x => x.Score).TakeWhile(x => x.Score > 0.05f);
var searchResults = searcher.Search(Global.RemoveSpecialCharacters(q), true).OrderByDescending(x => x.Score).TakeWhile(x => x.Score > 0.05f);
Global class:
public static string RemoveSpecialCharacters(string str)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in str)
    {
        // Keep digits, ASCII letters, '.', '_' and the accented Spanish letters.
        // A-Z and a-z are tested separately: a single 'A'..'z' range would also
        // let through the punctuation that sits between 'Z' and 'a'.
        if ((c >= '0' && c <= '9')
            || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || c == '.' || c == '_'
            || c == 'á' || c == 'é' || c == 'í' || c == 'ñ' || c == 'ó' || c == 'ú')
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
As stated above, I need the special characters (tildes) removed on the Lucene side, not from the query passed in.
From: https://our.umbraco.org/documentation/reference/searching/examine/overview-explanation
I've also read about "Analyzers", but I have never worked with them before, and I don't know which one(s) to get/install/add to VS, etc. Is that the better way to go about this?
A custom analyzer is the answer.
This is answered on the umbraco forum here: https://our.umbraco.org/forum/developers/extending-umbraco/16396-Examine-and-accents-for-portuguese-language
Make an analyzer that folds accented characters to their ASCII equivalents:
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class CIAIAnalyser : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // Standard tokenization first, then lower-casing, then accent folding (á -> a, ñ -> n).
        StandardTokenizer tokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        tokenizer.SetMaxTokenLength(255);
        TokenStream stream = new StandardFilter(tokenizer);
        stream = new LowerCaseFilter(stream);
        return new ASCIIFoldingFilter(stream);
    }
}
Then do the same for the search input:
using System;
using System.Globalization;
using System.Text;

public class CleanAccent
{
    public static string RemoveDiacritics(string input)
    {
        if (String.IsNullOrEmpty(input)) return input;
        // Full canonical decomposition (FormD) splits each accented letter
        // into its base letter followed by separate combining marks.
        string inputInFormD = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();
        for (int idx = 0; idx < inputInFormD.Length; idx++)
        {
            // Drop the combining marks and keep everything else.
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(inputInFormD[idx]);
            if (uc != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(inputInFormD[idx]);
            }
        }
        // Recompose whatever is left into FormC.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
Then reference the analyzer in ExamineSettings.config.
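A minimal sketch of what that reference might look like, assuming the analyzer class lives in a namespace `MySite.Search` inside an assembly called `MySite` (both hypothetical) and that the indexer paired with the question's `CustomSearchSearcher` is named `CustomSearchIndexer` (also an assumption). Only the `analyzer` attribute changes; any other attributes already on your providers stay as they are.

```xml
<Examine>
  <ExamineIndexProviders>
    <providers>
      <!-- "CustomSearchIndexer" and "MySite.Search.CIAIAnalyser, MySite" are placeholder names -->
      <add name="CustomSearchIndexer"
           type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
           analyzer="MySite.Search.CIAIAnalyser, MySite" />
    </providers>
  </ExamineIndexProviders>
  <ExamineSearchProviders defaultProvider="CustomSearchSearcher">
    <providers>
      <!-- Use the same analyzer on the searcher so query terms are folded the same way -->
      <add name="CustomSearchSearcher"
           type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="MySite.Search.CIAIAnalyser, MySite" />
    </providers>
  </ExamineSearchProviders>
</Examine>
```

An analyzer only affects content as it is indexed, so the index needs to be rebuilt after this change before existing nodes become searchable without their accents.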