How to reduce a string to 7-bit ASCII characters for indexing purposes?


I am working on an application which must index certain sentences. I am currently using Java and PostgreSQL. The sentences may be in several languages, such as French and Spanish, which use accents and other non-ASCII symbols.

For each word I want to create an index-able equivalent so that a user can perform a search insensitive to accents (transliteration). For example, when the user searches "nacion" it must find it even if the original word stored by the application was "Nación".

What would be the best strategy for this? I am not restricted to PostgreSQL, and the internal indexed value does not need to resemble the original word. Ideally, it should be a generic solution for converting any Unicode string into an ASCII string that is insensitive to case and accents.

So far I am using a custom function shown below which naively just replaces some letters with ASCII equivalents before storing the indexed value and does the same on query strings.

public String toIndexableASCII (String sStrIn) {
  if (sStrIn==null) return null;
  int iLen = sStrIn.length();
  if (iLen==0) return sStrIn;
  StringBuilder sStrBuff = new StringBuilder(iLen);
  String sStr = sStrIn.toUpperCase();

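  // toUpperCase() can change the length (e.g. "ß" becomes "SS"), so iterate over sStr itself
  for (int c=0; c<sStr.length(); c++) {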
    switch (sStr.charAt(c)) {
      case 'Á':
      case 'À':
      case 'Ä':
      case 'Â':
      case 'Å':
      case 'Ã':
        sStrBuff.append('A');
        break;
      case 'É':
      case 'È':
      case 'Ë':
      case 'Ê':
        sStrBuff.append('E');
        break;
      case 'Í':
      case 'Ì':
      case 'Ï':
      case 'Î':
        sStrBuff.append('I');
        break;
      case 'Ó':
      case 'Ò':
      case 'Ö':
      case 'Ô':
      case 'Ø':
        sStrBuff.append('O');
        break;
      case 'Ú':
      case 'Ù':
      case 'Ü':
      case 'Û':
        sStrBuff.append('U');
        break;
      case 'Æ':
        sStrBuff.append('E');
        break;
      case 'Ñ':
        sStrBuff.append('N');
        break;
      case 'Ç':
        sStrBuff.append('C');
        break;
      case 'ß':
        sStrBuff.append('B');
        break;
      case (char)255:
        sStrBuff.append('_');
        break;
      default:
        sStrBuff.append(sStr.charAt(c));
    }
  }

  return sStrBuff.toString();
}

Solution

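        import java.text.Normalizer; // at the top of the file

        String s = "Nación";

        // Decompose each accented letter into its base letter plus combining marks.
        String x = Normalizer.normalize(s, Normalizer.Form.NFD);

        // Copy everything except the combining marks.
        StringBuilder sb = new StringBuilder(x.length());
        for (char c : x.toCharArray()) {
            if (Character.getType(c) != Character.NON_SPACING_MARK) {
                sb.append(c);
            }
        }

        System.out.println(s); // Nación
        System.out.println(sb.toString()); // Nacion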
    

    How this works: NFD normalization decomposes each accented character into its base letter followed by combining marks (ó becomes o + U+0301, the combining acute accent); the loop then drops those marks and keeps the base letters.

    Character.NON_SPACING_MARK is Java's constant for the Unicode general category Mn (Mark, Nonspacing), which includes the combining diacritical marks.
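
    The question also asks for case insensitivity. In addition, letters such as ø, æ and ß have no canonical decomposition, so NFD leaves them unchanged. The following is a minimal sketch of how the approach above might be wrapped into a reusable method. The class name AsciiFolder, the method names, the fallback table and the choice of '_' for unmapped characters are all assumptions made for illustration, and the table is far from exhaustive. Non-ASCII characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AsciiFolder {

    // Hypothetical fallback table for letters that NFD does not decompose.
    // Illustrative only; extend it for the languages you need to support.
    private static String fallback(int cp) {
        switch (cp) {
            case '\u00f8': return "o";   // o with stroke
            case '\u00e6': return "ae";  // ae ligature
            case '\u0153': return "oe";  // oe ligature
            case '\u00df': return "ss";  // sharp s
            case '\u0111': return "d";   // d with stroke
            case '\u0142': return "l";   // l with stroke
            default:       return null;
        }
    }

    public static String toIndexableAscii(String input) {
        if (input == null) return null;
        // Lowercase with a fixed locale so the result does not depend on the JVM default.
        String lower = input.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                continue; // drop combining marks
            }
            if (cp < 128) {
                out.append((char) cp); // already ASCII
            } else {
                String repl = fallback(cp);
                out.append(repl != null ? repl : "_"); // assumption: '_' marks unmapped characters
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toIndexableAscii("Naci\u00f3n"));  // nacion
        System.out.println(toIndexableAscii("Stra\u00dfe"));  // strasse
    }
}
```

    Apply the same method to both the stored value and the query string so that the two sides are folded identically.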