Canonicalize string using CultureInfo and CompareOptions


Currently I have code that, given CultureInfo cultureInfo = new CultureInfo("ja-JP"), performs a search using

bool found = cultureInfo.CompareInfo.IndexOf(x, y,
    CompareOptions.IgnoreCase | 
    CompareOptions.IgnoreKanaType | 
    CompareOptions.IgnoreWidth
) >= 0;

Since a plain x.IndexOf(y) is much faster, and my xs are numerous and rarely change, I'd like to canonicalize the xs once and then perform the search with a simple

canonicalizedX.IndexOf(canonicalize(y));

My question: Is there anything in the .NET libraries that I could use to implement the canonicalize() function, using my CultureInfo and CompareOptions?
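The "canonicalize once, search many times" pattern described above could be sketched roughly as follows. This is only an illustration: the CanonicalIndex class and its Find method are hypothetical names, and the canonicalization step is passed in as a delegate because its implementation is exactly what is being asked about.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: canonicalizes every haystack once, then answers
// substring queries with ordinal comparisons only.
public sealed class CanonicalIndex
{
    private readonly Func<string, string> _canonicalize;
    private readonly List<(string Original, string Canonical)> _entries =
        new List<(string Original, string Canonical)>();

    public CanonicalIndex(IEnumerable<string> haystacks, Func<string, string> canonicalize)
    {
        _canonicalize = canonicalize ?? throw new ArgumentNullException(nameof(canonicalize));
        foreach (string x in haystacks)
        {
            // Pay the canonicalization cost once per haystack.
            _entries.Add((x, _canonicalize(x)));
        }
    }

    // Returns the original strings whose canonical form contains the canonical form of y.
    public List<string> Find(string y)
    {
        string needle = _canonicalize(y);
        var hits = new List<string>();
        foreach (var entry in _entries)
        {
            // StringComparison.Ordinal avoids the culture-sensitive default of string.IndexOf(string).
            if (entry.Canonical.IndexOf(needle, StringComparison.Ordinal) >= 0)
            {
                hits.Add(entry.Original);
            }
        }
        return hits;
    }
}
```

For a quick experiment, something like s => s.ToLowerInvariant() can stand in for the delegate, although that only folds case and is not equivalent to IgnoreCase | IgnoreKanaType | IgnoreWidth under ja-JP.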


Solution

  • I ended up using LCMapStringEx, and it works fine for me. It does not accept an arbitrary set of CompareOptions, but the CompareInfo.GetSortKey docs led me to LCMapString, so an IndexOf on the canonicalized strings should yield the same result as CultureInfo.CompareInfo.IndexOf with the corresponding CompareOptions, which are hardcoded here as dwMapFlags:

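    // Requires: using System; using System.Runtime.InteropServices;
    public static string Canonicalize(string src)
    {
        const string localeName = "ja-JP";
        // Fold case, kana type and character width in a single mapping pass.
        const uint dwMapFlags = LCMAP_LOWERCASE | LCMAP_HIRAGANA | LCMAP_FULLWIDTH;
        IntPtr pZero = IntPtr.Zero;
    
        // With a null destination, the call returns the required buffer size in characters.
        int cchDest = LCMapStringEx(localeName, dwMapFlags, src, src.Length, IntPtr.Zero, 0, pZero, pZero, pZero);
        if (cchDest <= 0) return src; // mapping failed (or src is empty): fall back to the input
    
        // Allocate in bytes, but keep passing the size to the API in characters.
        IntPtr ptr = Marshal.AllocHGlobal(cchDest * sizeof(char));
        try
        {
            int written = LCMapStringEx(localeName, dwMapFlags, src, src.Length, ptr, cchDest, pZero, pZero, pZero);
            return written > 0 ? Marshal.PtrToStringUni(ptr, written) : src;
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
    }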
    
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern int LCMapStringEx(
         string lpLocaleName,
         uint dwMapFlags,
         string lpSrcStr,
         int cchSrc,
         [Out]
         IntPtr lpDestStr,
         int cchDest,
         IntPtr lpVersionInformation,
         IntPtr lpReserved,
         IntPtr sortHandle);
    
    private const uint LCMAP_LOWERCASE = 0x100;
    private const uint LCMAP_UPPERCASE = 0x200;
    private const uint LCMAP_SORTKEY = 0x400;
    private const uint LCMAP_BYTEREV = 0x800;
    private const uint LCMAP_HIRAGANA = 0x100000;
    private const uint LCMAP_KATAKANA = 0x200000;
    private const uint LCMAP_HALFWIDTH = 0x400000;
    private const uint LCMAP_FULLWIDTH = 0x800000;
    

    I also tried Microsoft.VisualBasic.Strings.StrConv, which works but is twice as slow as P/Invoking LCMapStringEx.
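    Since the equivalence between the LCMapStringEx flags and the CompareOptions is only expected rather than guaranteed, it may be worth checking it on sample data. The following is a minimal sketch of such a check. CanonicalizeCheck and FindMismatches are hypothetical names, and the canonicalizer is passed as a delegate so that either Canonicalize above or a StrConv-based variant can be plugged in.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CanonicalizeCheck
{
    // Returns every (x, y) pair for which the canonicalized ordinal search
    // disagrees with the culture-aware CompareInfo.IndexOf search.
    public static List<(string X, string Y)> FindMismatches(
        CompareInfo compareInfo,
        CompareOptions options,
        Func<string, string> canonicalize,
        IEnumerable<(string X, string Y)> pairs)
    {
        var mismatches = new List<(string X, string Y)>();
        foreach (var (x, y) in pairs)
        {
            // Reference result: the slow, culture-aware search from the question.
            bool expected = compareInfo.IndexOf(x, y, options) >= 0;
            // Candidate result: ordinal search over canonicalized strings.
            bool actual = canonicalize(x).IndexOf(canonicalize(y), StringComparison.Ordinal) >= 0;
            if (expected != actual)
            {
                mismatches.Add((x, y));
            }
        }
        return mismatches;
    }
}
```

    In the setting of the question, this would be called with new CultureInfo("ja-JP").CompareInfo, the options IgnoreCase | IgnoreKanaType | IgnoreWidth, the Canonicalize method, and a set of pairs mixing hiragana, katakana, half-width and full-width text. An empty result only shows agreement on the pairs that were tested.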