Tags: batch-file, unicode, findstr

Is there a way to use FINDSTR with non-ASCII (in this case Japanese/Chinese) characters in batch?


I have a list of Japanese Kanji and their pronunciations saved in a text file (JouyouKanjiReadings.txt) like this

亜   ア
哀   アイ,あわれ,あわれむ
愛   アイ
悪   アク,オ,わるい
握   アク,にぎる
圧   アツ
(each gap is made by pressing TAB)

and I have a script like this

@echo off
set /p text=Enter here: 
echo %text%>Search.txt
echo.
findstr /G:"Search.txt" JouyouKanjiReadings.txt || echo No Results && pause > nul && exit
pause > nul

However, when I run the script, I always get "No Results". I tried with English characters and it worked fine. I also tried the same script with this

findstr "%text%" JouyouKanjiReadings.txt || echo No Results && pause > nul && exit

but got the same result. Is there any way to get around this? Also, I'm displaying these characters correctly in the command prompt by using

chcp 65001

and a different font.


Solution

  • You need to use find (which supports Unicode but not regex) instead of findstr (which supports regex but not Unicode). See "Why are there both FIND and FINDSTR programs, with unrelated feature sets?"

    D:\kanji>chcp
    Active code page: 65001
    
    D:\kanji>find "哀" JouyouKanjiReadings.txt
    
    ---------- JOUYOUKANJIREADINGS.TXT
    哀      アイ,あわれ,あわれむ
    

    If you only need to know whether a match exists, redirect the output to NUL to suppress it.
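
    As a rough, untested sketch, the script from the question could be adapted to find along these lines. It assumes JouyouKanjiReadings.txt is saved as UTF-8 in the current directory and that the input contains no double quotes, which would break the quoting.

    ```batch
    @echo off
    rem Sketch: the question's script with find in place of findstr.
    rem Assumption: JouyouKanjiReadings.txt is UTF-8 and sits in the current directory.
    chcp 65001 > nul
    set /p "text=Enter here: "
    echo.
    rem find sets ERRORLEVEL to 1 when nothing matches, so || runs only on a miss.
    find "%text%" JouyouKanjiReadings.txt || echo No Results
    pause > nul
    ```

    Because the search text comes from set /p rather than from a literal in the script, the encoding of the .bat file itself should not matter here.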

    That said, find isn't a good solution either. Nowadays you should use PowerShell instead of cmd, which carries many quirks for the sake of backward compatibility. PowerShell fully supports Unicode and can call any .NET Framework method. To search for strings, use the Select-String cmdlet or its alias sls:

    PS D:\kanji> Select-String '握'  JouyouKanjiReadings.txt
    
    JouyouKanjiReadings.txt:5:握    アク,にぎる
    

    In fact, you don't even need UTF-8 and code page 65001. Just store the file as UTF-16 with a BOM (which also gives a smaller file, since most Japanese characters take 2 bytes in UTF-16 but 3 in UTF-8); find and sls will then detect the encoding and search in UTF-16 automatically.
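
    One possible way to do that conversion from cmd is sketched below. It assumes the existing file is UTF-8; the output name JouyouKanjiReadings-utf16.txt is only a placeholder chosen for this example.

    ```batch
    @echo off
    rem Sketch: re-save a UTF-8 text file as UTF-16 LE with a BOM.
    rem Assumption: the source file is UTF-8; the target file name is hypothetical.
    rem In PowerShell, "-Encoding Unicode" means UTF-16 little-endian with a BOM.
    powershell -NoProfile -Command "Get-Content -Encoding UTF8 'JouyouKanjiReadings.txt' | Set-Content -Encoding Unicode 'JouyouKanjiReadings-utf16.txt'"
    ```

    Any editor that can save as "UTF-16 LE" or "Unicode" with a BOM would do the same job.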

    Of course, if you have a lot of existing batch code, you can call PowerShell from cmd like this:

    powershell -Command "Select-String '哀'  JouyouKanjiReadings.txt"
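
    To search for text entered by the user instead of a fixed character, something like the following untested sketch might work. It assumes the input contains no single or double quotes, since those would break the quoting of the PowerShell command.

    ```batch
    @echo off
    rem Sketch: pass the user's input from cmd to PowerShell's Select-String.
    rem Assumption: the input contains no ' or " characters.
    set /p "text=Enter here: "
    echo.
    rem -SimpleMatch treats the input as literal text, not as a regular expression.
    rem -Encoding UTF8 reads the file as UTF-8 whether or not it has a BOM.
    rem Select-String does not fail when nothing matches, so the command exits
    rem with code 1 explicitly to let || detect a miss.
    powershell -NoProfile -Command "$m = Select-String -SimpleMatch -Pattern '%text%' -Path 'JouyouKanjiReadings.txt' -Encoding UTF8; $m; if (-not $m) { exit 1 }" || echo No Results
    pause > nul
    ```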
    

    But if the script is entirely new, just avoid the hassle and write it in PowerShell.