
How to check if text is Japanese Hiragana in Python?


I'm building a web crawler with Python and Scrapy to collect text from websites.

I only want to collect Japanese hiragana text. Is there a way to detect whether a piece of text is hiragana?


Solution

  • Assuming you only need hiragana, and your text is already decoded to Unicode (a Python 3 str rather than raw UTF-8 bytes):

    Hiragana occupies the Unicode block U+3040 to U+309F, so you can test each character against that range:

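    def char_is_hiragana(c: str) -> bool:
        # True if the single character c lies in the Hiragana block.
        return '\u3040' <= c <= '\u309F'

    def string_is_hiragana(s: str) -> bool:
        # Note: all() returns True for an empty string.
        return all(char_is_hiragana(c) for c in s)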
    
    print('ぁ', string_is_hiragana('ぁ'))
    print('ひらがな', string_is_hiragana('ひらがな'))
    print('a', string_is_hiragana('a'))
    print('english', string_is_hiragana('english'))
    
    ぁ True
    ひらがな True
    a False
    english False
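
    The same range test can also be written as a regular expression. The sketch below is a variant under stated assumptions, not part of the original answer: the name string_is_hiragana_re is hypothetical, and the pattern is deliberately narrowed to the assigned hiragana letters (U+3041 to U+3096) plus the iteration marks and ゟ (U+309D to U+309F). It therefore rejects the unassigned code points and the voicing marks (U+3099 to U+309C) that the plain block comparison accepts, and unlike the all() version it returns False for an empty string.

```python
import re

# Assumption: only assigned hiragana letters and the iteration marks count.
# U+3041..U+3096 are the letters, U+309D..U+309F are the iteration marks and YORI.
HIRAGANA_RE = re.compile(r'[\u3041-\u3096\u309D-\u309F]+')


def string_is_hiragana_re(s: str) -> bool:
    # fullmatch requires the whole string to match; '+' rejects the empty string.
    return HIRAGANA_RE.fullmatch(s) is not None
```

    Compiling the pattern once may be convenient in a crawler, where the check runs on many short strings.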
    

    But note that this rejects everything outside that block, including historic and non-standard hiragana (hentaigana), whitespace, punctuation, katakana and kanji:

    # hiragana
    print('ひらがな', string_is_hiragana('ひらがな'))
    # katakana
    print('カタカナ', string_is_hiragana('カタカナ'))
    # kanji
    print('漢字', string_is_hiragana('漢字'))
    # punctuation
    print('ひらがなもじ「ゆ」', string_is_hiragana('ひらがなもじ「ゆ」'))
    print('いいひと。', string_is_hiragana('いいひと。'))
    
    ひらがな True
    カタカナ False
    漢字 False
    ひらがなもじ「ゆ」 False
    いいひと。 False
    

    You could also allow whitespace:

    import string

    def string_is_hiragana_or_whitespace(s: str) -> bool:
        # string.whitespace contains ASCII whitespace only (space, tab, newline, ...).
        return all(c in string.whitespace or char_is_hiragana(c) for c in s)
    
    print('ひらがな  ひらがな', string_is_hiragana_or_whitespace('ひらがな  ひらがな'))
    
    ひらがな  ひらがな True
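
    Japanese text often uses the ideographic (full-width) space U+3000, which is not an ASCII whitespace character. A minimal sketch of a more permissive check follows, assuming you want any Unicode whitespace plus a small, caller-chosen set of punctuation to pass. The function name and the default punctuation set are hypothetical choices for illustration, not part of the original answer.

```python
def string_is_hiragana_or_allowed(s: str, extra: str = '。、「」') -> bool:
    # Accept a character if it is in the Hiragana block, is any Unicode
    # whitespace (str.isspace() is True for U+3000), or is listed in `extra`.
    return all(
        '\u3040' <= c <= '\u309F' or c.isspace() or c in extra
        for c in s
    )
```

    Passing extra='' restores a whitespace-only allowance, so the punctuation set stays an explicit decision rather than a hidden default.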
    

    But I would avoid going too far down this path of ever more specific checks: there are many hard problems, such as encoding, half-width characters, emoji, CJK code blocks and loan words.
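
    One looser alternative, sketched here under assumptions, is to measure how much of a string is hiragana instead of requiring every character to pass. The names hiragana_ratio and looks_mostly_hiragana and the 0.8 threshold are hypothetical; a suitable threshold depends on the pages being crawled.

```python
def hiragana_ratio(s: str) -> float:
    # Ignore whitespace, then compute the share of assigned hiragana letters
    # (U+3041 to U+3096) among the remaining characters.
    chars = [c for c in s if not c.isspace()]
    if not chars:
        return 0.0
    return sum('\u3041' <= c <= '\u3096' for c in chars) / len(chars)


def looks_mostly_hiragana(s: str, threshold: float = 0.8) -> bool:
    # Assumption: a fixed threshold is good enough for filtering crawled text.
    return hiragana_ratio(s) >= threshold
```

    A ratio tolerates stray punctuation or an occasional kanji without an ever-growing list of allowed characters, at the cost of accepting some mixed text.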