Where can I find lists of the most common characters for the simplified Chinese, Japanese and Korean languages?


I am developing a simple mobile puzzle game and I am currently working on localizing it into Simplified Chinese, Japanese and Korean. I am planning to use the noto-cjk font collection, but the font files are very large because of the number of glyphs they contain. Since my game doesn't have much text, I doubt I need all of those glyphs.

I have a way of creating a font subset of only the characters I use in my game, but I would like to have more than the bare minimum, hence the title of this question.

Where can I find the 3000-5000 most commonly used characters for each of those languages specifically?


Solution

  • Unicode has almost 94,000 CJK ideographs. However, based on expert input from the Chinese, Japanese and Korean standards bodies, a subset called "II Core" has been defined, with 9,810 ideographs that are considered (as of 2001) a minimal set required for East Asian markets. See http://www.unicode.org/reports/tr38/#kIICore for additional info.

    There is also an updated subset that was defined in 2020, UnihanCore2020. See http://www.unicode.org/reports/tr38/#kUnihanCore2020 for additional info.

    You can find PDFs with these repertoires at http://www.unicode.org/charts/unihan.html. This information can also be extracted programmatically from Unihan data files (in Unihan.zip) that are part of the Unicode Character Database—see https://www.unicode.org/ucd/.

    These may be more than you're looking for, but you'll likely want some subset of the II Core set.
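
As a rough illustration of the programmatic route, here is a minimal Python sketch that collects the characters tagged for each region. It assumes the usual Unihan line layout of three tab-separated columns (`U+XXXX`, field name, value) and that the field value lists one letter per source, with G for China, J for Japan and K for South Korea; check both assumptions against TR38 for the Unihan version you download. The function name `chars_by_region` is made up for this example, and which file inside Unihan.zip carries each field (I believe `Unihan_IRGSources.txt` for kIICore) should be verified rather than taken from here.

```python
from collections import defaultdict


def chars_by_region(lines, field="kIICore", regions="GJK"):
    """Group characters by the region letters found in a Unihan field value.

    `lines` is any iterable of strings in the Unihan data-file format.
    Returns a dict mapping each region letter to a set of characters.
    """
    result = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the comment header of the data file.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue
        codepoint, _, value = parts
        # "U+4E00" -> the character itself.
        char = chr(int(codepoint[2:], 16))
        # Assumption: the value contains one letter per region. For kIICore
        # a leading priority letter (A, B or C) may be present; it does not
        # clash with the G/J/K letters checked here.
        for region in regions:
            if region in value:
                result[region].add(char)
    return result


if __name__ == "__main__":
    import sys

    # Usage: python script.py path/to/unihan_file.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        groups = chars_by_region(handle)
    for region, chars in sorted(groups.items()):
        print(region, len(chars))
```

Passing `field="kUnihanCore2020"` should work the same way for the newer subset. Each resulting set can be written to a text file and fed into whatever subsetting tool you already use, merged with the characters that actually appear in the game.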