character-encoding command-line-interface stanford-nlp

Quickly generate and sort a full encoding dictionary with the corresponding primary radicals

Chinese characters, according to the Unihan encoding scheme, can be indexed by their primary radical.

The Stanford Word Segmenter has a command that does this, as described in its documentation:

java -cp stanford-segmenter-VERSION.jar
edu.stanford.nlp.trees.international.pennchinese.RadicalMap
-infile whitespace_seperated_chinese_characters.input
> each_character_denoted_by_radical.output

I want to create a comprehensive table of Chinese characters organized by their primary radical. I suppose I can use these functions:

public static java.util.Set getChars(char ch)

What are the Characters with this primary radical?

public static char getRadical(char ch)

What is the primary radical of this char?

But my question is: what is the most efficient way to accomplish this goal, and how can I output the result as a table, à la this Wikipedia table (not exactly like that table, but suggestive of it)?

That Stanford tool uses the CC-CEDICT dictionary. Could I just download that dictionary and feed it in? If so, how?

Maybe the Stanford tool already contains this as part of its code, but how do I access it?

Solution

This information is encoded in exactly the form you want in the RadicalMap source code.

See the static initializer:

String[] radLists = {"\u4e00\u4e00\u4e01\u4e02\u4e03...", "...", ..., };

Each string in this list has as its first character a radical, and the remaining characters have that first character as their primary radical.
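
As a rough sketch of how strings in that layout could be turned into a lookup table: the class name RadicalTable and the method parse below are made up for illustration and are not part of the Stanford code. The first sample string is the truncated prefix quoted above, and the second is an illustrative entry for the water radical, not a copy of the real data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Stanford distribution.
public class RadicalTable {

    // Each input string follows the radLists layout: the first char is the
    // radical, every later char is a character filed under that radical.
    static Map<Character, List<Character>> parse(String[] radLists) {
        // TreeMap keeps the radicals sorted by code unit value.
        Map<Character, List<Character>> table = new TreeMap<>();
        for (String entry : radLists) {
            if (entry.isEmpty()) {
                continue; // nothing to index
            }
            char radical = entry.charAt(0);
            List<Character> members =
                table.computeIfAbsent(radical, r -> new ArrayList<>());
            for (int i = 1; i < entry.length(); i++) {
                members.add(entry.charAt(i));
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Sample data only; paste the real array from the source here.
        String[] sample = {
            "\u4e00\u4e00\u4e01\u4e02\u4e03",
            "\u6c34\u6c34\u6c38\u6c42"
        };
        parse(sample).forEach((radical, members) -> {
            StringBuilder row = new StringBuilder().append(radical).append('\t');
            for (char c : members) {
                row.append(c).append(' ');
            }
            System.out.println(row.toString().trim());
        });
    }
}
```

This works on char values, like getRadical(char) does, so it assumes every character sits in the Basic Multilingual Plane.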

It's a package-local static variable, so there isn't exactly a clean way to access it programmatically, but you could easily copy its definition out of the source code and use it for whatever you need.
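
Once the array has been copied out, one possible way to get table-like output is to write one tab-separated row per radical. This is only a sketch: RadicalTsv, toRow and toRows are hypothetical names, the column layout is an assumption, and the sample strings stand in for the real array.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical formatter for strings copied from the radLists array.
public class RadicalTsv {

    // Builds "U+XXXX<TAB>radical<TAB>members", with members sorted.
    static String toRow(String entry) {
        char radical = entry.charAt(0);
        List<Character> members = new ArrayList<>();
        for (int i = 1; i < entry.length(); i++) {
            members.add(entry.charAt(i));
        }
        Collections.sort(members);
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("U+%04X", (int) radical));
        sb.append('\t').append(radical).append('\t');
        for (char c : members) {
            sb.append(c);
        }
        return sb.toString();
    }

    // Sorts the entries by radical and formats each one as a row.
    static List<String> toRows(String[] radLists) {
        List<String> entries = new ArrayList<>();
        for (String e : radLists) {
            if (!e.isEmpty()) {
                entries.add(e);
            }
        }
        entries.sort((a, b) -> Character.compare(a.charAt(0), b.charAt(0)));
        List<String> rows = new ArrayList<>();
        for (String e : entries) {
            rows.add(toRow(e));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Sample data only; replace with the array taken from the source.
        String[] sample = {
            "\u6c34\u6c34\u6c38\u6c42",
            "\u4e00\u4e00\u4e01\u4e02\u4e03"
        };
        // Force UTF-8 so the characters survive redirection to a file.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("codepoint\tradical\tcharacters");
        for (String row : toRows(sample)) {
            out.println(row);
        }
    }
}
```

Redirecting the output to a file gives something a spreadsheet or a small script can turn into whatever table layout is wanted.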