unix, tr, chars

How to separate unique characters from several words in an Indic text file?


I have a plain text file.

> Input: इंजेक्शन इंटरनॅशनल इंटिग्रेटेड इंटिरिअर इंडस्ट्री

All words are separated by one or more spaces. I want to collect all unique characters from the text file. I'm looking for a Unix command; the order of the resulting characters is not important.

> Expected result: इं जे क्श न ट र नॅ श ल इ्रे टे ड टि रिअ र ड स्ट्री

With the command Klaus provided:

    cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'

the result comes out as:

ं अ इ क ग ज ट ड न र ल श सिीॅे्

I don't want to separate horizontal or vertical conjuncts or dependent vowels from their base characters.

I just want to separate the complete characters in a word from each other.

Can we achieve this with UNIX commands?

"base character" + "dependent vowel" = "complete character"

    क + ा = का
    क + ि = कि

Klaus's command works for English text only. It doesn't work with Indic languages such as Hindi, because it splits dependent vowels and conjunct marks from their base characters.

Input: hi1 hello-2 how!3 "are4 ?you5

Result: h i e l o w a r y u 1 2 3 4 5 - ! "

Note: You need Indic support installed in your OS. You can also download the Mangal font from http://hindi-fonts.com/fonts/Mangal


Solution

  • Try this:

    cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'
    

    Or, simplified (taken from fedorqui's comment, thanks! I had never seen & in the replacement part before. Good to learn something new!):

    sed 's/./&\n/g' <file> | sort -u | tr -d '\n'