Tags: c++, visual-c++, wofstream, wifstream

How to get a single character from a UTF-8 encoded Urdu string written in a file?


I am working on Urdu-Hindi translation/transliteration. My objective is to translate an Urdu sentence into Hindi and vice versa. I am using Visual C++ 2010 with the C++ language. I have written an Urdu sentence in a text file saved in UTF-8 format. Now I want to read that file one character at a time so that I can convert each character into its equivalent Hindi character. When I try to read a single character from the input file and write it to the output file, I get some unknown, ugly-looking character in the output file. Kindly help me with proper code. My code is as follows:

#include <iostream>
#include <fstream>
#include <cwchar>
#include <cstdlib>
using namespace std;

void main()
{
    wchar_t arry[50];
    wifstream inputfile("input.dat", ios::in);
    wofstream outputfile("output.dat");

    if (!inputfile)
    {
        cerr << "File not open" << endl;
        exit(1);
    }

    // I am using this while loop just to make sure that copying the
    // Urdu text from one file to another works; when I try to pick
    // only one character from the file, it does not work.
    while (!inputfile.eof())
    {
        inputfile >> arry;
    }

    // I want to get the Urdu character at each index so that I can
    // convert it into its equivalent Hindi character.
    int i = 0;
    while (arry[i] != '\0')
    {
        outputfile << arry[i] << endl;
        i++;
    }

    inputfile.close();
    outputfile.close();
    cout << "Hello world" << endl;
}

Solution

  • Assuming you are on Windows, the easiest way to get "useful" characters is to read a larger chunk of the file (for example a line, or the entire file) and convert it to UTF-16 using the MultiByteToWideChar function with the "pseudo" code page CP_UTF8. In many cases no further decoding of the UTF-16 is required, but I don't know about the languages you are referring to; if you expect non-BMP characters (with code points above 65535), you might want to decode the UTF-16 (or decode the UTF-8 yourself) so that you don't have to deal with characters that occupy two 16-bit words (surrogate pairs).

    You can also write your own UTF-8 decoder if you prefer. It's not complicated: it just takes some bit-juggling to extract the payload bits from the input bytes and assemble them into the final Unicode code point.
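
To illustrate the bit-juggling, here is a minimal sketch of such a decoder. The function name decode_utf8 and the choice to emit U+FFFD for malformed or truncated input are my own assumptions, and the sketch does not reject overlong encodings or surrogate values, so treat it as a starting point rather than a validating decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Malformed or truncated sequences become U+FFFD (a choice made for this sketch).
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    const std::uint32_t replacement = 0xFFFD;
    std::vector<std::uint32_t> out;
    std::size_t i = 0;

    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }  // 11110xxx
        else { out.push_back(replacement); ++i; continue; }               // stray byte

        if (extra > bytes.size() - i - 1) {   // sequence runs past the end
            out.push_back(replacement);
            break;
        }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);                    // append 6 payload bits
        }

        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(replacement); ++i; }
    }
    return out;
}
```

For example, the bytes D8 B3 (the Urdu letter seen, U+0633) decode to the single value 0x0633, and E0 A4 95 (the Devanagari letter ka, U+0915) decode to 0x0915.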

    HINT: Windows also has a NormalizeString() function, which you can use to make sure the characters from the file are what you expect. This can be used to transform characters that have several representations in Unicode into their "canonical" representation.

    EDIT: If you read up on the UTF-8 encoding, you will see that you can read the first byte, work out from it how many more bytes follow, read those as well, and pass the whole sequence to MultiByteToWideChar or to your own decoder (although your own decoder could just read from the file directly, of course). That way you really can "read one char at a time".