I am trying to read a UTF-8-encoded file using ICU4C on Windows with msvc11. I need to determine the size of the buffer to build a UnicodeString. Since there is no fseek-like function in the ICU4C API, I thought I could use the underlying C FILE:
#include <unicode/ustdio.h>
#include <stdio.h>
/*...*/
UFILE *in = u_fopen("utfICUfseek.txt", "r", NULL, "UTF-8");
FILE* inFile = u_fgetfile(in);
fseek(inFile, 0, SEEK_END); /* Access violation here */
int size = ftell(inFile);
auto uChArr = new UChar[size];
There are two problems with this code:
So the questions are:
Edit:
Here is a possible solution (tested on msvc11 and gcc 4.8.1), based on the first answer and the C++11 standard. A few things from ISO/IEC 14882:2011:
So, to make this portable to platforms where the implementation-defined size of char is 1 byte = 8 bits (I don't know of any where this isn't true), we can read the UTF-8 bytes into chars using an unformatted input operation:
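#include <fstream>
#include <unicode/unistr.h>
/*...*/
std::ifstream is;
// Binary mode: no newline translation, so tellg() matches the bytes read() returns
is.open("utfICUfSeek.txt", std::ios::in | std::ios::binary);
is.seekg(0, is.end);
std::streamsize strSize = is.tellg();
auto inputCStr = new char[strSize + 1];
inputCStr[strSize] = '\0'; // add null character at the end
is.seekg(0, is.beg);
is.read(inputCStr, strSize);
is.close();
UnicodeString uStr = UnicodeString::fromUTF8(inputCStr);
delete[] inputCStr; // fromUTF8 copies the data, so the temporary buffer can be freed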
What troubles me is that I have to create an additional buffer for chars and only then convert them to the required UnicodeString.
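On the sizing question itself: the byte count of a UTF-8 file is always an upper bound on the number of UTF-16 code units (UChar) it decodes to, because 1- to 3-byte sequences become one unit and 4-byte sequences become two. A minimal sketch of computing the exact count, assuming well-formed UTF-8; the helper name utf16_units_for_utf8 is hypothetical and not part of ICU:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU function): counts the UTF-16 code units
// needed to hold a UTF-8 byte sequence. Assumes the input is well formed.
std::size_t utf16_units_for_utf8(const std::string& bytes)
{
    std::size_t units = 0;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) == 0x80)
            continue;                    // continuation byte, already counted via its lead byte
        units += (c >= 0xF0) ? 2 : 1;    // 4-byte sequences need a surrogate pair
    }
    return units;
}
```

This only helps when the target buffer is allocated by hand; UnicodeString::fromUTF8 sizes its own storage.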
This is an alternative to using ICU. With the standard std::fstream classes you can read all or part of the file into a std::string, then iterate over it with a Unicode-aware iterator such as utf-iter: http://code.google.com/p/utf-iter/
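#include <cerrno>
#include <fstream>
#include <iterator>
#include <string>

// Reads the whole file as raw bytes; throws errno (an int) if it cannot be opened.
std::string get_file_contents(const char *filename)
{
    std::ifstream in(filename, std::ios::in | std::ios::binary);
    if (in)
    {
        std::string contents;
        in.seekg(0, std::ios::end);
        contents.reserve(in.tellg()); // pre-allocate to the file size
        in.seekg(0, std::ios::beg);
        contents.assign((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
        in.close();
        return(contents);
    }
    throw(errno);
}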
Then, in your code:
std::string myString = get_file_contents( "foobar" );
unicode::iterator< std::string, unicode::utf8 /* or utf16/32 */ > iter = myString.begin();
while ( iter != myString.end() )
{
    ...
    ++iter;
}
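If pulling in a third-party iterator is not an option, the same walk over code points can be sketched with the standard library alone. This is a rough illustration, not the utf-iter implementation: decode_utf8 is a hypothetical helper, it assumes well-formed UTF-8, and it does no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 string into Unicode code points.
// Assumes well-formed input; malformed sequences are not detected.
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        std::uint32_t cp;
        // The lead byte gives the sequence length and the top bits of the code point.
        if (lead < 0x80)      { len = 1; cp = lead; }
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }
        else                  { len = 4; cp = lead & 0x07; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

The result of get_file_contents can be passed straight to this function, since the file is read in binary mode and the bytes are untouched.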