I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE at the beginning and then ASCII character output with nulls between characters (i.e. "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8, using UCS-2LE as the input format and UTF-8 as the output format... it works great.
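For reference, that conversion step is a one-liner; the file names here are just placeholders:

iconv -f UCS-2LE -t UTF-8 report.txt > report-utf8.txt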
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string and wstring-based versions of getline; while they read the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.

How do I read the UCS-2LE data into a C++ string and extract the data values? I have looked at Boost and ICU, as well as numerous Google searches, but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
    wstring field1;
    field1 = srcBuf.substr(12, 12);
    ...
    ...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using Boost (or something else) to read these strings from the file and convert them to a fixed-width representation for internal use?

BTW, I am on a Mac using Eclipse and gcc. Is it possible my STL does not understand wide character strings?
Thanks!
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;

int main()
{
    wstring s1 = L"Hello, world";
    wstring s2 = s1.substr(3,5);
    wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale's narrow encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.
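That said, if your compiler supports C++11, std::codecvt_utf16 (deprecated since C++17 but still available) can be imbued onto the stream to perform exactly this decode. A minimal sketch, assuming the file really is UCS-2LE/UTF-16LE with the 0xFF 0xFE BOM as described:

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main(int argc, char* argv[])
{
    if (argc < 2) return 1;

    std::wifstream srcFile;
    // Imbue before opening so the facet applies from the first read.
    // little_endian matches the 0xFF 0xFE BOM; consume_header skips it.
    srcFile.imbue(std::locale(srcFile.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));
    srcFile.open(argv[1], std::ios_base::in | std::ios_base::binary);

    std::wstring srcBuf;
    while (std::getline(srcFile, srcBuf))
    {
        // Each wchar_t now holds one decoded character, so substr
        // offsets count characters rather than bytes.
        std::wstring field1 = srcBuf.substr(12, 12);
        std::wcout << field1 << std::endl;
    }
}

With the facet in place, each two-byte unit in the file becomes a single wchar_t, so the start and length arguments to substr behave as you expect.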