I have a file like the one below:
$ xxd 1line
0000000: 3939 ba2f 6f20 6f66 0d0a 99./o of..
I would like to read this one line in C++:
#include <codecvt>
#include <iostream>
#include <locale>
#include <fstream>
#include <string>
int main(int argc, char** argv) {
    std::wifstream wss(argv[1], std::ios::binary);
    wss.seekg(std::ios_base::end);
    const auto fileSize = wss.tellg();
    wss.seekg(std::ios_base::beg);
    // std::locale utf8_locale(wss.getloc(), new std::codecvt_utf8<wchar_t, 0x10FFFF, std::consume_header>);
    // wss.imbue(utf8_locale);
    std::wstring wline;
    std::getline(wss, wline);
    std::cout << "filelen: " << fileSize << std::endl;
    std::cout << "strlen: " << wline.size() << std::endl;
    std::wcout << "str: " << wline << std::endl;
    return 0;
}
I compile it as follows:
$ g++ -std=c++11 imbue_issue.cpp
First, it seems that wss.seekg(std::ios_base::end) does not move the file position to the end of the file:
$ ./a.out 1line
filelen: 2
strlen: 9
str: 99?/o of
Second, when I uncomment the locale-related lines, getline reads only 2 characters:
$ ./a.out 1line
filelen: 2
strlen: 2
str: 99
My compiler:
$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/c++/4.2.1
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Does anyone have an idea why these issues occur with this file?
The problem is how you call the seekg function. When you call it with one argument, that argument is treated as an absolute position from the beginning of the file, so you seek to whatever numeric value std::ios::end has, which happens to be 2 in your case. That is also why tellg reports a "file length" of 2.
Instead you should use the two-argument overload:
wss.seekg(0, std::ios_base::end); // Seek to offset 0 from the end
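As a minimal sketch of how the two-argument overload can be used to measure a file, something like the following should work. The helper name file_size_bytes is made up for illustration, and it uses a narrow std::ifstream on the assumption that you want the size in bytes:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from the original post): returns the size of the
// file at `path` in bytes, or -1 if the file cannot be opened or measured.
std::streamoff file_size_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return -1;  // could not open the file
    }
    in.seekg(0, std::ios::end);              // offset 0 relative to the end
    const std::streamoff size = in.tellg();  // tellg yields -1 if the seek failed
    return size;
}
```

If you keep reading from the same stream afterwards, rewind it with seekg(0, std::ios::beg), again using the two-argument form.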
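You will still have problems reading the file using wide-character types, since the contents don't seem to be wide characters, or valid UTF-8 for that matter. UTF-8 is a multi-byte narrow character encoding, and the third byte of your file (0xba) is a UTF-8 continuation byte with no lead byte before it. With the codecvt_utf8 facet imbued, the conversion fails at that byte, which would explain why getline stops after reading only "99".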
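One way around this is to read the line as raw bytes with a narrow stream and decode it yourself once you know the real encoding. The sketch below is only an illustration: both helper names are hypothetical, and the conversion assumes the file is ISO-8859-1 (Latin-1), which the dump does not prove.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads the first line of `path` as raw bytes and
// drops the trailing '\r' that a CRLF line ending leaves behind.
std::string read_first_line_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string line;
    std::getline(in, line);  // reads up to, and discards, the '\n'
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
    return line;
}

// Assumption: the bytes are ISO-8859-1, where every byte value is equal to
// its Unicode code point, so widening each byte is a valid conversion.
std::wstring latin1_to_wide(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char c : bytes) {
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

For the file in the question, read_first_line_bytes would return the 8 bytes before the CRLF. If the file turns out to use a different single-byte encoding, only latin1_to_wide needs to be replaced.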