c++ file file-io ifstream byte-order-mark

C++ - Am I using fin.ignore() incorrectly?


I have a .txt file called "1.txt" that I want to read in. The file starts with 8 BOM characters, and they seem to break the following code:

ifstream fin("1.txt");

string temp = "";

char c = fin.get();

while (!fin.eof())
{
    if (c >= ' ' && c <= 'z')
    {
        temp += c;
    }

    c = fin.get();
}

cout << temp;

This prints nothing, apparently because of something the BOM is doing.

So I decided to use fin.ignore() to skip the BOM characters at the beginning of the file. However, still nothing is printed. Here is my complete program:

#include <iostream>
#include <fstream>
#include <string>
#include <istream>

using namespace std;

int main()
{
    ifstream fin("1.txt");

    if (fin.fail())
    {
        cout << "Fail\n";
    }
    else
    {
        string temp = ""; // Will hold 1.txt's contents.

        fin.ignore(10, ' ');
        // Ignore first 10 chars of the file or stop at the first space char,
        // since the BOM at the beginning is causing problems for fin to read the file.
        // BOM is 8 chars, I wrote 10 to just be safe.

        char c = fin.get();

        while (!fin.eof())
        {
            if (c >= ' ' && c <= 'z') // checks if c stores a standard char.
            {
                temp += c;
            }

            c = fin.get();
        }

        cout << temp;

        // PROBLEM:  No text is printed to the screen from the above command.

        cout << temp.size(); // prints 0
    }
}

My hypothesis is that once the line ifstream fin("1.txt"); has run, it is already too late, because the BOM has probably affected fin by then. So I need to somehow tell fin to ignore the BOM characters before it reads the file, but I can't call fin.ignore() before the fin object has been declared.

Also, I know I can manually delete the BOM from my .txt file, but I'm looking for a solution that only involves writing a C++ program. If I have thousands or millions of .txt files, deleting it manually is not an option. I'm also not looking to download new software, such as Notepad++.

Here is all I have in the file "1.txt":

ÐÏࡱá Hello!

This site's formatting doesn't let me show it, but in the actual file there are about 15 spaces between the BOM and Hello!


Solution

  • According to cppreference, the character with value \x1a terminates input on Windows in text mode. You presumably have such a character right near the beginning. My empty .doc file has one as the 7th byte.

    You should read the file in binary mode:

    std::ifstream fin("1.txt", std::ios::binary);
    

    You can still use ignore to skip a prefix. However, ignoring up to a specific character is fragile, because the binary prefix could itself contain that character. If the prefixes are always the same length, ignoring a fixed number of bytes suffices.

    In addition, you can't rely on looking at the file in Notepad to count the bytes, because quite a few of the characters are invisible. Look at a hex view of the file instead. Many good text editors can do this, or you can use PowerShell's Format-Hex -Path <path> command. For example, here are the first few lines of mine:

    00000000   D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00  ÐÏ.ࡱ.á........
    00000010   00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00  ........>...þ...
    00000020   06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
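
    If installing an editor or using PowerShell is not an option, the same check can be done from C++. The following is only a minimal sketch: hex_prefix is a hypothetical helper name, not part of any library, and it assumes the file fits the pattern above (a short binary prefix whose exact bytes you want to inspect).

    ```cpp
    #include <cstddef>
    #include <fstream>
    #include <iomanip>
    #include <sstream>
    #include <string>

    // Hypothetical helper: returns the first `count` bytes of `path` as
    // space-separated, two-digit, uppercase hex values.
    // Returns an empty string if the file cannot be opened.
    std::string hex_prefix(const std::string& path, std::size_t count)
    {
        // Binary mode, so no byte is translated or treated as end-of-file.
        std::ifstream fin(path, std::ios::binary);

        std::ostringstream out;
        out << std::hex << std::uppercase << std::setfill('0');

        char c;
        for (std::size_t i = 0; i < count && fin.get(c); ++i)
        {
            if (i > 0)
            {
                out << ' ';
            }
            // Convert via unsigned char so bytes >= 0x80 are not sign-extended.
            out << std::setw(2)
                << static_cast<int>(static_cast<unsigned char>(c));
        }
        return out.str();
    }
    ```

    Printing hex_prefix("1.txt", 16) with std::cout would show the first 16 bytes of the file, which makes it easy to spot a \x1a byte and to count how long the prefix really is.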
    

    It's unclear what the best way to remove the prefixes is without more information.
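
As one possible starting point, here is a minimal sketch that combines the two suggestions above: open the file in binary mode and skip a fixed number of bytes. It assumes the prefix always has the same known length (8 bytes in the question). The function name read_printable_after_prefix is hypothetical, and the character filter is copied from the question's code.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: opens `path` in binary mode, skips `prefix_len`
// bytes, and returns the remaining characters that pass the question's
// filter (from ' ' to 'z').
std::string read_printable_after_prefix(const std::string& path,
                                        std::streamsize prefix_len)
{
    // Binary mode stops Windows from treating a 0x1A byte as end-of-file.
    std::ifstream fin(path, std::ios::binary);
    if (!fin)
    {
        throw std::runtime_error("cannot open " + path);
    }

    // With no delimiter argument, ignore() stops only after prefix_len
    // bytes have been skipped or end-of-file is reached.
    fin.ignore(prefix_len);

    std::string text;
    char c;
    // get(c) fails at end-of-file, so the loop condition is also the EOF test.
    while (fin.get(c))
    {
        if (c >= ' ' && c <= 'z')
        {
            text += c;
        }
    }
    return text;
}
```

Calling read_printable_after_prefix("1.txt", 8) from main and printing the result would skip the 8-byte prefix described in the question. If the prefix length varies from file to file, a fixed count is not enough and the files would need to be inspected first.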