Tags: utf-8, size, fgets, utf-16, unicode-string

Same .txt files, different sizes?


I have a program that reads from a .txt file.

I use the cmd prompt to execute the program with the name of the text file to read from.

Example: program.exe myfile.txt

The problem is that sometimes it works, sometimes it doesn't.

The original file is 130KB and doesn't work. If I copy/paste the contents into a new file, that file is 65KB and works. If I copy/paste the file itself and rename it, it's 130KB and doesn't work.

Any ideas?

After more testing, this is the code that makes it not work:

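#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char *infile1 = NULL;  /* Must start as NULL for the check below. */
    char *outfile = NULL;
    char tmp[1024] = { 0x0 };
    FILE *in;
    int i, opt = 0, optarg1 = 0;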
    for (i = 1; i < argc; i++)  /* Skip argv[0] (program name). */
    {
        if (strcmp(argv[i], "-sec") == 0)  /* Process optional arguments. */
        {
            opt = 1;  /* This is used as a boolean value. */

            /*
            * The last argument is argv[argc-1].  Make sure there are
            * enough arguments.
            */

            if (i + 1 <= argc - 1)  /* There are enough arguments in argv. */
            {
                /*
                * Increment 'i' twice so that you don't check these
                * arguments the next time through the loop.
                */

                i++;
                optarg1 = atoi(argv[i]);  /* Convert string to int. */

            }
        }
        else /* not -sec */
        {
            if (infile1 == NULL) {
                infile1 = argv[i];
            }
            else {
                if (outfile == NULL) {
                    outfile = argv[i];
                }
            }
        }
     }

     in = fopen(infile1, "r");    

     if (in == NULL) 
     {
           fprintf(stderr, "Unable to open file %s: %s\n", infile1, strerror(errno));
           exit(1);
     }

     while (fgets(tmp, sizeof(tmp), in) != 0)
     {
         fprintf(stderr, "string is %s.", tmp);
         //Rest of code
     }
}

Whether it works or not, the code inside the while loop gets executed.

When it works, tmp actually has a value. When it doesn't work, tmp has no value.

EDIT:

Thanks to sneftel, we know what the problem is. To use fgetws() instead of fgets(), I need tmp to be a wchar_t* instead of a char*. Type casting doesn't seem to work. I tried changing the declaration of tmp to wchar_t tmp[1024] = { 0x0 }; but then I realized that tmp is passed to strtok() elsewhere in my code. Here is what I tried in that function:

//tmp is passed as the first parameter in parse()
void parse(wchar_t *record, char *delim, char arr[][MAXFLDSIZE], int *fldcnt)
{
    if (*record != L'\0')  /* Check for an empty string, not a null pointer. */
    {
        char*p = strtok((char*)record, delim);
        int fld = 0;
        while (p) {
            strcpy(arr[fld], p);
            fld++;
            p = strtok(NULL, delim);
        }
        *fldcnt = fld;
    }
    else
    {
        fprintf(stderr, "string is null");
    }
}

But typecasting to char* in strtok doesn't work either.

Now I'm looking for a way to just convert the file from UTF-16 to UTF-8 so tmp can be of type char*. I found this, which looks like it can be useful, but the example takes its UTF-16 input from the user. How can that input be taken from the file instead? http://www.cplusplus.com/reference/locale/codecvt/out/


Solution

  • It sounds an awful lot like the original file is UTF-16 encoded. When you copy/paste its contents in your text editor and save the result as a new file, the editor uses its default encoding (ASCII or UTF-8). Since an ASCII character takes 2 bytes in a UTF-16-encoded file but only 1 byte in a UTF-8-encoded file, the file size is roughly halved when you save it out. Copying and renaming the file itself keeps the bytes, and therefore the encoding, unchanged, which is why that copy is still 130KB and still doesn't work.

    UTF-16 is fine, but you'll need to use Unicode-aware functions (that is, not fgets) to work with it. If you don't want to deal with all that Unicode jazz right now, and you don't actually have any non-ASCII characters to deal with in the file, just do the manual conversion (either with your copy/paste or with a command-line utility) before running your program.
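
As a rough sketch of the conversion route, the C code below reads a UTF-16 stream that starts with a byte order mark (BOM) and writes the same text as UTF-8, so the existing fgets() and strtok() code can keep working on char buffers. The function names utf8_encode, read_unit, and utf16_to_utf8 are made up for this illustration. The sketch assumes the file begins with a BOM and that the input stream was opened in binary mode ("rb"). A malformed surrogate pair is replaced with U+FFFD rather than reported as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(cp <= 0x10FFFFUL);  /* Largest valid Unicode code point. */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Read one 16-bit code unit. Returns -1 at end of file. */
static long read_unit(FILE *in, int big_endian)
{
    int a = fgetc(in);
    int b = fgetc(in);
    if (a == EOF || b == EOF)
        return -1;
    return big_endian ? (((long)a << 8) | b) : (((long)b << 8) | a);
}

/* Convert a UTF-16 stream that starts with a BOM to UTF-8.
 * Returns 0 on success, or -1 if no UTF-16 BOM was found
 * (the first two bytes have already been consumed in that case). */
int utf16_to_utf8(FILE *in, FILE *out)
{
    int b0 = fgetc(in);
    int b1 = fgetc(in);
    int big_endian;
    long unit;

    if (b0 == 0xFF && b1 == 0xFE)
        big_endian = 0;  /* UTF-16LE, the usual case on Windows. */
    else if (b0 == 0xFE && b1 == 0xFF)
        big_endian = 1;  /* UTF-16BE. */
    else
        return -1;

    while ((unit = read_unit(in, big_endian)) >= 0) {
        unsigned long cp = (unsigned long)unit;
        unsigned char buf[4];
        size_t len;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: combine it with the low surrogate that follows.
             * If the next unit is not a low surrogate, both units are
             * replaced by a single U+FFFD. */
            long low = read_unit(in, big_endian);
            if (low >= 0xDC00 && low <= 0xDFFF)
                cp = 0x10000UL + ((cp - 0xD800) << 10)
                   + ((unsigned long)low - 0xDC00);
            else
                cp = 0xFFFD;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  /* Low surrogate with no high surrogate before it. */
        }

        len = utf8_encode(cp, buf);
        fwrite(buf, 1, len, out);
    }
    return 0;
}
```

One way to use it, under those assumptions: open the original file with fopen(infile1, "rb"), convert it into a stream obtained from tmpfile(), rewind() that stream, and run the existing while (fgets(...)) loop on it. If utf16_to_utf8 returns -1, the file did not start with a UTF-16 BOM, so it is probably already a single-byte or UTF-8 file and can be rewound and read as before.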