fwscanf failing to read UTF-8 CSV file correctly in C


This program may only use the C standard library.

I'm trying to read a UTF-8 encoded CSV file in C using fwscanf, but I'm encountering issues with the reading process. The file contains rows with a string and a float value separated by a comma. Here's a minimal example demonstrating the problem:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

#define MAX_STRING_LENGTH 31

int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    FILE *file = fopen("input.csv", "r, ccs=UTF-8");
    if (file == NULL) {
        fwprintf(stderr, L"Error opening file.\n");
        return 1;
    }

    wchar_t string[MAX_STRING_LENGTH];
    float frequency;
    int row = 0;

    while (!feof(file)) {
        row++;
        int result = fwscanf(file, L"%30[^,],%f,", string, &frequency);
        
        if (result == 2) {
            wprintf(L"Row %d: String = '%ls', Frequency = %.4f\n", row, string, frequency);
        } else if (result == 1) {
            wprintf(L"Row %d: String = '%ls', Frequency not read\n", row, string);
        } else if (result == EOF) {
            break;
        } else {
            wprintf(L"Error reading row %d\n", row);
            wchar_t c;
            // Skip the rest of the line
            while ((c = fgetwc(file)) != L'\n' && c != WEOF);
        }
    }

    fclose(file);
    return 0;
}

Sample input.csv:

hello,1.0000
world,0.5000
how,0.7500
are,0.2500
you,1.0000
?,0.5000

Expected output:

Row 1: String = 'hello', Frequency = 1.0000
Row 2: String = 'world', Frequency = 0.5000
Row 3: String = 'how', Frequency = 0.7500
Row 4: String = 'are', Frequency = 0.2500
Row 5: String = 'you', Frequency = 1.0000
Row 6: String = '?', Frequency = 0.5000

The issue I'm facing is that fwscanf is not reading the file correctly. It either reads incorrect values or fails to read at all. I've tried using different locale settings and file opening modes, but the problem persists.


Solution

  • The argument string is not consistent with the L"%30[^,],%f," format string. Without the l modifier, %[ expects a pointer to a char array, which receives the wide characters read from the stream converted to their multibyte representation. Passing a wchar_t array there does not match the format.

    You want to perform the opposite task: convert the UTF-8 encoded input byte stream into a wide string, i.e. an array of wchar_t. Use the l modifier for this instead: fscanf(file, "%30l[^,],%f,", string, &frequency).

    Unless you need wide strings in the rest of the program, converting from UTF-8 seems unnecessary: the encoding is fully compatible with the CSV syntax and all its variants, because the comma and newline bytes never occur inside a multibyte UTF-8 sequence.
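
As a rough sketch of the two options described above, the helpers below read one row either into a wide string (with %l[) or into a plain UTF-8 byte string (with %[). The names read_row_wide and read_row_bytes are hypothetical, not from the question. The format string also differs from the original in two ways that are assumptions of this sketch: a leading space skips the newline left over from the previous row (and any other leading white space in the text field), and the trailing comma is dropped because the sample rows end with a newline.

```c
#include <stdio.h>
#include <wchar.h>

#define MAX_STRING_LENGTH 31

/* Hypothetical helper: read one "text,number" row into a wide string.
   fscanf reads bytes; %30l[^,] converts them to at most 30 wchar_t
   according to the current locale (a UTF-8 locale is needed for
   non-ASCII text). Returns the fscanf result: 2 on success, EOF at
   end of input. */
int read_row_wide(FILE *file, wchar_t text[MAX_STRING_LENGTH], float *frequency)
{
    /* The leading space (an assumption of this sketch) skips the
       newline left in the stream after the previous row's number. */
    return fscanf(file, " %30l[^,],%f", text, frequency);
}

/* Hypothetical helper: same row format, but the text stays UTF-8.
   Here 30 counts bytes, so fewer than 30 non-ASCII characters may fit. */
int read_row_bytes(FILE *file, char text[MAX_STRING_LENGTH], float *frequency)
{
    return fscanf(file, " %30[^,],%f", text, frequency);
}
```

Both helpers treat the stream as byte-oriented, so a loop that skips the rest of a bad line should probably use getc rather than fgetwc on the same stream. Looping on the return value (for example, while the result is not EOF) also avoids the need to test feof before each read.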