I am having some trouble with Unicode characters, and I was wondering if my code below is done in a proficient way. In short, I want to enter 2 words: the first one is blue and the second one is blå. They will be saved in two different text files and then the program will read from the files and print them correctly in the terminal. I am mainly interested in improvements of lines regarding Unicode, _setmode, wide characters etc. Here is the code:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include <string.h>
#include <io.h>
#include <fcntl.h>
#define _O_U16TEXT 0x20000
#define _O_DEFAULT 0x4000
#define SIZE 1000
typedef struct {
    wchar_t sweWord[SIZE]; //sweWord=Swedish Word
    char engWord[SIZE]; //engWord=English Word
} Word;
void set_mode_to_UTF16() {
    fflush(stdin);
    fflush(stdout);
    _setmode(_fileno(stdin), _O_U16TEXT);
    _setmode(_fileno(stdout), _O_U16TEXT);
}
void set_mode_to_default() {
    _setmode(_fileno(stdin), _O_DEFAULT);
    _setmode(_fileno(stdout), _O_DEFAULT);
}
Word enterWord() {
    Word aWord;
    printf("Enter english word \"blue\": ");
    scanf("%s", aWord.engWord);
    printf("You entered: %s\n", aWord.engWord);
    set_mode_to_UTF16();
    wprintf(L"Enter swedish word \"blå\": ");
    wscanf(L"%ls", aWord.sweWord);
    wprintf(L"You entered: %ls\n", aWord.sweWord);
    set_mode_to_default();
    return aWord;
}
void saveWord(Word aWord) {
    FILE *pFile1;
    FILE *pFile2;
    if((pFile1=fopen("ENGWORD.txt", "w"))) {
        fprintf(pFile1, "%s\n", aWord.engWord);
        fclose(pFile1); //close only a stream that was actually opened
    } else {
        printf("Failed to save ENGWORD!\n");
    }
    set_mode_to_UTF16();
    if((pFile2=fopen("SWEWORD.txt", "w"))) {
        _setmode(_fileno(pFile2), _O_U16TEXT);
        fwprintf(pFile2, L"%ls\n", aWord.sweWord);
        fclose(pFile2);
    } else {
        wprintf(L"Failed to save SWEWORD!\n");
    }
    set_mode_to_default();
}
Word loadWord() {
    Word aWord;
    FILE *pFile1;
    FILE *pFile2;
    if((pFile1=fopen("ENGWORD.txt", "r"))) {
        fscanf(pFile1, "%s", aWord.engWord);
        fclose(pFile1); //close only a stream that was actually opened
    } else {
        printf("Failed to load ENGWORD!\n");
    }
    set_mode_to_UTF16();
    if((pFile2=fopen("SWEWORD.txt", "r"))) {
        _setmode(_fileno(pFile2), _O_U16TEXT);
        fwscanf(pFile2, L"%ls\n", aWord.sweWord);
        fclose(pFile2);
    } else {
        wprintf(L"Failed to load SWEWORD!\n");
    }
    set_mode_to_default();
    return aWord;
}
int main(void) {
    int defaultMode;
    defaultMode=_setmode(_fileno(stdin), _O_BINARY);
    printf("Default mode is %d\n", defaultMode);
    _setmode(_fileno(stdin), defaultMode); //mode is now in default.
    Word wordToSave;
    Word wordToLoad;
    wordToSave=enterWord();
    saveWord(wordToSave);
    wordToLoad=loadWord();
    printf("Loaded english word is %s\n", wordToLoad.engWord);
    set_mode_to_UTF16();
    wprintf(L"Loaded swedish word is %ls\n", wordToLoad.sweWord);
    set_mode_to_default();
    printf("Done! Signing off...\n");
    return 0;
}
My output is:
Default mode is 16384
Enter english word "blue": blue
You entered: blue
Enter swedish word "blå": blå
You entered: blå
Loaded english word is blue
Loaded swedish word is blå
Done! Signing off...
There are 2 parts I am unsure of. Firstly, a quote from this website:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setmode?view=msvc-170
If you write data to a file stream, explicitly flush the code by using fflush before you use _setmode to change the mode.
I did this in the set_mode_to_UTF16 function, but not in the set_mode_to_default function. Why should there be a difference between them?
Secondly, I have seen a lot of posts where setlocale is used in order to change the locale to UTF-16. However, in my code I don't use it, which makes me wonder if I have done something wrong.
I wonder if I could get some input and feedback on my code as mentioned earlier and, if possible, help me get a better understanding of the 2 problems I was wondering about. Thanks in advance!
I use Windows 11, VSCode and MINGW-32.
Regarding fflush, I'll quote the man page:
For output streams, fflush() forces a write of all user-space buffered data for the given output or update stream via the stream's underlying write function.
For input streams associated with seekable files (e.g., disk files, but not pipes or terminals), fflush() discards any buffered data that has been fetched from the underlying file, but has not been consumed by the application.
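The behavior of fflush on input streams is discussed in other questions on Stack Overflow. Note that the C standard itself leaves fflush on an input stream undefined, so the input-stream behavior quoted above is a platform-specific extension and fflush(stdin) is not portable. See also comments by William Pursell below.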
This call can't hurt, and I'd certainly fflush stdout when switching in both directions. It is important to note, though, that you are implicitly flushing file buffers when you call fclose, which I think is more in the spirit of what the documentation you shared is describing when it says "If you write data to a file stream, explicitly flush the code by using fflush before you use _setmode to change the mode."
For reference, the fclose man page says the following:
The fclose() function flushes the stream pointed to by stream (writing any buffered output data using fflush()) and closes the underlying file descriptor.
I suspect the important flushing of buffered data that could potentially be corrupted by switching modes is actually occurring in your fclose calls for the most part, because you are saving the words in two different files. It would be interesting to see if you can save them in a single file by calling fflush instead and writing both sets of data to the same file!
For your second question, setting the locale takes care of many more steps than what you are doing manually here AND provides additional functionality through library functions.
Chapter 7 of the GNU C Library Reference Manual is a great place to read up more on locales and why you might use them for what you're doing here. Section 7.1 has the following information about what changing a locale can influence beyond just UTF encoding width:
Each locale specifies conventions for several purposes, including the following:
- What multibyte character sequences are valid, and how they are interpreted (see Character Set Handling).
- Classification of which characters in the local character set are considered alphabetic, and upper- and lower-case conversion conventions (see Character Handling).
- The collating sequence for the local language and character set (see Collation Functions).
- Formatting of numbers and currency amounts (see General Numeric).
- Formatting of dates and times (see Formatting Calendar Time).
- What language to use for output, including error messages (see Message Translation).
- What language to use for user answers to yes-or-no questions (see Yes-or-No Questions).
- What language to use for more complex user input. (The C library doesn’t yet help you implement this.)
Some aspects of adapting to the specified locale are handled automatically by the library subroutines. For example, all your program needs to do in order to use the collating sequence of the chosen locale is to use ‘strcoll’ or ‘strxfrm’ to compare strings.
I wanted to update this answer with something that I've just tried out myself in another context, which sheds more light on why you might consider using a locale system.
Consider how in the following code, changing locales allows for simple mapping of characters within that locale (for example, L'é' to L'É'). Though I haven't seen them personally, the libc reference manual states that locales are free to define transformations other than toupper and tolower; those two are specially guaranteed to exist in any locale.
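#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    /* "C.UTF-8" may not be available on every platform, so check the result */
    if (setlocale(LC_ALL, "C.UTF-8") == NULL)
        fprintf(stderr, "C.UTF-8 locale not available\n");
    const wchar_t *wide_chars = L"Here ÃrE sõmé chAracters";
    size_t len = wcslen(wide_chars);
    wctrans_t to_upper = wctrans("toupper"); /* look the mapping up once */
    printf("wchar mapping demo string: %ls\n", wide_chars);
    printf("Converting to upper: ");
    for(size_t i = 0; i < len; i++)
    {
        /* demo of how to do it for any class conversion */
        printf("%lc", towctrans(wide_chars[i], to_upper));
    }
    printf("\n");
    printf("Converting to lower: ");
    for(size_t i = 0; i < len; i++)
    {
        /* demo of how to do it using built ins */
        printf("%lc", towlower(wide_chars[i]));
    }
    printf("\n");
    return 0;
}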
Output:
wchar mapping demo string: Here ÃrE sõmé chAracters
Converting to upper: HERE ÃRE SÕMÉ CHARACTERS
Converting to lower: here ãre sõmé characters