c++ file utf-8 ifstream txt

C++: how can I work with and validate a UTF-8 character?


I'm working on a project where I have to work with special characters.

I am working on Windows 10, but I need my solution to work on Linux as well. What I need is to read a UTF-8 encoded text file, run certain validations, and display the text of the file on the screen.

I am working with Dev-C++ 5.11.

I currently have no major problem reading the file with the special characters and displaying it on the console. My problem lies in obtaining a special character on its own so that I can validate it.

At the moment the .txt that I am trying to read contains the following information:

Inicio
D1
Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff

The character I'm having trouble with is 'ó'.

I am using the following code:

#include <cstdlib>   // exit, system
#include <fstream>
#include <iostream>
#include <locale.h>
#include <string>
#include <windows.h> // SetConsoleOutputCP
#define CP_UTF8 65001

using std::cout;

int main(){
    
    std::ifstream file;
    std::string text;
    
    if (!SetConsoleOutputCP(CP_UTF8)) {
        std::cerr << "error: unable to set UTF-8 codepage.\n";
        return 1;
    }
    
    file.open("entryDisciplineESP.txt");
    
    int line = 0;
    
    if (file.fail()){
        
        cout<<"Error. \n";
        
        exit(1); 
        
    }

    while(std::getline(file,text)){ 
        
        if(line == 2){
            
            cout<<text[5]<<"\n";
            
        }
        
        std::cout<<text<<"\n";
        
        line++;
        
    }
    
    cout<<"\n";
    
    system("Pause");
    return 0;
}

I am getting the following from the console:

Inicio
D1

Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff

My problem is that when I try to print the character 'ó' on its own, a blank space is printed instead. I need to work with that character to do validations: for example, I need to verify that the text contains no digits or other unwanted characters such as '?'. I would also like to do other processing to make the work easier.

How can I achieve what I need? I have read about converting the text from UTF-8 to UTF-16, but I haven't managed to do that and I don't know whether it would work. Any suggestions?

I appreciate all help in advance.

EDIT 1.

Seeing that the general recommendation is to convert from UTF-8 to UTF-32 for the validation work, I managed to include the <codecvt> header (now using Dev-C++ 6.3) and implemented the following recommended function for testing:


std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

I have updated the conditional so that it calls the function:

if(line == 2){

    std::cout<<text[5]<<"\n";
    std::wstring a = utf8_to_ws(text);
    std::wcout<<a<<"\n";

}

Now I am getting the following output in the console:

Inicio
D1

Biatln
Biatlón
S1
255
E1
Esprint 7,5 km (M); 100; 200
E2
Persecucion 10 km (M); 100; 200
ff

For some reason it keeps omitting the 'ó' character. I would appreciate help solving this problem.


Solution

  • First of all, note that switching the console encoding to UTF-8 has a limitation: not all characters can be displayed directly. Using SetConsoleOutputCP(CP_UTF8) therefore doesn't ensure that everything will be rendered on the screen; a character that cannot be displayed is replaced by some other, odd-looking character.

    Instead of SetConsoleOutputCP(CP_UTF8), you would set the default locale for your system (e.g. std::setlocale(LC_ALL, "")). This will cause the accented vowels to appear on the console, just like any other printable character.

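    Then, I would do the conversion using the function you provided above (utf8_to_ws), which transforms the UTF-8 string into a wide string (std::wstring). Note that wchar_t is 16 bits wide on Windows and 32 bits wide on Linux, so the result holds 16-bit units on Windows and UTF-32 code points on Linux.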

    Finally, to avoid making your life complicated, you would print the converted wide string with the statement below. The %ls format expects a const wchar_t*, so pass the result of utf8_to_ws, not the original UTF-8 std::string:

    std::printf("%ls", utf8_to_ws(text).c_str());
    

    This will do the rest for you, allowing you to print the vast majority of characters on the console (all those that are printable for your system).

    If you do end up using SetConsoleOutputCP(CP_UTF8), you will be able to print a few more characters than with the local encoding alone, but this call is Windows-specific rather than standard C++, so you would need to look for alternatives in environments other than Windows.
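
As for why text[5] prints a blank: in UTF-8, 'ó' is stored as two bytes (0xC3 0xB3), so text[5] is only the first half of the character. For the validation itself, one portable option that avoids <codecvt> (deprecated since C++17) is to decode the bytes into code points by hand. The following is a minimal sketch, not a definitive implementation. The names decode_utf8, encode_utf8 and is_valid_name are hypothetical, and the rule "no ASCII digits and no '?'" is only an assumption taken from the example in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode a UTF-8 byte string into Unicode code points (UTF-32).
// Returns false on malformed input: invalid lead byte, truncated sequence,
// bad continuation byte, overlong form, surrogate, or value above U+10FFFF.
bool decode_utf8(const std::string& in, std::u32string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else return false;  // stray continuation byte or invalid lead byte

        if (len > in.size() - i) return false;  // sequence cut off at end of input

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        // Smallest code point that legitimately needs `len` bytes.
        static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;                       // overlong form
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;

        out.push_back(cp);
        i += len;
    }
    return true;
}

// Encode one code point back to UTF-8, e.g. to print a single character
// on a UTF-8 console. Assumes cp is a valid Unicode scalar value.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical validation rule (an assumption based on the question):
// the line must be well-formed UTF-8 and contain no ASCII digit and no '?'.
bool is_valid_name(const std::string& utf8_line) {
    std::u32string cps;
    if (!decode_utf8(utf8_line, cps)) return false;
    for (char32_t cp : cps) {
        if ((cp >= U'0' && cp <= U'9') || cp == U'?') return false;
    }
    return true;
}
```

With this in place, the loop from the question could decode text on line 2, index the resulting std::u32string at position 5 to get U+00F3, and print encode_utf8 of that code point with std::cout. Because the code works on plain bytes and char32_t, it behaves the same on Windows and Linux regardless of the size of wchar_t.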