
C++ Strip non-ASCII Characters from string


Before you get started: yes, I know this is a duplicate question, and yes, I have looked at the posted solutions. My problem is that I could not get them to work.

bool invalidChar(char c)
{
    return !isprint((unsigned)c);
}
void stripUnicode(string& str)
{
    str.erase(remove_if(str.begin(), str.end(), invalidChar), str.end());
}

I tested this method on "Prusæus, Ægyptians," and it did nothing. I also attempted to substitute isalnum for isprint.

The real problem occurs in another section of my program, where I convert string->wstring->string: the conversion balks if there are Unicode chars in the string->wstring step.

Ref:

How can you strip non-ASCII characters from a string? (in C#)

How to strip all non alphanumeric characters from a string in c++?

Edit:

I would still like to remove all non-ASCII chars regardless, but if it helps, here is where I am crashing:

// Convert to wstring
wchar_t* UnicodeTextBuffer = new wchar_t[ANSIWord.length()+1];
wmemset(UnicodeTextBuffer, 0, ANSIWord.length()+1);
mbstowcs(UnicodeTextBuffer, ANSIWord.c_str(), ANSIWord.length());
wWord = UnicodeTextBuffer; //CRASH

Error dialog:

MSVC++ Debug Library

Debug Assertion Failed!

Program: //myproject
File: f:\dd\vctools\crt_bld\self_x86\crt\src\isctype.c
Line: //Above
Expression: (unsigned)(c+1) <= 256
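
For what it's worth, that assertion is the debug CRT's domain check for the ctype functions: the argument must be EOF or a value representable as unsigned char. On a signed-char build, 'æ' and 'Æ' are negative, and casting them straight to unsigned yields a huge value that fails the check. A minimal sketch of the conventional fix, widening through unsigned char first (a corrected predicate, not the code above):

#include <cctype>

// Widen through unsigned char so isprint only ever sees values in
// [0, 255]. Passing a negative char straight to isprint is undefined
// behavior, and the (unsigned)(c+1) <= 256 assertion is the debug CRT
// catching exactly that.
bool invalidChar(char c)
{
    return !std::isprint(static_cast<unsigned char>(c));
}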

Edit:

Further compounding the matter: the .txt file I am reading from is ANSI-encoded, so everything in it should be valid.
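
Separately, the conversion snippet above never checks the return value of mbstowcs and leaks the new[] buffer. Here is a minimal hardened sketch (the toWide helper is hypothetical, not from the original code) that writes into a std::wstring instead:

#include <cstdlib>
#include <string>

using namespace std;

// Hypothetical helper: convert into a std::wstring buffer and check
// the return value. mbstowcs returns (size_t)-1 when it meets a byte
// sequence that is invalid in the current locale, so an unchecked
// result can leave the destination in an unusable state.
bool toWide(const string& ansiWord, wstring& wWord)
{
    wstring buffer(ansiWord.length(), L'\0');
    size_t converted = mbstowcs(&buffer[0], ansiWord.c_str(), ansiWord.length());
    if (converted == (size_t)-1)
        return false;              // invalid multibyte character
    buffer.resize(converted);      // mbstowcs does not null-terminate here
    wWord = buffer;
    return true;
}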

Solution:

bool invalidChar(char c)
{
    return !(c >= 0 && c < 128);
}
void stripUnicode(string& str)
{
    str.erase(remove_if(str.begin(), str.end(), invalidChar), str.end());
}

If someone else would like to copy/paste this, I can check this question off.
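
For anyone copy/pasting, here is a quick self-contained check against the test string from the question. The filter works whether plain char is signed (non-ASCII bytes fail c >= 0) or unsigned (they fail c < 128); the expected output assumes each stripped character is simply dropped byte-by-byte:

#include <algorithm>
#include <iostream>
#include <string>

using namespace std;

bool invalidChar(char c)
{
    return !(c >= 0 && c < 128);
}

void stripUnicode(string& str)
{
    str.erase(remove_if(str.begin(), str.end(), invalidChar), str.end());
}

int main()
{
    string s = "Prusæus, Ægyptians,";  // the failing input from the question
    stripUnicode(s);
    cout << s << endl;                 // prints "Prusus, gyptians,"
}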

EDIT:

For future reference: try the __isascii and iswascii functions.
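
For example, the same predicate can be written with __isascii (Microsoft-specific, declared in <ctype.h>; POSIX spells it isascii, and iswascii is the wide-character counterpart for wchar_t):

#include <ctype.h>   // __isascii on MSVC; POSIX has isascii

bool invalidChar(char c)
{
    return !__isascii(c);   // __isascii(c) is nonzero only for 0x00-0x7F
}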

