python string python-3.x ascii non-ascii-characters

Replacing problematic character(s) in visually identical python strings with their standard equivalents


I am attempting to locate strings in a dataframe column that contain specific words/patterns in Python 3.7.

In this example, I am looking for any string that contains the name of a month or any year from 2016 to 2030.

I am doing this as follows (I'm sure there are better ways to do this, though for now this is what I'm doing):

years = ['2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023', '2024', '2025', '2026', '2027', '2028', '2029', '2030']

months = ['January', 'january', 'February', 'february', 'March', 'march', 'April', 'april', 'May', 'may', 'June', 'june', 'July', 'july', 'August', 'august', 'September', 'september', 'October', 'october', 'November', 'november', 'December', 'december']

hasDate = df.loc[:, 'text'].apply(lambda x: x.split('?')[0].split('. ')[-1]).str.contains('|'.join(years+months))

This works as expected for the most part: most rows whose 'text' string contains either a year or a month return True. (The split operations home in on one particular sentence within the string.)
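As a side note on the "better ways" remark, the two lists can be folded into one case-insensitive pattern. This is a minimal sketch using only the standard `re` module; the names `MONTHS`, `YEAR_RE`, `DATE_PATTERN` and `has_date` are illustrative and not from the original code. The word boundaries (`\b`) and the fully case-insensitive match are added assumptions, so this is slightly stricter in one way and looser in another than the plain substring search above.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Years 2016-2030: 2016-2019, 2020-2029, or 2030.
YEAR_RE = r"20(?:1[6-9]|2[0-9]|30)"

# Non-capturing group of month names or a year, bounded by word boundaries.
DATE_PATTERN = re.compile(
    r"\b(?:" + "|".join(MONTHS) + "|" + YEAR_RE + r")\b",
    re.IGNORECASE,
)


def has_date(text):
    """Return True if text contains a month name or a year from 2016 to 2030."""
    return DATE_PATTERN.search(text) is not None


print(has_date("See you on may 3"))  # ASCII month name, any case
print(has_date("Back in 2015"))      # year outside the range
print(has_date("M\u0430y 3"))        # look-alike character: still no match
```

With a dataframe, something like `df['text'].apply(has_date)` would be one way to use it. Note that the last call still fails to match, so a tidier pattern alone does not solve the problem described next.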

However, there are some instances where the text string visibly contains the name of a month, yet False is returned.

Example:

>>> df.loc[133, 'text']
'May 3'

returns False after the above operation.

>>> string = df.loc[133, 'text']
>>> string == 'May 3'
False

When I copy/paste the text output of 'string' into the python terminal in IntelliJ, it notes that the word 'May' is misspelled.

After searching for ways to identify the precise difference between the two strings, I attempted the following:

>>> ascii('May 3')
"'May 3'"

>>> ascii(string)
"'M\\u0430y 3'"

So clearly the 'a' in the string is a different character (\u0430 rather than the ASCII 'a'), which causes it not to match 'May'.
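To see exactly which characters are the odd ones out, the standard `unicodedata` module can name them. A minimal sketch; the helper name `describe_non_ascii` is made up for illustration:

```python
import unicodedata


def describe_non_ascii(text):
    """List (index, character, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# The string from row 133, written with an explicit escape for clarity.
for entry in describe_non_ascii("M\u0430y 3"):
    print(entry)
```

For this string it reports a single entry at index 1 with code point U+0430, whose Unicode name is CYRILLIC SMALL LETTER A. In other words, the character is a Cyrillic letter that merely looks like the Latin 'a'.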

While I've read about methods for stripping these problematic characters from strings, I can't quite figure out how I might convert this and other problematic strings to their standard equivalents. I apologize in advance if there are similar existing questions out there, though I wasn't able to find one with a working solution to this specific problem.

These strings are sourced via the API of a messaging app, where each message is a self-contained 'object' and the raw text is extracted via msg.raw_text. I iterate through each message and append the raw text to a dataframe column (df['text']). I expect this is where there is an opportunity to intercept these problematic characters, though I'm not quite sure how to solve this short of including the raw 'M\u0430y 3' as one of the items to search for.

Any help is greatly appreciated!


Solution

  • Thanks to help from garlon4 who pointed me in the right direction, I was able to solve this issue using the Unidecode package.

    >>> ascii('May 3')
    "'May 3'"
    
    >>> ascii(string)
    "'M\\u0430y 3'"
    
    >>> from unidecode import unidecode
    >>> ascii(unidecode(string))
    "'May 3'"
    
    >>> unidecode(string) == 'May 3'
    True
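If adding a third-party package is not an option, a rough standard-library fallback is a hand-built translation table passed to `str.translate`. This is only a sketch: the table `HOMOGLYPHS` and the helper `to_latin_lookalikes` are hypothetical names, and the table covers just a handful of Cyrillic letters that resemble Latin ones. Unicode normalization (`unicodedata.normalize` with NFKC) does not help here, because U+0430 has no decomposition to the Latin 'a'.

```python
# Hand-picked Cyrillic letters that look like Latin ones.
# Illustrative assumption only: this table is far from complete.
HOMOGLYPHS = str.maketrans({
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
})


def to_latin_lookalikes(text):
    """Replace the listed look-alike characters with their Latin counterparts."""
    return text.translate(HOMOGLYPHS)


cleaned = to_latin_lookalikes("M\u0430y 3")
print(cleaned == "May 3")
```

Either approach could be applied at the point where msg.raw_text is appended to df['text'], so the search runs on cleaned text. One caveat: both this table and unidecode will also rewrite messages that are genuinely written in Cyrillic, so it may be worth keeping the original text in a separate column.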