python dataframe if-statement special-characters

Check for special characters in text with a Python function


To turn a text file into a dataframe of features, I am writing a custom function. I want the function to detect question marks and exclamation marks in the text input and store the result as a value in a dataframe column. The relevant part of the function looks like this:

discount = ['[%]','[€]','[$]','[£]','korting','deal','discount','reduct','remise','voucher', 
            'descuento', 'rebaja', 'скидка', 'sconto','rabat','alennus','kedvezmény',
            '할인','折扣','ディスカウント','diskon']
data = [text_input.split()]

for word in data:
    if any(char in discount for char in word):
        df['discount'] = 1
    else:
        df['discount'] = 0
for word in data:
    if any(char == '!' for char in word):
        df['exclamation'] = 1
    else:
        df['exclamation'] = 0
for word in data:
    if any(char == '?' for char in word):
        df['question'] = 1
    else:
        df['question'] = 0

The problem is that if the text input contains, for example, 'discount!', the function recognizes neither the '!' nor the word 'discount', so both columns end up as 0. If I remove the '!' from 'discount', it recognizes them both.
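
A minimal reproduction of that behaviour, assuming a shortened keyword list and a hypothetical sample input (neither is from the real function). It suggests that `split()` leaves the '!' attached to the word, and that wrapping the result in a list makes the loop variable the whole token list, so `char` is actually a full token rather than a single character:

```python
discount = ['korting', 'deal', 'discount']  # shortened list, for the demo only
text_input = 'big discount!'                # hypothetical sample input

data = [text_input.split()]  # a list that holds one list of tokens

for word in data:
    # `word` is the whole token list here, so `char` iterates over tokens.
    tokens_seen = list(word)
    found_discount = any(char in discount for char in word)
    found_exclamation = any(char == '!' for char in word)
```

The token 'discount!' is neither an entry of `discount` nor equal to '!', so both checks come out false for this input.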

Therefore I am wondering how I should split text_input so that the '!' is stripped from the words. Or is there a more efficient way to find these characters?

Thanks in advance!


Solution

  • I managed to solve it. This is my updated code, which works:

    import re

    # Split on spaces, '*', '?' and '!' so punctuation no longer sticks to words.
    data_str = re.split('[*?! ]', text_input)
    # Collect every character that is not an ASCII letter or digit.
    data_chr = re.findall('[^A-Za-z0-9]', text_input)

    # 1 if any entry of discount equals one of the tokens, else 0.
    # Entries such as '[%]' are compared literally here, not as regex patterns.
    df['discount'] = int(any(phrase in data_str for phrase in discount))
    df['exclamation'] = int('!' in data_chr)
    df['question'] = int('?' in data_chr)
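
As an alternative to splitting at all, here is a minimal sketch that tests the raw text directly with substring checks. The helper name `extract_flags`, the shortened keyword list, and the choice to write the currency and percent symbols as bare characters instead of '[%]'-style entries are all assumptions made for illustration, not part of the original function:

```python
# Hypothetical helper; the name and the shortened keyword list are assumptions.
DISCOUNT_TERMS = ['%', '€', '$', '£', 'korting', 'deal', 'discount', 'voucher']


def extract_flags(text_input):
    """Return 0/1 flags for discount terms, '!' and '?' in the text."""
    lowered = text_input.lower()
    return {
        # Substring test: 'discount!' and 'discounts' both count as a hit.
        'discount': int(any(term in lowered for term in DISCOUNT_TERMS)),
        'exclamation': int('!' in text_input),
        'question': int('?' in text_input),
    }


flags = extract_flags('Huge discount! Only today?')
```

Each value in `flags` could then be assigned to the matching dataframe column, for example `df['discount'] = flags['discount']`. One trade-off of substring matching is that a short term such as 'deal' also matches inside longer words like 'ideal', so exact token matching may still be preferable when that matters.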