I am writing a custom function that turns a text file into a DataFrame of features. I want the function to find discount terms, question marks and exclamation marks in the text input and store each result as a value in a DataFrame column. The relevant part of the function looks like this:
discount = ['[%]','[€]','[$]','[£]','korting','deal','discount','reduct','remise','voucher',
            'descuento', 'rebaja', 'скидка', 'sconto','rabat','alennus','kedvezmény',
            '할인','折扣','ディスカウント','diskon']
data = [text_input.split()]
for word in data:
    if any(char in discount for char in word):
        df['discount'] = 1
    else:
        df['discount'] = 0
for word in data:
    if any(char == '!' for char in word):
        df['exclamation'] = 1
    else:
        df['exclamation'] = 0
for word in data:
    if any(char == '?' for char in word):
        df['question'] = 1
    else:
        df['question'] = 0
The problem is that if the text input contains, for example, 'discount!', the function recognizes neither the '!' nor the word 'discount', so both columns end up as 0. If I remove the '!' from 'discount', it recognizes them both.
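A minimal reproduction of this behaviour, assuming a made-up sample `text_input` and a shortened `discount` list (both are illustrative only). It suggests that, because `text_input.split()` is wrapped in an extra list, the loop runs once over the whole list of tokens, so `char` is a whole token such as 'discount!' rather than a single character:

```python
# Hypothetical sample input and a shortened keyword list, for illustration only
text_input = 'Big discount! Today only'
discount = ['korting', 'deal', 'discount']

tokens = text_input.split()   # split() only breaks on whitespace
data = [tokens]               # extra brackets: a list holding one list

for word in data:             # runs once; `word` is the full token list
    # Each `char` is a whole token, so these are exact comparisons:
    exact_match = any(char in discount for char in word)   # is 'discount!' in the list?
    has_bang = any(char == '!' for char in word)           # is 'discount!' equal to '!'?

# A substring test on the same token behaves differently
substring_match = any(term in 'discount!' for term in discount)

print(tokens)
print(exact_match, has_bang, substring_match)
```

Both exact comparisons fail for 'discount!', while the substring test finds 'discount' inside it.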
Therefore, I am wondering how I need to split my text_input to make sure the '!' is stripped from the words. Or is there a more efficient way to find these characters?
Thanks in advance!
I managed to solve it. This is my updated code, which works:
import re

# Split on '*', '?', '!' and spaces so punctuation no longer sticks to the words
data_str = [re.split('[*?*! ]', text_input)]
# Collect every character that is not an ASCII letter or digit
data_chr = [re.findall('[^A-Za-z0-9]', text_input)]
for word in data_str:
    if any(phrase in word for phrase in discount):
        df['discount'] = 1
    else:
        df['discount'] = 0
for word in data_chr:
    if '!' in word:
        df['exclamation'] = 1
    else:
        df['exclamation'] = 0
for word in data_chr:
    if '?' in word:
        df['question'] = 1
    else:
        df['question'] = 0
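As for the "more efficient way" part of the question, here is a minimal sketch that skips splitting altogether and uses plain substring checks on the whole text. The helper name `text_flags` and the shortened `DISCOUNT_TERMS` list are hypothetical and only for illustration. The sketch also assumes the currency symbols are stored as plain characters ('%' instead of '[%]'), since a bracketed entry only matches that literal bracketed string in an `in` test:

```python
# Hypothetical shortened term list; symbols are plain characters here (assumption)
DISCOUNT_TERMS = ['%', '€', '$', '£', 'korting', 'deal', 'discount', 'скидка', '折扣']

def text_flags(text_input, terms=DISCOUNT_TERMS):
    """Return 0/1 flags for discount terms, '!' and '?' found in text_input."""
    lowered = text_input.lower()   # so 'DISCOUNT' and 'Discount' also match
    return {
        'discount': int(any(term in lowered for term in terms)),
        'exclamation': int('!' in text_input),
        'question': int('?' in text_input),
    }

flags = text_flags('Huge discount!')
print(flags)
```

The resulting values could then be assigned to the matching df columns. One trade-off to keep in mind: substring matching also fires on longer words that contain a term (for example 'deal' inside 'ideal'), so splitting into tokens first may still be preferable if that matters.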