If I have a column of data of type string in an incoming Azure ML dataset that contains HTML tags screwing up my results, how can I remove those tags?
Like this:
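    def azureml_main(dataframe1 = None, dataframe2 = None):
        # Replace each HTML tag in the 'text' column with a space; the result goes in a new column labeled 1.
        # On pandas 0.23 or later, also pass regex=True so the pattern is treated as a regular expression.
        dataframe1[1] = dataframe1['text'].str.replace('<[^<]+?>', ' ', case=False)
        return dataframe1,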
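To try the same idea outside Azure ML, here is a minimal sketch that assumes a recent pandas version, where `regex=True` must be passed explicitly. The column name `text_clean` and the sample rows are made up for illustration; they are not part of the original answer.

```python
import pandas as pd

# Same pattern as above: a '<', then one or more non-'<' characters, then '>'.
TAG_PATTERN = r'<[^<]+?>'

def azureml_main(dataframe1=None, dataframe2=None):
    # 'text_clean' is a hypothetical output column name chosen for this sketch.
    # regex=True makes pandas treat the pattern as a regular expression
    # rather than a literal string.
    dataframe1['text_clean'] = dataframe1['text'].str.replace(
        TAG_PATTERN, ' ', regex=True
    )
    # Azure ML expects a sequence of dataframes, hence the trailing comma.
    return dataframe1,

# Made-up sample data, including a missing value.
sample = pd.DataFrame({'text': ['<p>Hello <b>world</b></p>', 'no tags here', None]})
result, = azureml_main(sample)
print(result['text_clean'].tolist())
```

Each tag is replaced by a space, so you may want to collapse repeated whitespace afterwards. Rows with a missing value stay missing rather than raising an error.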
Remember to precede the Execute Python Script step with a Clean Missing Data step, and change its action to remove the entire row (if appropriate). This is important because the Execute Python Script step cannot return an empty dataframe. Whether removing rows is appropriate depends on your data, which only you know.
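If you want to see what removing entire rows does before wiring up the module, a rough pandas equivalent looks like this. It is only a sketch: it is not what the Clean Missing Data step runs internally, and the helper name `drop_missing_text` and the sample rows are hypothetical.

```python
import pandas as pd

def drop_missing_text(df, column='text'):
    # Hypothetical helper: keep only the rows where the given column has a value,
    # then renumber the index so it is contiguous again.
    return df.dropna(subset=[column]).reset_index(drop=True)

# Made-up sample data with one missing value.
sample = pd.DataFrame({'text': ['<p>a</p>', None, 'b']})
cleaned = drop_missing_text(sample)
print(len(sample), len(cleaned))
```

If every row has a missing value, the result is empty, which is the case to watch for given the restriction described above.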
Let me also point out that the Preprocess Text step allows you to apply a custom regular expression. That is another alternative that might be right for your situation.