
How to strip HTML from a text column in the Azure ML Execute Python Script step


If I have a column of data of type string in an incoming Azure ML dataset that contains HTML tags screwing up my results, how can I remove those tags?


Solution

  • Like this:

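    def azureml_main(dataframe1 = None, dataframe2 = None):
      # Replace every HTML tag in the 'text' column with a space and store the result in a new column named 1.
      # regex=True is needed on newer pandas, where str.replace treats the pattern literally by default (requires pandas 0.23+).
      dataframe1[1] = dataframe1['text'].str.replace('<[^<]+?>', ' ', regex=True)
      return dataframe1,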
    

    Remember to precede the Execute Python Script step with a Clean Missing Data step and change the action to remove the entire row (if appropriate). This is important because the Execute Python Script step cannot return an empty dataframe. Whether dropping rows is appropriate depends on your data, which only you know.

    Let me also point out that the Preprocess Text step lets you apply a custom regular expression. That is an alternative that might be right for your situation.
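
If you want to check the regular expression outside Azure ML before wiring it into the pipeline, here is a minimal local sketch. It assumes a recent pandas and a column named 'text' as in the answer. The helper name strip_html and the sample data are made up for illustration, and the whitespace cleanup is an extra step that is not part of the original answer. Dropping rows with missing text in pandas stands in for the Clean Missing Data step. Keep in mind that a regex only approximates HTML parsing, so unusual markup may slip through.

```python
import pandas as pd

# Same pattern as in the answer: '<', then a run of non-'<' characters, then '>'
TAG_PATTERN = r'<[^<]+?>'

def strip_html(frame, column='text'):
    """Return a copy of frame with HTML tags in `column` replaced by spaces.

    Hypothetical helper for local testing, not part of Azure ML.
    """
    # Drop rows whose text is missing, similar in effect to Clean Missing Data
    out = frame.dropna(subset=[column]).copy()
    out[column] = (
        out[column]
        .str.replace(TAG_PATTERN, ' ', regex=True)
        # Extra step (an assumption, not in the answer): collapse repeated whitespace
        .str.replace(r'\s+', ' ', regex=True)
        .str.strip()
    )
    return out

# Made-up sample data: one tagged string, one missing value, one plain string
sample = pd.DataFrame({'text': ['<p>Hello <b>world</b></p>', None, 'no tags here']})
cleaned = strip_html(sample)
print(cleaned['text'].tolist())
```

Once the output looks right, the same str.replace call can go inside azureml_main.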