machine-learning azure-machine-learning-service

Machine Learning: Classifying web pages as address or no-address by content


Currently I am using Azure Machine Learning. I train my model on two types of data: web page content with an address and web page content without an address.

TRAINING INPUT:

e.g.:
this is a address no 24/5    address
this is no address    no-address 

I am using two-class Bayesian classification to classify them. Should I use a different method?

GIVEN INPUT:

e.g.:
This a address 12/4 

OBTAINED OUTPUT:

e.g.:
content    score    probability
This a address 12/4    no-address    0.54

EXPECTED OUTPUT:

e.g.:
content    score    probability
This a address 12/4    address    with higher probability 

My experiment looks like this:

[Image: screenshot of the experiment]


Solution

  • You need to use the Feature Hashing module to convert the text into word features. This, however, might not be enough, as raw words are not good features for your problem. You may want to preprocess the text and create more useful features (for example, detecting the presence of zip codes, positions of numbers, etc.).

    Edit: Using the raw text column as a single feature will not get you anywhere. You don’t want your model to memorize the addresses as they are written. Instead, it needs to learn patterns in the text that provide evidence for address vs. non-address instances.

    When you use feature hashing, the text column is transformed into multiple word (or n-gram) columns, where the values represent counts of those words in each text input. The problem with doing this on raw text is overfitting. For example, these two addresses have no words in common: “100 broadway st, GA” and “200 main rd, NY”, yet they clearly have a similar structure. One way to create ‘useful features’ is to replace the words with tags, “#NUM #TXT, #STATE”, and then use feature hashing (bi-grams) to create features such as “#NUM #TXT” and “, #STATE”. These bi-grams count as evidence in both addresses and suggest some similarity between them (compared to non-address instances). Of course, this is an oversimplification of the problem, but I hope you see why you can’t use the raw text or plain feature hashing.
    You can still use the Azure ML modules for feature hashing, training, and scoring, and add an ‘Execute R Script’ module to do the text processing before training.

    Edit: Example of feature hashing usage: http://gallery.azureml.net/Details/cf65bf129fee4190b6f48a53e599a755
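
To make the tagging idea above more concrete, here is a minimal sketch of the preprocessing step. It is written in Python rather than R purely for illustration, and everything in it is an assumption: the function names (`tag_tokens`, `bigrams`, `hash_features`), the tiny `STATES` list, the tokenizing regex, and the use of CRC32 as the hash are all hypothetical stand-ins. It is not what the Feature Hashing module does internally; it only shows why tagged bi-grams make the two example addresses look alike.

```python
import re
import zlib
from collections import Counter

# Hypothetical, deliberately tiny list of state abbreviations (illustration only).
STATES = {"GA", "NY", "CA", "TX"}


def tag_tokens(text):
    """Replace tokens with coarse tags: #NUM, #STATE, #TXT, or the punctuation mark itself."""
    # Words, numbers (optionally like 24/5), or single punctuation characters.
    tokens = re.findall(r"[A-Za-z]+|\d+(?:/\d+)?|[^\sA-Za-z\d]", text)
    tags = []
    for tok in tokens:
        if tok[0].isdigit():
            tag = "#NUM"
        elif tok in STATES:
            tag = "#STATE"
        elif tok.isalpha():
            tag = "#TXT"
        else:
            tag = tok  # keep punctuation such as "," as its own token
        # Collapse runs of plain words so "broadway st" becomes a single #TXT.
        if tag == "#TXT" and tags and tags[-1] == "#TXT":
            continue
        tags.append(tag)
    return tags


def bigrams(tags):
    """Return adjacent tag pairs joined by a space."""
    return [f"{a} {b}" for a, b in zip(tags, tags[1:])]


def hash_features(grams, n_buckets=1024):
    """Hashing trick: count n-grams per bucket (CRC32 is an arbitrary stand-in hash)."""
    vec = Counter()
    for gram in grams:
        vec[zlib.crc32(gram.encode("utf-8")) % n_buckets] += 1
    return vec


if __name__ == "__main__":
    for text in ["100 broadway st, GA", "200 main rd, NY", "This a address 12/4"]:
        tags = tag_tokens(text)
        print(text, "->", tags, "->", bigrams(tags))
```

With this sketch, both “100 broadway st, GA” and “200 main rd, NY” are tagged as `#NUM #TXT , #STATE` and therefore produce identical hashed bi-gram counts, even though they share no words. In an actual experiment the tagged text, not the raw text, would be what you pass on to feature hashing, and the tag set would need to be much richer than this.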