
VowpalWabbit incorrect predictions. How to properly prepare learning data?


I'm trying to train VW to predict house prices based on the number of bedrooms, bathrooms, area, and other features. Example lines from my training data:

68000 '51-OMAHA-CT| city=SACRAMENTO zip=95823 state=CA beds:3 baths:1 sq__ft:1167 type=Residential sale_date=Wed-May-21-00-00-00-EDT-2008 latitude=38.478902 longitude=-121.431028
56333 '3526-HIGH-ST| city=SACRAMENTO zip=95838 state=CA beds:2 baths:1 sq__ft:836 type=Residential sale_date=Wed-May-21-00-00-00-EDT-2008 latitude=38.631913 longitude=-121.434879
68790 '2796-BRANCH-ST| city=SACRAMENTO zip=95815 state=CA beds:2 baths:1 sq__ft:796 type=Residential sale_date=Wed-May-21-00-00-00-EDT-2008 latitude=38.618305 longitude=-121.443839

Each line has the form PRICE STREET | ... There are about 500 records in total. My test data (about 500 records as well) looks like this:

'51-OMAHA-CT| city=SACRAMENTO zip=95823 state=CA beds:3 baths:1 sq__ft:1167 type=Residential sale_date=Wed-May-21-00-00-00-EDT-2008 latitude=38.478902 longitude=-121.431028
'3526-HIGH-ST| city=SACRAMENTO zip=95838 state=CA beds:2 baths:1 sq__ft:836 type=Residential sale_date=Wed-May-21-00-00-00-EDT-2008 latitude=38.631913 longitude=-121.434879
'2796-BRANCH-ST| city=SACRAMENTO zip=95815 state=CA beds:2 baths:1 sq__ft:796 type=Residential sale_date=Wed-May-21-00-00-00-EDT-2008 latitude=38.618305 longitude=-121.443839

Predicting gives these values:

4819.900391 51-OMAHA-CT
4609.826172 3526-HIGH-ST
4537.140137 2796-BRANCH-ST

These predictions aren't correct. Is there a problem with my training data? I'm still confused about the | character and where to place features.


Solution

  • When you construct a feature as city=SACRAMENTO, VW interprets it as a string feature named city=SACRAMENTO and assigns it an implicit value of 1.0. The whole string city=SACRAMENTO is hashed, and the hash forms the index for the feature.

    When you construct a feature as beds:2, VW interprets it as a feature named beds with a value of 2.0. Only beds is hashed to form the index.

    So think of features in the form __=__ as enums, or values from a discrete set. For continuous features, use the name:value form with a float value instead.

    Using the __=__ format seems fine for city names, but when you use the same format for latitude and longitude, it is very unlikely that another example will share the exact same lat/lng string, so the model can almost never use that feature in a prediction. It seems to me that lat/lng should be float-valued features.

    The sale_date has a similar problem. This is more of a feature engineering question, but you may want to split this feature into year, month, day of the week, etc.
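
As a rough illustration of the points above, here is a minimal Python sketch. It is not part of VW: `parse_feature` is a hypothetical helper that only approximates how a single token is read (name plus value, with an implicit 1.0), and `rewrite_line` is a hypothetical helper that rewrites the question's lines so that latitude and longitude become float-valued features and sale_date is split into discrete parts. The new feature names (sale_dow, sale_month, sale_year) are assumptions chosen for the example.

```python
def parse_feature(token):
    """Approximate how VW reads one token: 'name:value' or a bare name (value 1.0)."""
    name, sep, value = token.rpartition(":")
    if sep:
        try:
            return name, float(value)
        except ValueError:
            pass  # not numeric after ':', fall through and treat as a bare name
    return token, 1.0


def rewrite_line(line):
    """Rewrite one example line from the question (hypothetical helper).

    - latitude=.. / longitude=..  ->  latitude:.. / longitude:..  (float-valued)
    - sale_date=Wed-May-21-...-2008  ->  sale_dow=Wed sale_month=May sale_year=2008
    Everything else is passed through unchanged.
    """
    head, _, feats = line.partition("|")
    out = []
    for token in feats.split():
        key, sep, value = token.partition("=")
        if sep and key in ("latitude", "longitude"):
            out.append(f"{key}:{value}")
        elif sep and key == "sale_date":
            parts = value.split("-")  # assumes the date layout shown in the question
            out.append(f"sale_dow={parts[0]}")
            out.append(f"sale_month={parts[1]}")
            out.append(f"sale_year={parts[-1]}")
        else:
            out.append(token)
    return f"{head}| " + " ".join(out)


if __name__ == "__main__":
    example = (
        "68000 '51-OMAHA-CT| city=SACRAMENTO zip=95823 state=CA beds:3 baths:1 "
        "sq__ft:1167 type=Residential sale_date=Wed-May-21-00-00-00-EDT-2008 "
        "latitude=38.478902 longitude=-121.431028"
    )
    print(parse_feature("city=SACRAMENTO"))  # whole string is the name, value 1.0
    print(parse_feature("beds:3"))           # name 'beds', value 3.0
    print(rewrite_line(example))
```

Whether the date parts should stay discrete (as here) or become numeric is a modeling choice; the sketch only shows the mechanical difference between the two feature forms.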