parsing, data-mining, ontology, protege

Data extraction - Need Ideas



Consider there are n rows of text similar to the ones below:

  • "Sony KDL46NX720 BRAVIA 46" 3D LED Backlit HDTV - 1080p, 1920 x 1080, 16:9, 120Hz, HDMI, USB, WiFi Ready » for $1148.99 at Tiger Direct"

  • "Samsung NV40 10.5 MP Digital Camera - Silver - 3x Zoom Lens » for $64.99 at eBay"

  • "Gateway NV57H27u 15.6" Notebook, Intel Core i3-2310M (2.10GHz), 4GB DDR3 Memory, 500GB HDD, DVD Super Multi-Drive, Windows 7 Home Premium 64-Bit (Pink) - LX.WZF02.002 » for $399.99 at Buy.com"

I would like to parse these strings and classify each of them as "TV, camera, laptop" etc.

The text attributes may or may not be similar.


How can this be comprehensively done?

What code/tools should I use?

What language?

I do not want to do a keyword search. Can these strings be classified using class/attribute logic?

Can I use Protege to build the class/sub-class hierarchy?


I am totally new to this field of data-mining. So excuse my ignorance!

Thanks in advance.


Solution

  • Regular expressions; even JavaScript can do the work.

    EDIT:

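       var criteria = {
          camera : {
             // "i" flag: product titles write "Camera" with a capital C
             identifier : /camera/i ,
             resolution : /(\d+)\s*x\s*(\d+)/ ,
             // "$" must be escaped to match a literal dollar sign
             value : /\$(\d+(?:\.\d+)?)/ ,
             // ...
          },
          notebook : {
             identifier : /notebook/i ,
             ram : /(\d+)GB\s*(DDR\d)/ ,
             // ...
          }
          // ...
       }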
    
    

    Then write a simple engine that uses this structure to analyze each line.
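
As a rough illustration of such an engine, here is a minimal sketch in JavaScript. It is only one possible shape, not a complete solution: the function name `classify`, the `offer` pattern for the trailing "» for $PRICE at STORE" part, and the `tv` and `zoom` entries are all assumptions made up for this example, and the rule table is redefined here so the snippet runs on its own.

```javascript
// Hypothetical rule table, modeled on the "criteria" structure above.
// Each category has an "identifier" pattern plus optional attribute patterns.
const criteria = {
  camera: {
    identifier: /camera/i,
    zoom: /(\d+)x\s*zoom/i,
  },
  notebook: {
    identifier: /notebook/i,
    ram: /(\d+)GB\s*(DDR\d)/i,
  },
  tv: {
    identifier: /hdtv/i,
    resolution: /(\d+)\s*x\s*(\d+)/,
  },
};

// Assumed shape of the tail shared by the sample lines: " » for $PRICE at STORE".
const offer = /\s*»\s*for\s*\$(\d+(?:\.\d+)?)\s*at\s*(.+)$/;

function classify(line) {
  const result = { category: null, attributes: {}, price: null, store: null };

  // Pull out price and store if the line ends with the expected tail.
  const tail = offer.exec(line);
  if (tail) {
    result.price = Number(tail[1]);
    result.store = tail[2].trim();
  }

  // The first category whose identifier matches wins.
  for (const [category, rules] of Object.entries(criteria)) {
    if (!rules.identifier.test(line)) continue;
    result.category = category;
    // Every other rule is an attribute; keep its capture groups.
    for (const [name, pattern] of Object.entries(rules)) {
      if (name === "identifier") continue;
      const hit = pattern.exec(line);
      if (hit) result.attributes[name] = hit.slice(1);
    }
    break;
  }
  return result;
}

const sample =
  "Samsung NV40 10.5 MP Digital Camera - Silver - 3x Zoom Lens » for $64.99 at eBay";
console.log(classify(sample));
```

Note that the identifier step is still a keyword match at heart; the table only makes it easier to extend. Lines that match no identifier come back with `category: null`, which is where a richer knowledge base would have to take over.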

    EDIT 2:

    This is not easy at all, because you have to feed the engine some sort of knowledge database, but it is possible. You can populate it from pages like this one:

    http://en.wikipedia.org/wiki/List_of_CPU_power_dissipation

    However, it is work for more than one person, or for more than one day, depending on how much intelligence you want in your code.