
What is a good strategy for finding colors, units, and sizes using OpenNLP?


Say we have a string like this:

4 pallets of books with a weight of 437 kg. The pallets measure 80 x 120 x 120 cm each and are protected with red shrinkwrap.

What is the best approach to extracting information like this (especially color, weight, and sizes) using OpenNLP? I am thinking about a customized corpus and training my own models, but I have no idea which approach is best to start with. The desired result would look something like this:

<palletAmount>4</palletAmount> pallets of <product>books</product> with a weight of <weight>437</weight> <weightUnit>kg</weightUnit>. The pallets measure <height>80</height> x <width>120</width> x <length>120</length> <measurementUnit>cm</measurementUnit> each and are protected with <color>red</color> shrinkwrap.

Solution

  • You've listed only one approach (custom training with OpenNLP), so I don't know what other options you are considering. That approach is almost certainly your best one, unless the phrases you're searching for are (a) regular and (b) distinct from other phrases, in which case you can use regular expressions.

    Many packages let you train a model and tag text: OpenNLP is one, Stanford NER is another. They use different training approaches, which will affect your results. Once you have your training data, though, you can try it with different engines and compare how each performs.
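
To illustrate the regular-expression fallback mentioned above, here is a minimal sketch in plain Java (no OpenNLP involved). The class name, the patterns, and the short unit and color lists are all assumptions made for this example, not part of any library, and they only cover phrasing close to the sample sentence. The height, width, length order simply follows the annotated example in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts weight, dimensions and color with regular expressions.
final class ShipmentRegexSketch {

    // Assumed, non-exhaustive unit list: a number followed by a weight unit.
    private static final Pattern WEIGHT = Pattern.compile(
            "(\\d+(?:\\.\\d+)?)\\s*(kg|g|lbs?)\\b", Pattern.CASE_INSENSITIVE);

    // Three integers separated by "x", followed by a length unit.
    private static final Pattern DIMENSIONS = Pattern.compile(
            "(\\d+)\\s*x\\s*(\\d+)\\s*x\\s*(\\d+)\\s*(mm|cm|m|in)\\b",
            Pattern.CASE_INSENSITIVE);

    // Assumed, non-exhaustive color list, matched as whole words only.
    private static final Pattern COLOR = Pattern.compile(
            "\\b(red|green|blue|black|white|yellow)\\b", Pattern.CASE_INSENSITIVE);

    // Returns the first match of each pattern, keyed like the tags in the question.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();

        Matcher weight = WEIGHT.matcher(text);
        if (weight.find()) {
            fields.put("weight", weight.group(1));
            fields.put("weightUnit", weight.group(2));
        }

        Matcher dims = DIMENSIONS.matcher(text);
        if (dims.find()) {
            // Order (height, width, length) is taken from the question's example.
            fields.put("height", dims.group(1));
            fields.put("width", dims.group(2));
            fields.put("length", dims.group(3));
            fields.put("measurementUnit", dims.group(4));
        }

        Matcher color = COLOR.matcher(text);
        if (color.find()) {
            fields.put("color", color.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "4 pallets of books with a weight of 437 kg. The pallets measure "
                + "80 x 120 x 120 cm each and are protected with red shrinkwrap.";
        System.out.println(extract(text));
    }
}
```

This only works while the input stays as regular as the sample. Once the wording varies ("weighing roughly half a ton", "wrapped in crimson film"), the patterns stop matching, and a trained tagger becomes the better option.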