I am trying to optimize the performance of a search engine by preprocessing all the results. We have around 50k search terms. I am planning to search these 50k terms before hand and save it in memory (memcached/redis). Searching for all 50k terms takes more than a day in my case since we do deep semantic search. So I am planning to distribute the searching (preprocessing) over several nodes. I was considering to use hadoop. My input size is very less. Probably less than 1MB even though total search term is over 50k. But searching each term takes over a min i.e more computation oriented than data oriented. So I am wondering if I should use Hadoop or build my own distributed system. I remember reading that hadoop is used mainly if input is very huge. Please suggest me on how to go about this.
And I read hadoop reads data in block size. i.e 64mb for each jvm/mapper. Is it possible to make it number of lines instead of block size. Example: Every mapper gets 1000 lines instead of 64mb. Is it possible to achieve this.
Hadoop can definitely handle this task. Yes, much of Hadoop was designed to handle jobs with very large input or output data, but that's not it's sole purpose. It can work well for close to any type of distributed batch processsing. You'll want to take a look at NLineInputFormat; it allows you to split your input up based on exactly what you wanted, number of lines.