I am trying to design a distributed, scalable pipeline based on UIMA. How should I decide between using UIMA DUCC or UIMA on Hadoop? What would I be missing out on if I built it on UIMA DUCC rather than Hadoop, or vice versa?
One dimension is application characteristics. Hadoop will have a big advantage for I/O-intensive applications, since it moves the computation to the data. DUCC should have a big advantage for large-memory applications that need to run multiple pipeline copies in threads of the same process, sharing one copy of the in-memory resources, to achieve high CPU utilization.
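For concreteness, here is a rough sketch of a DUCC job specification for that second case. The property names follow DUCC's job submission interface as I recall it, and the descriptors, sizes, and counts are placeholders, so check the DuccBook for the exact options in your DUCC release.

```properties
# Sketch of a DUCC job spec: one large-heap process, several pipeline threads sharing it.
description            = Large in-memory model, many pipeline copies per process
driver_descriptor_CR   = descriptors.MyCollectionReader
process_descriptor_AE  = descriptors.MyAggregateAE
# Heap per job process, in GB (placeholder value)
process_memory_size    = 30
# Number of pipeline copies (threads) sharing that heap (placeholder value)
process_thread_count   = 8
scheduling_class       = normal
classpath              = lib/*
```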
Another dimension is taking advantage of UIMA vs taking advantage of Hadoop. DUCC builds on base UIMA capabilities, providing many scale-out options, built-in performance metrics, and debugging support, all based on core UIMA components. The more complex a UIMA pipeline, the bigger the advantage for DUCC; for example, complex processing flows can be implemented directly in DUCC but would likely have to be transformed for map-reduce.
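As a small illustration of what "implemented directly" means, here is a sketch of a fixed two-stage aggregate built with uimaFIT (uimaFIT and the stage classes are my assumptions, not part of your question). DUCC can deploy a descriptor like this as-is and scale it out, whereas a map-reduce port would typically mean splitting or collapsing the stages by hand.

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

public class AggregateFlowExample {

    // Placeholder first stage (e.g., tokenization); real logic omitted.
    public static class StageOne extends JCasAnnotator_ImplBase {
        @Override
        public void process(JCas jcas) { /* ... */ }
    }

    // Placeholder second stage that depends on the output of the first.
    public static class StageTwo extends JCasAnnotator_ImplBase {
        @Override
        public void process(JCas jcas) { /* ... */ }
    }

    // A fixed two-stage aggregate descriptor: one UIMA pipeline, one deployable unit.
    public static AnalysisEngineDescription buildAggregate() throws ResourceInitializationException {
        return AnalysisEngineFactory.createEngineDescription(
                AnalysisEngineFactory.createEngineDescription(StageOne.class),
                AnalysisEngineFactory.createEngineDescription(StageTwo.class));
    }
}
```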
On the other hand, for a team with sufficient Hadoop expertise, a relatively simple UIMA analytic can be wrapped in a mapper and easily integrated into an existing Hadoop shop without having to learn much about UIMA.
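A rough sketch of what that integration often looks like: a Hadoop Mapper that instantiates a UIMA analysis engine once per task and pushes each input record through it. The descriptor path and the emitted output are placeholders I made up for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaAnalyticMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private AnalysisEngine engine;
    private JCas jcas;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Parse the AE descriptor shipped with the job; the path is a placeholder.
            ResourceSpecifier spec = UIMAFramework.getXMLParser()
                    .parseResourceSpecifier(new XMLInputSource("MyAnalysisEngine.xml"));
            engine = UIMAFramework.produceAnalysisEngine(spec);
            // One reusable CAS per mapper task.
            jcas = engine.newJCas();
        } catch (Exception e) {
            throw new IOException("Could not initialize UIMA analysis engine", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            jcas.reset();
            jcas.setDocumentText(value.toString());
            engine.process(jcas);
            // Emit something derived from the CAS; here just the analyzed text as a stand-in.
            context.write(key, new Text(jcas.getDocumentText()));
        } catch (Exception e) {
            throw new IOException("UIMA processing failed for record " + key, e);
        }
    }

    @Override
    protected void cleanup(Context context) {
        if (engine != null) {
            engine.destroy();
        }
    }
}
```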