caching, talend, impala

Difference between tHashOutput/tHashInput and tBufferOutput/tBufferInput in Talend


I don't clearly understand the difference between using tHash and tBuffer components in Talend.

I am looking at storing the result of a tMap in an Impala table while also keeping a copy in memory (cache), then performing other transformations on that copy and finally writing it to a table.


Solution

  • They can be used for similar purposes, but there are a few distinct differences between the hash and the buffer components.

    Both store the result set in memory, but the hash components let you keep multiple hash objects and retrieve a specific one. This is useful if you need to temporarily store several result sets and then join them back together, for example when transforming multiple data sources and then writing the data out as a single entry to your target. You can also append the output of one hash to another so that both write to the same data set.
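
    As a rough illustration of the "several named stores, read back one at a time" idea, here is a minimal plain-Java sketch. It is only a conceptual model: `NamedRowStores`, `append` and `read` are hypothetical names made up for this example and are not Talend's generated code or API.

    ```java
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Conceptual model only (hypothetical names, not Talend's API).
    // Each named store plays the role of one hash output that a hash input can be pointed at.
    class NamedRowStores {
        private final Map<String, List<String[]>> stores = new LinkedHashMap<>();

        // Add one row to the named store, creating the store on first use.
        // Later calls with the same name append to the same data set.
        void append(String storeName, String... row) {
            stores.computeIfAbsent(storeName, k -> new ArrayList<>()).add(row);
        }

        // Return only the rows held in the requested store (empty if it does not exist).
        List<String[]> read(String storeName) {
            return stores.getOrDefault(storeName, new ArrayList<>());
        }

        public static void main(String[] args) {
            NamedRowStores mem = new NamedRowStores();
            mem.append("customers", "1", "Alice");
            mem.append("orders", "100", "1", "25.00");
            mem.append("customers", "2", "Bob"); // a second flow appending to "customers"

            System.out.println("customers rows: " + mem.read("customers").size());
            System.out.println("orders rows: " + mem.read("orders").size());
        }
    }
    ```

    The point of the sketch is that the two stores stay separate and addressable by name, which is what makes it possible to join them back together later in the job.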

    The buffer components have only a single, append-only option: multiple buffer outputs all write into the same shared buffer. This makes them less flexible than the hash components, but they are still useful for many tasks.

    What the buffer components offer over the hash components is that the buffer can be read by a parent job, so a child job can send data back up to the job that called it. The same mechanism is used if you want to deploy your Talend job as a web service and return data from it, as shown in this tutorial.
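
    The buffer behaviour can be pictured with a similarly small plain-Java sketch: one shared, append-only list per job run, handed back to whoever started the job. Again, this is an assumption-laden model rather than Talend's real classes. `BufferedChildJob`, `bufferOutput` and `run` are hypothetical names, and returning the rows as a `String[][]` is a choice made for this example.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Conceptual model only (hypothetical names, not Talend's API).
    class BufferedChildJob {
        // A single shared buffer: there is no way to address separate named stores.
        private final List<String[]> sharedBuffer = new ArrayList<>();

        // Stand-in for a buffer output: every flow appends to the same buffer.
        private void bufferOutput(String... row) {
            sharedBuffer.add(row);
        }

        // Runs the "child job" and returns the buffered rows to the caller (the "parent job").
        String[][] run() {
            bufferOutput("flowA", "1"); // first flow writing to the buffer
            bufferOutput("flowB", "2"); // second flow, appended to the same buffer
            return sharedBuffer.toArray(new String[0][]);
        }

        public static void main(String[] args) {
            // The parent side: start the child and read back whatever it buffered.
            String[][] rows = new BufferedChildJob().run();
            for (String[] row : rows) {
                System.out.println(String.join(";", row));
            }
        }
    }
    ```

    Compared with the named stores of the hash components, the only operations here are "append" and "hand everything back", which is the trade-off described above.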

    Another option in a similar space is the tCache family of components, which a few other posters here quite like (although I have yet to need them). They are aimed more at data volumes that can't easily be processed in memory but that need to be held in full for some reason rather than being iterated over. They work like the hash components but will also spill to disk if needed.
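
    To make "spill to disk if needed" concrete, here is a minimal plain-Java sketch of the general technique: keep rows in memory up to a limit, then write the overflow to a temporary file. It is not how the tCache components are implemented (I have not looked at their internals). `SpillingRowCache` and its methods are hypothetical names, and the sketch assumes rows are single-line strings.

    ```java
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Conceptual model only (hypothetical names). Assumes rows contain no line breaks.
    class SpillingRowCache {
        private final int maxRowsInMemory;
        private final List<String> inMemory = new ArrayList<>();
        private Path spillFile;             // created lazily on first overflow
        private BufferedWriter spillWriter; // open while rows are being spilled

        SpillingRowCache(int maxRowsInMemory) {
            this.maxRowsInMemory = maxRowsInMemory;
        }

        // Keep the row in memory while there is room, otherwise append it to the spill file.
        void add(String row) {
            if (inMemory.size() < maxRowsInMemory) {
                inMemory.add(row);
                return;
            }
            try {
                if (spillWriter == null) {
                    spillFile = Files.createTempFile("row-cache", ".txt");
                    spillFile.toFile().deleteOnExit();
                    spillWriter = Files.newBufferedWriter(spillFile, StandardCharsets.UTF_8);
                }
                spillWriter.write(row);
                spillWriter.newLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // Visit every row in insertion order: in-memory rows first, then the spilled ones,
        // streaming the file line by line so the spilled rows are never all loaded at once.
        void forEach(Consumer<String> action) {
            inMemory.forEach(action);
            if (spillWriter == null) {
                return;
            }
            try {
                spillWriter.flush();
                try (BufferedReader reader = Files.newBufferedReader(spillFile, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        action.accept(line);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        boolean hasSpilled() {
            return spillFile != null;
        }

        public static void main(String[] args) {
            SpillingRowCache cache = new SpillingRowCache(2);
            for (int i = 0; i < 5; i++) {
                cache.add("row" + i);
            }
            cache.forEach(System.out::println);
            System.out.println("spilled to disk: " + cache.hasSpilled());
        }
    }
    ```

    A real component would also need to close the writer and clean up the file when the job ends; the sketch leaves that to `deleteOnExit` to stay short.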

    An embedded H2 database could also be run in memory to provide a similar effect with quite a lot more options, but at the cost of added complexity in your job.