hadoop, hive, hdfs, avro, distcp

HDFS intracluster copy with selected columns


I am using Avro files to store data in HDFS. I need to copy the data for selected columns from one Avro file to another location in the same cluster, together with its own schema file (describing only the selected columns). How can I do that? Is it possible with Hive, or is there a utility in HDFS that can help me do it?

This is required because one group must be able to access the entire table, while another group should be able to access only a few columns. So I need the subset to live in a separate location in HDFS, with only the required schema and Avro file.


Solution

  • There are multiple ways to do this; I would say the simplest are Hive or Spark. In Hive, you can create a table using a reader schema (containing only the fields you want) and point the table location at your target directory. After that, all you need to do is insert into the reader table from your source table, selecting only the fields you want.

    As a side note, a reader schema on its own is a very good way to avoid data duplication in cases like this. If there is no strict requirement to physically create a subset of your data, I would suggest using reader schemas instead.
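
    The Hive approach described above can be sketched as follows. This is only an illustration, not a tested recipe: the table names, HDFS paths, and the `Employee` schema are hypothetical, and the helper functions `project_schema` and `hive_statements` are made up for this example. The script derives a reader schema from the full Avro schema and prints the two Hive statements you would run, assuming a Hive version that supports `STORED AS AVRO` and the `avro.schema.url` table property.

```python
import json


def project_schema(full_schema, wanted):
    """Return a copy of an Avro record schema that keeps only the wanted fields."""
    by_name = {field["name"]: field for field in full_schema["fields"]}
    missing = [name for name in wanted if name not in by_name]
    if missing:
        raise KeyError(f"fields not in source schema: {missing}")
    # Copy record-level attributes (type, name, namespace, ...) unchanged.
    reader = {key: value for key, value in full_schema.items() if key != "fields"}
    reader["fields"] = [by_name[name] for name in wanted]
    return reader


def hive_statements(source_table, target_table, target_dir, schema_url, wanted):
    """Build the CREATE TABLE and INSERT statements for the column subset."""
    columns = ", ".join(f"`{name}`" for name in wanted)
    create = (
        f"CREATE EXTERNAL TABLE {target_table}\n"
        f"STORED AS AVRO\n"
        f"LOCATION '{target_dir}'\n"
        f"TBLPROPERTIES ('avro.schema.url'='{schema_url}')"
    )
    insert = (
        f"INSERT OVERWRITE TABLE {target_table}\n"
        f"SELECT {columns} FROM {source_table}"
    )
    return [create, insert]


if __name__ == "__main__":
    # Hypothetical full schema of the source table.
    full_schema = {
        "type": "record",
        "name": "Employee",
        "namespace": "example",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "salary", "type": "double"},
        ],
    }
    wanted = ["id", "name"]  # columns the restricted group may see

    # Save this JSON as the .avsc file and upload it to the schema URL below.
    print(json.dumps(project_schema(full_schema, wanted), indent=2))

    for statement in hive_statements(
        "employees",                             # hypothetical source table
        "employees_public",                      # hypothetical target table
        "/data/public/employees",                # hypothetical target directory
        "hdfs:///schemas/employees_public.avsc", # hypothetical schema location
        wanted,
    ):
        print(statement + ";\n")
```

    The generated `.avsc` file has to be placed at the schema URL before the `CREATE` statement is run. Once the `INSERT` has finished, the target directory contains Avro files with only the selected columns, and access to that directory can be granted separately from the source table.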