I've got a bunch of 100GB files on HDFS with mixed file encodings (unfortunately sitting in Azure Blob Storage). How can I determine the encoding of each file? Some hdfs dfs command-line invocation would be ideal. Thanks.
I ended up getting the results I needed by piping the beginning of each file in blob storage into a local buffer and then running the Unix file(1) utility on it. Here's what the command looks like for an individual file:
hdfs dfs -cat wasb://container@account.blob.core.windows.net/path/to/file | head -n 10 > buffer; file -i buffer
This gets you something like:
buffer: text/plain; charset=us-ascii
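Since the question is about a whole set of files, the same trick can be wrapped in a loop. Below is a minimal sketch, not a tested one-liner: it assumes the hdfs CLI is on your PATH and that hdfs dfs -ls -C (Hadoop 2.8+) is available to print bare paths, and the wasb:// URL is just the placeholder from the question. I also swapped head -n for head -c, since a file with few or no newlines could otherwise stream gigabytes through the pipe:

#!/usr/bin/env bash
# Report the detected encoding of every file under a WASB/HDFS directory.
# DIR is the placeholder path from the question -- substitute your own.
DIR='wasb://container@account.blob.core.windows.net/path/to'

hdfs dfs -ls -C "$DIR" | while read -r f; do
    # Grab only the first 64KB of each (potentially 100GB) file.
    # 'hdfs dfs -cat' may log a broken-pipe warning when head exits
    # early; that's harmless.
    hdfs dfs -cat "$f" | head -c 65536 > buffer
    # -b drops the filename from the output, -i prints MIME type and charset.
    echo "$f: $(file -bi buffer)"
done
rm -f buffer

Note that file only inspects the sample you give it, so a file that starts as ASCII but contains, say, UTF-8 further in can be misreported; grabbing a larger prefix reduces that risk at the cost of more data pulled out of blob storage.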