azure hadoop hdfs azure-blob-storage file-encodings

How to determine file-encoding of file on hdfs (Azure blob storage)?

I've got a bunch of 100GB files on hdfs with mixed file-encodings (unfortunately in Azure blob storage). How can I determine the file encodings of each file? Some dfs command-line command would be ideal. Thanks.

Solution

I ended up getting the results I needed by piping the beginning of each file in blob storage to a local buffer and then applying the file unix utility. Here's what the command looks like for an individual file:

hdfs dfs -cat wasb://container@account.blob.core.windows.net/path/to/file | head -n 10 > buffer; file -i buffer

This gets you something like:

buffer: text/plain; charset=us-ascii