Search code examples
azurehadoophdfsazure-blob-storagefile-encodings

How to determine file-encoding of file on hdfs (Azure blob storage)?


I've got a bunch of 100GB files on hdfs with mixed file-encodings (unfortunately in Azure blob storage). How can I determine the file encodings of each file? Some dfs command-line command would be ideal. Thanks.


Solution

  • I ended up getting the results I needed by piping the beginning of each file in blob storage to a local buffer and then applying the file unix utility. Here's what the command looks like for an individual file:

    hdfs dfs -cat wasb://container@account.blob.core.windows.net/path/to/file | head -n 10 > buffer; file -i buffer
    

    This gets you something like:

    buffer: text/plain; charset=us-ascii