Tags: shell, awk, hdfs

How does awk work with an HDFS directory?


I want to run awk over the files in an HDFS directory. Is this workable? I mean a whole directory, not a single file. This awk command works fine locally:

awk 'NR <= 1000 && FNR == 1{print FILENAME}' ./* 

And then I want to combine it with hadoop fs -ls like this:

hadoop fs -ls xxx/* | xargs awk 'NR <= 1000 && FNR == 1{print FILENAME}'

but it fails with: awk: cmd. line:2: fatal: cannot open file `-rwxrwxrwx' for reading (No such file or directory)

I have also tried:

awk 'NR <= 1000 && FNR == 1{print FILENAME}' < hadoop fs -ls xxx/*
awk 'NR <= 1000 && FNR == 1{print FILENAME}' < $(hadoop fs -ls xxx/*)
awk 'NR <= 1000 && FNR == 1{print FILENAME}' $(hadoop fs -ls xxx/*)

Unsurprisingly, these all failed. I assume that awk, when run over the files in a directory, has to open and read each file itself, unlike file content, which can be streamed to awk through a pipe. Am I right? Can anyone give me a workable solution?

Thanks in advance.


Solution

  • It seems to me that you want to access files stored on a Hadoop file system. This is a virtual file system: ordinary tools such as awk cannot open its paths directly, and hadoop fs -ls only gives you the metadata of each file. To operate on the files themselves, one option is to first copy them locally with hadoop fs -get and then work on the local copies. An alternative is to stream each file's content with hadoop fs -cat.

    Normally I would say never parse the output of ls, but with Hadoop you don't have a choice. The output of hadoop fs -ls does not resemble the default output of the Unix/Linux ls command. It is closer to ls -l, with one line per entry in the following layout:

    permissions number_of_replicas userid groupid filesize modification_date modification_time filename
    

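    Filtering this output with awk gives the list of file paths: lines for regular files start with -, which skips directory entries and any "Found N items" header lines, and the path starts at field 8. A while loop can then process the paths:

    c=0   # lines seen so far across all files (plays the role of NR)
    while read -r file; do
       # Mirrors NR <= 1000 && FNR == 1: this file's first line is line c+1
       [ "$c" -lt 1000 ] || break
       echo "${file}"
       nr=$(hadoop fs -cat "${file}" | wc -l)
       ((c+=nr))
    done < <(hadoop fs -ls xxx/* | awk '/^-/{print substr($0,index($0,$8))}')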
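
    The path-extraction step can be tried without a cluster by feeding awk text in the same eight-column layout. This is only a minimal sketch: the listing below (paths, owner, sizes, dates) is made up for illustration, and extract_paths and sample_listing are hypothetical helper names, not Hadoop commands.

    ```shell
    # Hypothetical helper: keep regular-file lines (they start with "-")
    # and print everything from the start of field 8, the path, onward.
    extract_paths() {
      awk '/^-/ { print substr($0, index($0, $8)) }'
    }

    # Made-up listing in the layout described above; not real cluster output.
    sample_listing() {
      printf '%s\n' \
        'Found 3 items' \
        '-rwxrwxrwx   3 alice hadoop 1024 2020-01-01 10:00 /data/xxx/a/part-00000' \
        'drwxr-xr-x   - alice hadoop    0 2020-01-01 10:00 /data/xxx/a/sub' \
        '-rw-r--r--   3 alice hadoop 2048 2020-01-02 11:30 /data/xxx/b/my file.txt'
    }

    # Prints the two regular-file paths, one per line.
    sample_listing | extract_paths
    ```

    Taking the substring from field 8 onward, rather than printing $8 alone, keeps a path that contains spaces in one piece.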
    

    Note: your initial error was caused by the ls -l-style output of hadoop fs -ls. xargs split each line on whitespace, so awk received -rwxrwxrwx, which is the file's permission string, as a file name.
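
The cause of the error in the question can be reproduced without Hadoop. The sketch below passes one made-up line in the hadoop fs -ls layout (the owner and path are illustrative) through xargs and shows the first argument that the command on the other end would receive.

```shell
# One made-up line in the `hadoop fs -ls` layout; not real cluster output.
line='-rwxrwxrwx   3 alice hadoop 1024 2020-01-01 10:00 /data/xxx/a/part-00000'

# xargs splits its input on whitespace, so every column becomes a separate
# argument. Here each argument is printed on its own line and the first kept.
first_arg=$(printf '%s\n' "$line" | xargs -n 1 printf '%s\n' | sed -n '1p')

echo "first argument handed to the command: $first_arg"
```

With awk in place of printf, that first argument is what awk tries to open as a file, which matches the message in the question.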