
hadoop fs -ls storing only the paths to a file


I am looking through a Hadoop file system. With the command

hadoop fs -ls /path/to/dir1*

I can look through every directory that starts with dir1 and list the files they contain.

The output will be something like

Found 1 items
-rw-r--r-- 3 sys_blah_blah  moredate /path/to/dir10/file1.py
Found 1 items
-rw-r--r-- 3 sys_blah_blah  moredate /path/to/dir10/file2.py
Found 1 items
-rw-r--r-- 3 sys_blah_blah  moredate /path/to/dir10/file3.py
Found 1 items
-rw-r--r-- 3 sys_blah_blah  moredate /path/to/dir11/file1.py
Found 1 items
-rw-r--r-- 3 sys_blah_blah  moredate /path/to/dir11/file2.py
...

The only piece of information I am interested in is the path-to-file portion. How can I store only the paths in another file? Ideally, the output would be a file containing nothing but those paths.

Initially, I thought about running the command, storing its output in a file, then parsing that file with a regex to grab the paths and place them into a new file, but that seems unnecessary.


Solution

  • You can make use of grep here, since hadoop fs -ls provides hardly any useful formatting options of its own.

    hadoop fs -ls /path/to/dir1* | grep -oE "/(.*/)?" > outFile.dat
    

    If only one entry for each path is needed, just pass it through uniq, e.g.:

    hadoop fs -ls /path/to/dir1* | grep -oE "/(.*/)?" | uniq > outFile.dat
    

    Looks pretty simple.
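
    One note on what the pattern captures: /(.*/)? matches from the first / up to the last /, so it stores the directory portion of each path rather than the full file path. If the full file paths are what you want, a minimal variant (assuming no column before the path contains a /, as in the output shown above) is to match from the first / to the end of the line:

    # full file path: everything from the first "/" to the end of the line
    hadoop fs -ls /path/to/dir1* | grep -o "/.*" > outFile.dat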
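
    Also note that uniq only collapses adjacent duplicates. That is fine here because hadoop fs -ls lists each directory's files together, but if the list is ever processed in a different order, sort -u is the safer choice:

    # sort -u deduplicates even non-adjacent repeats of the same directory
    hadoop fs -ls /path/to/dir1* | grep -oE "/(.*/)?" | sort -u > outFile.dat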