Tags: java, scala, hadoop, apache-spark, hdfs

How to get absolute paths in Hadoop Filesystem?


I would like to get a list of all files in a directory and its sub-directories in an HDFS filesystem. This is the method I've written to recursively read all the files in a directory:

import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.{FileSystem, Path}

def getAllFiles(dir: Path, fs: FileSystem, recursive: Boolean = true): Seq[Path] = {
  val iter = fs.listFiles(dir, recursive)
  val files = new ListBuffer[Path]()

  while (iter.hasNext()) {
    val p = iter.next().getPath
    files.append(p)
  }
  files
}
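
For context, a minimal way to call this might look like the following sketch; the directory /user/data is just a placeholder for illustration:

import org.apache.hadoop.conf.Configuration

val fs = FileSystem.get(new Configuration())
val allFiles = getAllFiles(new Path("/user/data"), fs)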

The result is a list of org.apache.hadoop.fs.Path elements which I need to process in subsequent steps. Hence, I need the full path. My question is: what is the best way to get the full absolute path?

So far, I use a recursive method to create the path string (Scala):

def fullPath(p: Path): String = {
  if (p.isRoot())
    p.getName
  else
    fullPath(p.getParent) + Path.SEPARATOR + p.getName
}

Is there no more straight-forward way through the Path API?

I've come across question #18034758, but using listFiles() rather than listStatus() seems to be the preferred way to recursively list files in a directory, so that answer seems a bit cumbersome for this use case.


Solution

  • It may not be a good idea to rely on "toString". What if the definition of toString changes? I think it is better to do something like

    path.toUri().getRawPath()
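
    As a sketch of how this could fit with the getAllFiles method from the question (the directory below is an assumption for illustration), each listed Path can be converted to an absolute path string like this:

    // map every listed Path to its raw path string, e.g. "/user/data/part-00000"
    val absolutePaths = getAllFiles(new Path("/user/data"), fs)
      .map(p => p.toUri().getRawPath())

    Note that getRawPath() returns only the path component, without the scheme and authority; if the fully qualified URI (e.g. hdfs://namenode:8020/...) is needed, fs.makeQualified(p) is one option.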