Tags: python, bash, airflow, hdfs

How to count the number of files and their size on a cluster?


How to count the number of files and their total size on a shared cluster when the files are created by different users? That is, one user created 10 files and another created 20; the first user's files take up 2 GB and the second's 4 GB. What bash or Python command can be used to count them?

import os

# List the first field (user names) from the db file, then du each one's
# directory. Note: os.system returns the exit status, not du's output.
allUsers = os.popen('cut -d: -f1 /user/hive/warehouse/yp.db').read().split('\n')[:-1]

for user in allUsers:
    print(os.system('du -s /user/hive/warehouse/yp.db/' + str(user)))

Solution

    Preamble

    Going to make a few assumptions here.

    1. The file you are reading information from, yp.db, is most likely a NIS service map file. Usually those are kept in binary (BDB) format; if that's true, you need to either use db_dump to extract the information first (a minimal sketch follows this list), or (better) find the human-readable source file it was created from. For the purpose of this question, I'll assume that you have that and your file is in a passwd-like format (username:some:other:unrelated:fields:home_directory:shell) and is aptly called human_readable_users_file. Of note, the home directory is the 6th field.

    2. You are trying to list the size of each user's home directory, not some subdirectory which only contains "work" files.

    3. You have access to see all the user files. Depending on the setup, this may mean that (in likelihood order) you need to be root, or that you need to run this on a particular machine which gives you that access. If security is lax, you might just be able to run it as a regular user (that's the least likely one).
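
    As a minimal sketch of the extraction mentioned in point 1 (assuming the map really is a Berkeley DB file and the BDB utilities are installed; the output still needs reshaping into the passwd-like format used below):

    # Dump the binary map's keys and values in printable form for inspection
    db_dump -p /user/hive/warehouse/yp.db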

    One more thing before we start. Avoid using Python to do a bash job. It's inefficient and harder to maintain (people need to understand both languages). When you can, use bash for system-related things, and Python for... well, Python things.

    Finding the user directory

    You can either read the home directories from the user database file if you know the format and it's stable, or ask bash to find it for you in the form ~username.

    We're going to do the former, because it's what you are already trying and it doesn't require parsing the user name field at all.
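
    (For reference, the latter would look something like the sketch below. eval re-parses its argument, which is what triggers the tilde expansion, so only use it with a trusted users file.)

    # Sketch: let bash expand ~username into each user's home directory
    for u in $(cut -d: -f1 /user/hive/warehouse/human_readable_users_file); do
        eval echo "~$u"
    done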

    Finding the total size per user

    du -sch $(cut -d: -f6 /user/hive/warehouse/human_readable_users_file)
    

    This will extract all home directories from human_readable_users_file (the 6th field, separated by :) and feed them to a single du, which prints one summary line per directory (the -s option) and a grand total at the end (the -c option). Finally, it will print human-readable sizes (the -h option), e.g. 45G instead of something like 46721185. You can remove the h if you want the raw block counts for exact calculations later.

    If you have too many users to fit in one command line (bash will complain), you will need to use du with the --files0-from option, which makes it read the list from stdin instead of passing it as command line arguments.
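
    A sketch of that variant (GNU du; tr turns the newline-separated list into the NUL-separated one that --files0-from expects, and - means "read from stdin"):

    cut -d: -f6 /user/hive/warehouse/human_readable_users_file \
        | tr '\n' '\0' \
        | du -sch --files0-from=-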

    Finding the total number of files

    You also mention finding the number of files. du does not do that, but you can use ls and wc -l in addition to du.

    for d in $(cut -d: -f6 /user/hive/warehouse/human_readable_users_file); do echo "$(ls -R1q "$d" | wc -l) $d"; done
    

    (As we are talking about home directories, the above assumes for simplicity that their names do not contain spaces, tabs, newlines or similar shenanigans.)
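
    If you cannot rule out such names, one option is to read the passwd-style file directly and hand each path to find, which sidesteps the word splitting (a sketch; note that find also counts the directory itself, and wc -l still miscounts names containing newlines):

    # Read the 6th field straight from the users file; the while/read loop
    # keeps paths with spaces intact, unlike the $(cut ...) expansion above
    while IFS=: read -r _ _ _ _ _ dir _; do
        printf '%s %s\n' "$(find "$dir" | wc -l)" "$dir"
    done < /user/hive/warehouse/human_readable_users_file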

    Putting it together

    echo -e "File count\tSize\tDirectory"
    for d in $(cut -d: -f6 /user/hive/warehouse/human_readable_users_file); do
        echo -ne "$(ls -R1q "$d" | wc -l)\t"
        du -sh "$d"
    done